NEAR-OPTIMAL DEPLOYMENT EFFICIENCY IN REWARD-FREE REINFORCEMENT LEARNING WITH LINEAR FUNCTION APPROXIMATION

Abstract

We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the reward-free exploration setting. This problem is well motivated because deploying a new policy is costly in real-life RL applications. Under the linear MDP setting with feature dimension $d$ and planning horizon $H$, we propose a new algorithm that collects at most $\widetilde{O}(d^2H^5/\epsilon^2)$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions. To the best of our knowledge, our approach is the first to achieve optimal deployment complexity and optimal $d$ dependence in sample complexity at the same time, even if the reward is known ahead of time. Our novel techniques include an exploration-preserving policy discretization and a generalized G-optimal experiment design, which could be of independent interest. Lastly, we analyze the related problem of regret minimization in low-adaptive RL and provide information-theoretic lower bounds for switching cost and batch complexity.

1. INTRODUCTION

In many practical reinforcement learning (RL) based tasks, limited computing resources hinder the application of fully adaptive algorithms that frequently deploy new exploration policies. Instead, it is usually cheaper to collect data in large batches using the current policy deployment. Take recommendation systems (Afsar et al., 2021) as an instance: the system is able to gather plentiful new data in a very short time, while the deployment of a new policy often takes much longer, as it requires extensive computing and human resources. Therefore, it is impractical to switch the policy based on instantaneous data as a typical RL algorithm would demand. A feasible alternative is to run a large batch of experiments in parallel and only decide whether to update the policy after the whole batch is complete. The same constraint also appears in other RL applications such as healthcare (Yu et al., 2021), robotics (Kober et al., 2013) and new material design (Zhou et al., 2019). In those scenarios, the agent needs to minimize the number of policy deployments while learning a good policy using (nearly) the same number of trajectories as its fully adaptive counterparts.

On the empirical side, Matsushima et al. (2020) first proposed the notion of deployment efficiency. Later, Huang et al. (2022) formally defined deployment complexity. Briefly speaking, deployment complexity measures the number of policy deployments while requiring each deployment to have a similar size. We measure the adaptivity of our algorithms via deployment complexity and leave its formal definition to Section 2. Toward deployment efficiency, the recent work by Qiao et al. (2022) designed an algorithm that solves reward-free exploration within $O(H)$ deployments. However, their sample complexity $\widetilde{O}(|S|^2|A|H^5/\epsilon^2)$, although near-optimal under the tabular setting, can be unacceptably large in real-life applications where the state space is enormous or continuous.
For environments with a large state space, function approximation is necessary for representing the features of each state. Among existing works that study function approximation in RL, linear function approximation is arguably the simplest yet most fundamental setting. In this paper, we study deployment-efficient RL with linear function approximation under the reward-free setting, and we consider the following question:

Question 1.1. Is it possible to design deployment-efficient and sample-efficient reward-free RL algorithms with linear function approximation?

Table 1: Comparison of our results (in blue) to existing work regarding sample complexity and deployment complexity. We highlight that our results match the best known results for both sample complexity and deployment complexity at the same time. (Only the lower-bound rows survive below.)

    Algorithm                                Sample complexity‡           Deployment complexity
    Lower bound (Wagenmaker et al., 2022b)   $\Omega(d^2H^2/\epsilon^2)$  N.A.
    Lower bound (Huang et al., 2022)         If polynomial sample         $\Omega(H)$

‡: We ignore the lower-order terms in sample complexity for simplicity. *: $\nu_{\min}$ is the problem-dependent reachability coefficient, which is upper bounded by 1 and can be arbitrarily small. †: This work is done under tabular MDPs, and we translate the $O(HSA)$ switching cost into $2H$ deployments. ⋆: When our algorithms are applied under tabular MDPs, we can replace one $d$ in the sample complexity by $S$.

Our contributions. In this paper, we answer the above question affirmatively by constructing an algorithm with near-optimal deployment and sample complexities. Our contributions are threefold.

• A new layer-by-layer type algorithm (Algorithm 1) for reward-free RL that achieves a deployment complexity of $H$ and a sample complexity of $\widetilde{O}(d^2H^5/\epsilon^2)$. Our deployment complexity is optimal, while our sample complexity has optimal dependence on $d$ and $\epsilon$. In addition, when applied to tabular MDPs, our sample complexity (Theorem 7.1) recovers the best known result $\widetilde{O}(S^2AH^5/\epsilon^2)$.
• We generalize G-optimal design and select a near-optimal policy via uniform policy evaluation on a finite set of representative policies, instead of using optimism and LSVI. This technique helps tighten our sample complexity and may be of independent interest.
• We show that "no optimal-regret online learner can be deployment efficient", i.e., deployment efficiency is incompatible with the highly relevant regret-minimization setting. For regret minimization under linear MDPs, we present lower bounds (Theorems 7.2 and 7.3) for other measurements of adaptivity: switching cost and batch complexity.

1.1. CLOSELY RELATED WORKS

There is a large and growing body of literature on the statistical theory of reinforcement learning that we will not attempt to thoroughly review. Detailed comparisons with existing work on reward-free RL (Wang et al., 2020; Zanette et al., 2020b; Wagenmaker et al., 2022b; Huang et al., 2022; Qiao et al., 2022) are given in Table 1. For more discussion of relevant literature, please refer to Appendix A and the references therein. Notably, all existing algorithms under linear MDPs either admit a fully adaptive structure (which leads to deployment inefficiency) or suffer from sub-optimal sample complexity. In addition, when applied to tabular MDPs, our algorithm has the same sample complexity and slightly better deployment complexity compared to Qiao et al. (2022).

The deployment-efficient setting differs from other measurements of adaptivity. The low-switching setting (Bai et al., 2019) restricts the number of policy updates, but the agent can decide whether to update the policy after collecting every single trajectory, which can be difficult to implement in practical applications. A more relevant setting, batched RL (Zhang et al., 2022), requires decisions about policy changes to be made at only a few (often predefined) checkpoints. Compared to batched RL, the requirement of deployment efficiency is stronger in that each deployment must collect the same number of trajectories. Therefore, deployment-efficient algorithms are easier to deploy in parallel (see, e.g., Huang et al., 2022, for a more elaborate discussion). Lastly, we remark that our algorithms also work under the batched RL setting by running in $H$ batches.

Technically, our method is inspired by optimal experiment design, a well-developed research area in statistics. In particular, a major technical contribution of this paper is to solve a variant of G-optimal experiment design while simultaneously solving exploration in RL. Zanette et al. (2020b); Wagenmaker et al. (2022b) choose policies through online experiment design, i.e., running no-regret online learners to select policies adaptively for approximating the optimal design. Those online approaches, however, cannot be applied to our problem due to the requirement of deployment efficiency. To achieve a deployment complexity of $H$, we can only deploy one policy for each layer, so we must decide on that policy based on sufficient exploration of the previous layers only. Therefore, our approach requires offline experiment design, which raises substantial technical challenges.

A remark on technical novelty. The general idea behind previous RL algorithms with low adaptivity is optimism together with a doubling schedule for updating policies, which originates from UCB2 (Auer et al., 2002). The doubling schedule, however, cannot provide optimal deployment complexity. Different from those approaches, we apply layer-by-layer exploration to achieve the optimal deployment complexity, and our approach is highly non-trivial. Since we can only deploy one policy for each layer, there are two problems to be solved: whether there exists a single policy that can explore all directions of a specific layer, and how to find such a policy. We generalize G-optimal design to show the existence of such an explorative policy. Besides, we apply an exploration-preserving policy discretization to approximate our generalized G-optimal design. We leave detailed discussions of these techniques to Section 3.

2. PROBLEM SETUP

Notations. Throughout the paper, for $n \in \mathbb{Z}^+$, $[n] = \{1, 2, \cdots, n\}$. We denote $\|x\|_\Lambda = \sqrt{x^\top \Lambda x}$. For a matrix $X \in \mathbb{R}^{d\times d}$, $\|\cdot\|_2$, $\|\cdot\|_F$, $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ denote the operator norm, Frobenius norm, smallest eigenvalue and largest eigenvalue, respectively. For a policy $\pi$, $\mathbb{E}_\pi$ and $\mathbb{P}_\pi$ denote the expectation and probability measure induced by $\pi$ under the MDP we consider. For any set $U$, $\Delta(U)$ denotes the set of all possible distributions over $U$. In addition, we use standard notations such as $O$ and $\Omega$ to absorb constants, while $\widetilde{O}$ and $\widetilde{\Omega}$ further suppress logarithmic factors.

Markov Decision Processes. We consider finite-horizon episodic Markov Decision Processes (MDPs) with non-stationary transitions, denoted by a tuple $M = (S, A, H, P_h, r_h)$ (Sutton & Barto, 1998), where $S$ is the state space, $A$ is the action space and $H$ is the horizon. The non-stationary transition kernel has the form $P_h : S \times A \times S \to [0,1]$, with $P_h(s'|s,a)$ representing the probability of transitioning from state $s$ with action $a$ to the next state $s'$ at time step $h$. In addition, $r_h(s,a) \in \Delta([0,1])$ denotes the corresponding reward distribution. Without loss of generality, we assume there is a fixed initial state $s_1$. A policy can be seen as a series of mappings $\pi = (\pi_1, \cdots, \pi_H)$, where each $\pi_h$ maps a state $s \in S$ to a probability distribution over actions, i.e., $\pi_h : S \to \Delta(A)$ for all $h \in [H]$. A random trajectory $(s_1, a_1, r_1, \cdots, s_H, a_H, r_H, s_{H+1})$ is generated by the following rule: $s_1$ is fixed, $a_h \sim \pi_h(\cdot|s_h)$, $r_h \sim r_h(s_h, a_h)$, $s_{h+1} \sim P_h(\cdot|s_h, a_h)$ for all $h \in [H]$.

Q-values, Bellman (optimality) equations. Given a policy $\pi$ and any $h \in [H]$, the value function $V^\pi_h(\cdot)$ and Q-value function $Q^\pi_h(\cdot,\cdot)$ are defined as
$$V^\pi_h(s) = \mathbb{E}_\pi\Big[\textstyle\sum_{t=h}^H r_t \,\Big|\, s_h = s\Big], \quad Q^\pi_h(s,a) = \mathbb{E}_\pi\Big[\textstyle\sum_{t=h}^H r_t \,\Big|\, s_h = s, a_h = a\Big], \quad \forall\, (s,a) \in S \times A.$$
The value function and Q-value function with respect to the optimal policy $\pi^\star$ are denoted by $V^\star_h(\cdot)$ and $Q^\star_h(\cdot,\cdot)$. The Bellman (optimality) equations then read, for all $h \in [H]$:
$$Q^\pi_h(s,a) = r_h(s,a) + P_h(\cdot|s,a) V^\pi_{h+1}, \qquad V^\pi_h = \mathbb{E}_{a\sim\pi_h}[Q^\pi_h],$$
$$Q^\star_h(s,a) = r_h(s,a) + P_h(\cdot|s,a) V^\star_{h+1}, \qquad V^\star_h = \max_a Q^\star_h(\cdot,a).$$
In this work, we consider the reward-free RL setting, where there may be different reward functions. Therefore, we denote the value function of policy $\pi$ with respect to reward $r$ by $V^\pi(r)$. Similarly, $V^\star(r)$ denotes the optimal value under reward function $r$. We say that a policy $\pi$ is $\epsilon$-optimal with respect to $r$ if $V^{\pi}(r) \ge V^\star(r) - \epsilon$.

Linear MDP (Jin et al., 2020b).
An episodic MDP $(S, A, H, P, r)$ is a linear MDP with known feature map $\phi : S \times A \to \mathbb{R}^d$ if there exist $H$ unknown signed measures $\mu_h \in \mathbb{R}^d$ over $S$ and $H$ unknown reward vectors $\theta_h \in \mathbb{R}^d$ such that
$$P_h(s'|s,a) = \langle \phi(s,a), \mu_h(s')\rangle, \quad r_h(s,a) = \langle \phi(s,a), \theta_h\rangle, \quad \forall\,(h,s,a,s') \in [H]\times S\times A\times S.$$
Without loss of generality, we assume $\|\phi(s,a)\|_2 \le 1$ for all $(s,a)$, and $\|\mu_h(S)\|_2 \le \sqrt{d}$, $\|\theta_h\|_2 \le \sqrt{d}$ for all $h \in [H]$. For a policy $\pi$, we define $\Lambda_{\pi,h} := \mathbb{E}_\pi[\phi(s_h,a_h)\phi(s_h,a_h)^\top]$, the expected covariance matrix with respect to policy $\pi$ at time step $h$ (here $(s_h, a_h)$ follows the distribution induced by policy $\pi$). Let $\lambda^\star = \min_{h\in[H]} \sup_\pi \lambda_{\min}(\Lambda_{\pi,h})$. We make the following assumption regarding explorability.

Assumption 2.1 (Explorability of all directions). The linear MDP we consider satisfies $\lambda^\star > 0$.

We remark that Assumption 2.1 only requires the existence of a (possibly non-Markovian) policy that visits all directions of each layer, and it is analogous to other explorability assumptions in papers on RL under linear representations (Zanette et al., 2020b; Huang et al., 2022; Wagenmaker & Jamieson, 2022). In addition, the parameter $\lambda^\star$ only appears in lower-order terms of the sample complexity bound, and our algorithms do not take $\lambda^\star$ as an input.

Reward-Free RL. The reward-free RL setting contains two phases: the exploration phase and the planning phase. Different from the PAC RL setting, the learner does not observe rewards during the exploration phase. Moreover, during the planning phase, the learner has to output a near-optimal policy for any valid reward function. More specifically, the procedure is:
1. Exploration phase: Given accuracy $\epsilon$ and failure probability $\delta$, the learner explores the MDP for $K(\epsilon,\delta)$ episodes and collects the trajectories without rewards, $\{s^k_h, a^k_h\}_{(h,k)\in[H]\times[K]}$.
2. Planning phase: The learner outputs a function $\pi(\cdot)$ which takes a reward function as input.
The function $\pi(\cdot)$ must satisfy that for any valid reward function $r$, $V^{\pi(r)}(r) \ge V^\star(r) - \epsilon$. The goal of reward-free RL is to design a procedure that satisfies the above guarantee with probability at least $1-\delta$ while collecting as few episodes as possible. By definition, any procedure satisfying the above guarantee is provably efficient for the PAC RL setting.

Deployment Complexity. In this work, we measure the adaptivity of our algorithm through deployment complexity, defined as follows.

Definition 2.2 (Deployment complexity (Huang et al., 2022)). We say that an algorithm has a deployment complexity of $M$ if the algorithm is guaranteed to finish running within $M$ deployments. In addition, the algorithm is only allowed to collect at most $N$ trajectories during each deployment, where $N$ is fixed a priori and cannot change adaptively.

We consider the deployment of non-Markovian policies (i.e., mixtures of deterministic policies) (Huang et al., 2022). The requirement of deployment efficiency is stronger than batched RL (Zhang et al., 2022) or low-switching RL (Bai et al., 2019), which makes deployment-efficient algorithms more practical in real-life applications. For a detailed comparison between these definitions, please refer to Section 1.1 and Appendix A.

3. TECHNIQUE OVERVIEW

In order to achieve the optimal deployment complexity of $H$, we apply layer-by-layer exploration. More specifically, we construct a single policy $\pi_h$ to explore layer $h$ based on previous data. Following the general methods in reward-free RL (Wang et al., 2020; Wagenmaker et al., 2022b), we do exploration through minimizing uncertainty. As will be made clear in the analysis, given an exploration dataset $D = \{s^n_h, a^n_h\}_{(h,n)\in[H]\times[N]}$, the uncertainty of layer $h$ with respect to a policy $\pi$ can be characterized by $\mathbb{E}_\pi \|\phi(s_h,a_h)\|_{\Lambda_h^{-1}}$, where $\Lambda_h = I + \sum_{n=1}^N \phi(s^n_h,a^n_h)\phi(s^n_h,a^n_h)^\top$ is the (regularized and unnormalized) empirical covariance matrix. Note that although we cannot directly optimize $\Lambda_h$, we can maximize its expectation $N_{\pi_h} \cdot \mathbb{E}_{\pi_h}[\phi_h \phi_h^\top]$ (where $N_{\pi_h}$ is the number of trajectories collected with $\pi_h$) by optimizing the policy $\pi_h$. Therefore, to minimize the uncertainty with respect to some policy set $\Pi$, we search for an explorative policy $\pi_0$ that minimizes
$$\max_{\pi\in\Pi}\ \mathbb{E}_\pi\big[\phi(s_h,a_h)^\top \big(\mathbb{E}_{\pi_0}[\phi_h\phi_h^\top]\big)^{-1} \phi(s_h,a_h)\big].$$
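To make the uncertainty quantity concrete, here is a small numerical sketch (our own illustration, not the paper's code): we build the regularized, unnormalized empirical covariance $\Lambda_h$ from synthetic feature samples and Monte-Carlo estimate $\mathbb{E}_\pi \|\phi(s_h,a_h)\|_{\Lambda_h^{-1}}$ for a hypothetical target policy, checking that more data in the relevant directions shrinks the uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 4, 500

# Synthetic exploration data for one layer: feature vectors phi(s_h^n, a_h^n).
Phi = rng.normal(size=(N, d)) / np.sqrt(d)

# Regularized, unnormalized empirical covariance: Lambda_h = I + sum_n phi_n phi_n^T.
Lambda_h = np.eye(d) + Phi.T @ Phi

def uncertainty(features, Lam):
    """Monte-Carlo estimate of E_pi ||phi(s_h,a_h)||_{Lam^{-1}},
    given feature samples drawn from the distribution induced by pi."""
    inv = np.linalg.inv(Lam)
    return np.mean(np.sqrt(np.einsum('nd,de,ne->n', features, inv, features)))

# Features sampled from a hypothetical target policy.
target_feats = rng.normal(size=(1000, d)) / np.sqrt(d)
u = uncertainty(target_feats, Lambda_h)

# A 10x larger dataset in the same directions strictly reduces the uncertainty.
u_big = uncertainty(target_feats, np.eye(d) + 10 * Phi.T @ Phi)
assert 0 < u_big < u
```

The same quantity, summed over layers, is exactly what drives the value-estimation error bounds later in the paper.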

3.1. GENERALIZED G-OPTIMAL DESIGN

For the minimization problem above, traditional G-optimal design handles the case where each deterministic policy $\pi$ generates some fixed $\phi_\pi$ at layer $h$ with probability 1 (i.e., we directly choose $\phi$ instead of choosing $\pi$), as is the case under deterministic MDPs. However, traditional G-optimal design cannot tackle our problem: under a general linear MDP, each $\pi$ generates a distribution over the feature space instead of a single feature vector. We generalize G-optimal design and show that for any policy set $\Pi$, the following Theorem 3.1 holds. More details are deferred to Appendix B.

Theorem 3.1 (Informal version of Theorem B.1). If there exists a policy $\pi_0 \in \Delta(\Pi)$ such that $\lambda_{\min}(\mathbb{E}_{\pi_0}[\phi_h\phi_h^\top]) > 0$, then
$$\min_{\pi_0\in\Delta(\Pi)}\ \max_{\pi\in\Pi}\ \mathbb{E}_\pi\big[\phi(s_h,a_h)^\top \big(\mathbb{E}_{\pi_0}[\phi_h\phi_h^\top]\big)^{-1} \phi(s_h,a_h)\big] \le d.$$

Generally speaking, Theorem 3.1 states that for any $\Pi$, there exists a single policy in $\Delta(\Pi)$ (i.e., a mixture of several policies in $\Pi$) that can efficiently reduce the uncertainty with respect to $\Pi$. Therefore, if we want to minimize the uncertainty with respect to $\Pi$ and we can derive the solution $\pi_0$ of the minimization above, we can simply run $\pi_0$ repeatedly for several episodes. However, there are two gaps between Theorem 3.1 and our goal of reward-free RL. First, under the reinforcement learning setting, the association between a policy $\pi$ and the corresponding distribution of $\phi_h$ is unknown, which means we need to approximate the minimization above. This can be done by estimating the two expectations, and we leave the discussion to Section 3.3. The second gap concerns choosing an appropriate $\Pi$ in Theorem 3.1, for which a natural idea is to use the set of all policies. It is, however, infeasible to simultaneously estimate the expectations accurately for all $\pi$: the set of all policies is infinite, and $\Delta(\{\text{all policies}\})$ is even bigger, so it seems intractable to control its complexity using existing uniform convergence techniques (e.g., a covering number argument).
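As an illustration of how such a design can be computed when the policy set is finite and each policy's expected feature covariance $M_\pi = \mathbb{E}_\pi[\phi_h\phi_h^\top]$ is known (a strong assumption made only for this sketch; in the paper these matrices must themselves be estimated from data), one can run Frank-Wolfe on the D-optimal objective $\log\det(\sum_i \rho_i M_i)$. The first-order optimality condition generalizes the Kiefer-Wolfowitz equivalence: at the optimum, $\mathrm{tr}(\Sigma(\rho)^{-1} M_i) \le d$ for every $i$, which is exactly the $\le d$ bound in Theorem 3.1. All names below are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_pol = 3, 10

# Each "policy" is summarized by its expected feature covariance M_i = E_pi[phi phi^T]
# (rank-deficient individually, full-rank in mixture), normalized so trace(M_i) = 1.
Ms = []
for _ in range(n_pol):
    A = rng.normal(size=(d, 2))
    M = A @ A.T
    Ms.append(M / np.trace(M))
Ms = np.array(Ms)

rho = np.full(n_pol, 1.0 / n_pol)      # mixture weights over the policy set
for t in range(5000):
    Sigma = np.einsum('i,ijk->jk', rho, Ms)
    g = np.array([np.trace(np.linalg.solve(Sigma, M)) for M in Ms])
    i_star = int(np.argmax(g))          # Frank-Wolfe vertex: most "uncertain" policy
    gamma = 2.0 / (t + 2)
    rho = (1 - gamma) * rho
    rho[i_star] += gamma

# At (near-)optimality, max_i tr(Sigma^{-1} M_i) is close to d; it is always >= d
# because the rho-weighted average of the g_i equals tr(Sigma^{-1} Sigma) = d.
Sigma = np.einsum('i,ijk->jk', rho, Ms)
g = np.array([np.trace(np.linalg.solve(Sigma, M)) for M in Ms])
assert d - 1e-6 <= g.max() <= d + 0.5
```

Running the returned mixture for $N$ episodes then keeps every candidate policy's uncertainty at roughly $d/N$, which is the role the design plays inside the exploration algorithm.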

3.2. DISCRETIZATION OF POLICY SET

The key realization towards a solution of the above problem is that we do not need to consider the set of all policies; it suffices to consider a smaller subset $\Pi$ that is more amenable to an $\epsilon$-net argument. This set needs to satisfy a few conditions. (1) Due to the condition in Theorem 3.1, $\Pi$ should contain explorative policies covering all directions. (2) $\Pi$ should contain a representative policy set $\Pi^{\mathrm{eval}}$ that includes a near-optimal policy for any reward function. (3) Since we apply offline experiment design by approximating the expectations, $\Pi$ must be "small" enough for a uniform-convergence argument to work.

We show that we can construct a finite set $\Pi$ with $|\Pi|$ small enough while satisfying Conditions (1) and (2). More specifically, given the feature map $\phi(\cdot,\cdot)$ and the desired accuracy $\epsilon$, we can construct an explorative policy set $\Pi^{\mathrm{exp}}_\epsilon$ such that $\log(|\Pi^{\mathrm{exp}}_{\epsilon,h}|) \le O(d^2\log(1/\epsilon))$, where $\Pi^{\mathrm{exp}}_{\epsilon,h}$ is the policy set for layer $h$. In addition, when $\epsilon$ is small compared to $\lambda^\star$, we have $\sup_{\pi\in\Delta(\Pi^{\mathrm{exp}}_\epsilon)} \lambda_{\min}(\mathbb{E}_\pi[\phi_h\phi_h^\top]) \ge \Omega\big(\frac{(\lambda^\star)^2}{d}\big)$, which verifies Condition (1). Plugging in $\Pi^{\mathrm{exp}}_\epsilon$ and approximating the minimization problem, after the exploration phase we will be able to estimate the value functions of all $\pi \in \Pi^{\mathrm{exp}}_\epsilon$ accurately.

It remains to check Condition (2) by formalizing the representative policy set discussed above. From $\Pi^{\mathrm{exp}}_\epsilon$, we can further select a subset, which we call the policies to evaluate: $\Pi^{\mathrm{eval}}_\epsilon$. It satisfies $\log(|\Pi^{\mathrm{eval}}_{\epsilon,h}|) = O(d\log(1/\epsilon))$, while for any possible linear MDP with feature map $\phi(\cdot,\cdot)$, $\Pi^{\mathrm{eval}}_\epsilon$ is guaranteed to contain one $\epsilon$-optimal policy. As a result, it suffices to estimate the value functions of all policies in $\Pi^{\mathrm{eval}}_\epsilon$ and output the greedy one with the largest estimated value.
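The covering-number arithmetic behind Condition (3) can be illustrated in isolation (this toy sketch is ours and is not the paper's actual construction of $\Pi^{\mathrm{exp}}_\epsilon$): a grid with spacing $2\delta/\sqrt{d}$ is a $\delta$-net of the unit ball in $\mathbb{R}^d$, and its size scales like $(c/\delta)^d$, so the log-covering number is $O(d\log(1/\delta))$ — small enough for a union bound over the net.

```python
import numpy as np
from itertools import product

d, delta = 2, 0.1

# Grid spacing s = 2*delta/sqrt(d): any point is within s*sqrt(d)/2 = delta
# of its coordinate-wise nearest grid point.
spacing = 2 * delta / np.sqrt(d)
ticks = np.arange(-1, 1 + spacing, spacing)
net = np.array([p for p in product(ticks, repeat=d)
                if np.linalg.norm(p) <= 1 + delta])  # keep points near the ball

# Check the covering property on random points of the unit ball.
rng = np.random.default_rng(2)
pts = rng.normal(size=(1000, d))
pts = pts / np.maximum(1.0, np.linalg.norm(pts, axis=1, keepdims=True))
dists = np.min(np.linalg.norm(pts[:, None, :] - net[None, :, :], axis=2), axis=1)
assert dists.max() <= delta + 1e-9   # every point is delta-covered
```

The paper's sets $\Pi^{\mathrm{exp}}_\epsilon$ and $\Pi^{\mathrm{eval}}_\epsilon$ play the same role at the level of policies, with $O(d^2)$ and $O(d)$ parameters per layer, respectively.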

3.3. NEW APPROACH TO ESTIMATE VALUE FUNCTION

Now that we have a discrete policy set, we still need to estimate the two expectations in Theorem 3.1. We design a new algorithm (Algorithm 4; details can be found in Appendix E) based on the technique of LSVI (Jin et al., 2020b) to estimate $\mathbb{E}_\pi[r(s_h,a_h)]$ given a policy $\pi$, a reward $r$ and exploration data. Algorithm 4 can estimate the expectations accurately and simultaneously for all $\pi \in \Pi^{\mathrm{exp}}$ and all $r$ that appear in the minimization problem, given sufficient exploration of the first $h-1$ layers. Therefore, under our layer-by-layer exploration approach, after adequate exploration of the first $h-1$ layers, Algorithm 4 provides accurate estimates of $\mathbb{E}_{\pi_0}[\phi_h\phi_h^\top]$ and $\mathbb{E}_\pi[\phi(s_h,a_h)^\top (\widehat{\mathbb{E}}_{\pi_0}[\phi_h\phi_h^\top])^{-1}\phi(s_h,a_h)]$. As a result, the optimization problem (1) we solve serves as an accurate approximation of the minimization problem in Theorem 3.1, and the solution $\pi_h$ of (1) is provably efficient for exploration.

Finally, after sufficient exploration of all $H$ layers, the last step is to estimate the value functions of all policies in $\Pi^{\mathrm{eval}}$. We design a slightly different algorithm (Algorithm 3; details in Appendix D) for this purpose. Based on LSVI, Algorithm 3 takes $\pi \in \Pi^{\mathrm{eval}}$ and a reward function $r$ as input, and estimates $V^\pi(r)$ accurately given sufficient exploration of all $H$ layers.
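To convey the flavor of such an LSVI-based estimator, here is a minimal sketch under our own assumptions (this is not the actual Algorithm 4): we embed a small tabular MDP as a linear MDP via one-hot features, so the linear-MDP condition holds exactly, and estimate $\mathbb{E}_\pi[r(s_h,a_h)]$ by a backward least-squares recursion over off-policy exploration data, comparing against exact dynamic programming.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, H = 3, 2, 3
d = S * A

def phi(s, a):                      # one-hot feature: tabular MDP as a linear MDP
    v = np.zeros(d); v[s * A + a] = 1.0
    return v

P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'], same at every layer
r = rng.uniform(size=(S, A))                 # reward whose layer-h expectation we want
pi = rng.dirichlet(np.ones(A), size=S)       # target policy pi(a|s)
h = 2                                        # estimate E_pi[r(s_h, a_h)], 0-indexed layer

# Exploration data from a uniform behavior policy (off-policy for pi).
N = 20000
traj = np.empty((N, H, 2), dtype=int)
for n in range(N):
    s = 0
    for t in range(H):
        a = rng.integers(A)
        traj[n, t] = (s, a)
        s = rng.choice(S, p=P[s, a])

# LSVI backward recursion: V_h(s) = E_{a~pi}[r(s,a)]; then for t < h regress
# V_{t+1}(s_{t+1}) on phi(s_t, a_t) to get Q_t, and set V_t(s) = E_{a~pi}[Q_t(s,a)].
V = (pi * r).sum(axis=1)
for t in range(h - 1, -1, -1):
    Phi_t = np.array([phi(*traj[n, t]) for n in range(N)])
    nxt = np.array([traj[n, t + 1][0] for n in range(N)])
    Lam = 1e-6 * np.eye(d) + Phi_t.T @ Phi_t
    w = np.linalg.solve(Lam, Phi_t.T @ V[nxt])
    V = (pi * w.reshape(S, A)).sum(axis=1)
estimate = V[0]                              # fixed initial state s_1 = 0

# Ground truth by forward dynamic programming over occupancy measures.
dist = np.zeros(S); dist[0] = 1.0
for t in range(h):
    dist = np.einsum('s,sa,sap->p', dist, pi, P)
truth = np.einsum('s,sa,sa->', dist, pi, r)
assert abs(estimate - truth) < 0.1
```

The same recursion with the matrix-valued "reward" $\phi\phi^\top$ (handled coordinate-wise) yields the covariance estimates $\widehat{\Sigma}_\pi$ used in the exploration phase.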

4. ALGORITHMS

In this section, we present our main algorithms. The algorithm for the exploration phase is Algorithm 1, which formalizes the ideas in Section 3, while the planning phase is presented in Algorithm 2.

Algorithm 1 Layer-by-Layer Reward-Free Exploration (Exploration Phase)
1: Input: accuracy $\epsilon$, failure probability $\delta$. Set $\iota$, $\varepsilon$ and $N = \frac{C_2 d\iota}{\varepsilon^2} = \frac{C_2 d^2 H^4 \iota^3 C_1^2}{\epsilon^2}$.
2: Dataset $D = \emptyset$.
3: for $h = 1, 2, \cdots, H$ do
4:   Solve the following optimization problem:
5:   $\pi_h = \operatorname{argmin}_{\pi\in\Delta(\Pi^{\mathrm{exp}}_{\epsilon/3})\ \text{s.t.}\ \lambda_{\min}(\widehat{\Sigma}_\pi)\ge C_3 d^2 H\varepsilon\iota}\ \max_{\pi'\in\Pi^{\mathrm{exp}}_{\epsilon/3}} \widehat{\mathbb{E}}_{\pi'}\big[\phi(s_h,a_h)^\top (N\cdot\widehat{\Sigma}_\pi)^{-1}\phi(s_h,a_h)\big]$,  (1)
6:   where $\widehat{\Sigma}_\pi = \mathrm{EstimateER}(\pi,\ \phi(s,a)\phi(s,a)^\top,\ A=1,\ h,\ D,\ s_1)$ estimates $\mathbb{E}_\pi[\phi(s_h,a_h)\phi(s_h,a_h)^\top]$,
7:   and $\widehat{\mathbb{E}}_{\pi'}\big[\phi(s_h,a_h)^\top (N\cdot\widehat{\Sigma}_\pi)^{-1}\phi(s_h,a_h)\big] = \mathrm{EstimateER}(\pi',\ \phi(s,a)^\top(N\cdot\widehat{\Sigma}_\pi)^{-1}\phi(s,a),\ A=\frac{\varepsilon}{C_2 d^3 H\iota^2},\ h,\ D,\ s_1)$.
8:   Deploy $\pi_h$ for $N$ episodes and add the collected trajectories to $D$.
9: end for

Exploration Phase. We apply layer-by-layer exploration, and $\pi_h$ is the stochastic policy we deploy to explore layer $h$. To solve for $\pi_h$, we approximate the generalized G-optimal design via (1). For each candidate $\pi$ and $\pi'$, we estimate the two expectations by calling EstimateER (Algorithm 4, details in Appendix E). EstimateER is a generic subroutine for estimating the value function under a particular reward design. We estimate the two expectations of interest by carefully choosing one specific reward design for each coordinate separately, so that the resulting value function provides an estimate of the desired quantity in that coordinate. As mentioned above and as will be made clear in the analysis, given adequate exploration of the first $h-1$ layers, all estimates are accurate and the surrogate policy $\pi_h$ is sufficiently explorative in all directions at layer $h$. The restriction on $\lambda_{\min}(\widehat{\Sigma}_\pi)$ is for technical reasons only, and we will show that under the assumption in Theorem 5.1, there exists a valid solution of (1). Lastly, we remark that solving (1) is inefficient in general; detailed discussions about computation are deferred to Section 7.2.

Algorithm 2 Find Near-Optimal Policy Given Reward Function (Planning Phase)
1: Input: reward function $r$, dataset $D$.
2: Construct the evaluation policy set $\Pi^{\mathrm{eval}}_{\epsilon/3}$.
3: for $\pi \in \Pi^{\mathrm{eval}}_{\epsilon/3}$ do
4:   $\widehat{V}^\pi(r) = \mathrm{EstimateV}(\pi, r, D, s_1)$. // Estimate value functions using Algorithm 3.
5: end for
6: $\widehat{\pi} = \arg\max_{\pi\in\Pi^{\mathrm{eval}}_{\epsilon/3}} \widehat{V}^\pi(r)$. // Output the greedy policy w.r.t. $\widehat{V}^\pi(r)$.

Planning Phase. The output dataset $D$ from the exploration phase contains sufficient information for the planning phase. In the planning phase (Algorithm 2), we construct a set of policies to evaluate and repeatedly apply Algorithm 3 (in Appendix D) to estimate the value function of each policy given the reward function. Finally, Algorithm 2 outputs the policy with the highest estimated value. Since $D$ has acquired sufficient information, all estimates in line 4 are accurate. Together with the property that $\Pi^{\mathrm{eval}}_{\epsilon/3}$ contains a near-optimal policy, the output $\widehat{\pi}$ is near-optimal.

5. MAIN RESULTS

In this section, we state our main results, which formalize the techniques and algorithmic ideas discussed in the previous sections.

Theorem 5.1. We run Algorithm 1 to collect data and let Planning(·) denote the output of Algorithm 2. There exist universal constants $C_1, C_2, C_3, C_4 > 0$ such that for any failure probability $\delta > 0$ and any accuracy $\epsilon$ satisfying $0 < \epsilon < \frac{H(\lambda^\star)^2}{C_4 d^{7/2}\log(1/\lambda^\star)}$, with probability $1-\delta$, for any feasible linear reward function $r$, Planning($r$) returns a policy that is $\epsilon$-optimal with respect to $r$. In addition, the deployment complexity of Algorithm 1 is $H$, while the number of trajectories is $\widetilde{O}(\frac{d^2H^5}{\epsilon^2})$.

The proof of Theorem 5.1 is sketched in Section 6, with details in the Appendix. Below we discuss some interesting aspects of our results.

Near-optimal deployment efficiency. First, the deployment complexity of our Algorithm 1 is optimal up to a log factor among all reward-free algorithms with polynomial sample complexity, according to an $\Omega(H/\log_d(NH))$ lower bound (Theorem B.3 of Huang et al. (2022)). In comparison, the deployment complexity of RFLIN (Wagenmaker et al., 2022b) can be as large as their sample complexity (also $\widetilde{O}(d^2H^5/\epsilon^2)$) in the worst case.

Near-optimal sample complexity. Secondly, our sample complexity matches the best-known sample complexity $\widetilde{O}(d^2H^5/\epsilon^2)$ (Wagenmaker et al., 2022b) of reward-free RL even when deployment efficiency is not required. It is also optimal in the parameters $d$ and $\epsilon$ up to lower-order terms, when compared against the lower bound of $\Omega(d^2H^2/\epsilon^2)$ (Theorem 2 of Wagenmaker et al. (2022b)).

Dependence on $\lambda^\star$. A striking difference between our result and the closest existing work (Huang et al., 2022) is that our sample complexity is independent of the explorability parameter $\lambda^\star$ in the small-$\epsilon$ regime. This is highly desirable because we only require a non-zero $\lambda^\star$ to exist, and a smaller $\lambda^\star$ does not affect the sample complexity asymptotically. In addition, our algorithm does not take $\lambda^\star$ as an input (although we admit that the theoretical guarantee only holds when $\epsilon$ is small compared to $\lambda^\star$). In contrast, the best existing result (Algorithm 2 of Huang et al. (2022)) requires knowledge of the explorability parameter $\nu_{\min}$ and has a sample complexity of $\widetilde{O}(1/(\epsilon^2\nu_{\min}^2))$ for any $\epsilon > 0$. We leave detailed comparisons with Huang et al. (2022) to Appendix G.

Sample complexity in the large-$\epsilon$ regime. For the case where $\epsilon$ is larger than the threshold $\frac{H(\lambda^\star)^2}{C_4 d^{7/2}\log(1/\lambda^\star)}$, we can run the procedure with $\epsilon = \frac{H(\lambda^\star)^2}{C_4 d^{7/2}\log(1/\lambda^\star)}$, and the sample complexity will be $\widetilde{O}(\frac{d^9H^3}{(\lambda^\star)^4})$. So the overall sample complexity for any $\epsilon > 0$ can be bounded by $\widetilde{O}(\frac{d^2H^5}{\epsilon^2} + \frac{d^9H^3}{(\lambda^\star)^4})$. This effectively says that the algorithm requires a "burn-in" period before obtaining non-trivial results. Similar limitations were observed for linear MDPs before (Huang et al., 2022; Wagenmaker & Jamieson, 2022), so this is not a limitation of our analysis.

Comparison to Qiao et al. (2022). Algorithm 4 (LARFE) of Qiao et al. (2022) tackles reward-free exploration under tabular MDPs in $O(H)$ deployments while collecting $\widetilde{O}(\frac{S^2AH^5}{\epsilon^2})$ trajectories. We generalize their result to reward-free RL under linear MDPs with the same deployment complexity. More importantly, although a naive instantiation of our main theorem to the tabular MDP only gives $\widetilde{O}(\frac{S^2A^2H^5}{\epsilon^2})$, a small modification of an intermediate argument gives the same $\widetilde{O}(\frac{S^2AH^5}{\epsilon^2})$, which matches the best-known results for tabular MDPs. More details are discussed in Section 7.1.

6. PROOF SKETCH

In this part, we sketch the proof of Theorem 5.1. The notations $\iota$, $\varepsilon$, $C_i$ ($i\in[4]$), $\Pi^{\mathrm{exp}}$, $\Pi^{\mathrm{eval}}$, $\widehat{\Sigma}_\pi$ and $\widehat{\mathbb{E}}_\pi$ are defined in Algorithm 1. We start with the analysis of the deployment complexity.

Deployment complexity. Since for each layer $h \in [H]$ we deploy only one stochastic policy $\pi_h$ for exploration, the deployment complexity is $H$. Next we focus on the sample complexity.

Sample complexity. Our sample complexity bound is proven by induction. With the choice of $\varepsilon$ and $N$ from Algorithm 1, let $\Lambda^k_h$ be the empirical covariance matrix of layer $h$ built from the data collected up to the $k$-th deployment (the detailed definition is deferred to Appendix F.4). We assume the induction hypothesis
$$\max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}} \mathbb{E}_\pi\Big[\sum_{h'=1}^{h-1} \|\phi(s_{h'},a_{h'})\|_{(\Lambda^{h-1}_{h'})^{-1}}\Big] \le (h-1)\varepsilon$$
and prove that, with high probability, $\max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}} \mathbb{E}_\pi \|\phi(s_h,a_h)\|_{(\Lambda^h_h)^{-1}} \le \varepsilon$. Since the induction hypothesis implies that the uncertainty of the first $h-1$ layers is small, we have the following key lemma bounding the estimation error of $\widehat{\Sigma}_\pi$ from (1).

Lemma 6.1. With high probability, for all $\pi \in \Delta(\Pi^{\mathrm{exp}}_{\epsilon/3})$, $\|\widehat{\Sigma}_\pi - \mathbb{E}_\pi[\phi_h\phi_h^\top]\|_2 \le \frac{C_3 d^2 H\varepsilon\iota}{4}$.

By our assumption on $\epsilon$, the optimal policy for exploration $\bar{\pi}^\star_h$ (the solution of the actual minimization problem; see (39) in the Appendix) satisfies $\lambda_{\min}(\mathbb{E}_{\bar{\pi}^\star_h}[\phi_h\phi_h^\top]) \ge \frac{5C_3 d^2 H\varepsilon\iota}{4}$. Therefore, $\bar{\pi}^\star_h$ is a feasible solution of (1), and it holds that
$$\max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}} \widehat{\mathbb{E}}_\pi\big[\phi(s_h,a_h)^\top (N\cdot\widehat{\Sigma}_{\pi_h})^{-1}\phi(s_h,a_h)\big] \le \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}} \widehat{\mathbb{E}}_\pi\big[\phi(s_h,a_h)^\top (N\cdot\widehat{\Sigma}_{\bar{\pi}^\star_h})^{-1}\phi(s_h,a_h)\big].$$
Moreover, by matrix concentration and Lemma 6.1, we can prove that (writing $\Sigma_{\bar{\pi}^\star_h} = \mathbb{E}_{\bar{\pi}^\star_h}[\phi_h\phi_h^\top]$)
$$\big(\tfrac{4}{5}\Sigma_{\bar{\pi}^\star_h}\big)^{-1} \succeq \big(\widehat{\Sigma}_{\bar{\pi}^\star_h}\big)^{-1} \quad\text{and}\quad \big(N\cdot\widehat{\Sigma}_{\pi_h}\big)^{-1} \succeq \big(2\Lambda^h_h\big)^{-1}.$$
In addition, similar to the estimation error of $\widehat{\Sigma}_\pi$, the following lemma bounds the estimation error of $\widehat{\mathbb{E}}_\pi[\phi(s_h,a_h)^\top(N\cdot\widehat{\Sigma}_{\bar{\pi}})^{-1}\phi(s_h,a_h)]$ from (1).

Lemma 6.2. With high probability, for all $\pi \in \Pi^{\mathrm{exp}}_{\epsilon/3}$ and $\bar{\pi} \in \Delta(\Pi^{\mathrm{exp}}_{\epsilon/3})$ such that $\lambda_{\min}(\widehat{\Sigma}_{\bar{\pi}}) \ge C_3 d^2 H\varepsilon\iota$,
$$\Big|\widehat{\mathbb{E}}_\pi\big[\phi(s_h,a_h)^\top (N\cdot\widehat{\Sigma}_{\bar{\pi}})^{-1}\phi(s_h,a_h)\big] - \mathbb{E}_\pi\big[\phi(s_h,a_h)^\top (N\cdot\widehat{\Sigma}_{\bar{\pi}})^{-1}\phi(s_h,a_h)\big]\Big| \le \frac{\varepsilon^2}{2d^2} \le \frac{\varepsilon^2}{8}.$$
The proof is by direct calculation; details are deferred to Appendix F.6.

With all the conclusions above, we have ($\Sigma_\pi$ is short for $\mathbb{E}_\pi[\phi_h\phi_h^\top]$):
$$\begin{aligned}
\frac{3\varepsilon^2}{8} &\ge \frac{5d}{4N} + \frac{\varepsilon^2}{8} \ge \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}} \mathbb{E}_\pi\big[\phi(s_h,a_h)^\top \big(\tfrac{4N}{5}\Sigma_{\bar{\pi}^\star_h}\big)^{-1}\phi(s_h,a_h)\big] + \frac{\varepsilon^2}{8}\\
&\ge \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}} \mathbb{E}_\pi\big[\phi(s_h,a_h)^\top \big(N\cdot\widehat{\Sigma}_{\bar{\pi}^\star_h}\big)^{-1}\phi(s_h,a_h)\big] + \frac{\varepsilon^2}{8}\\
&\ge \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}} \widehat{\mathbb{E}}_\pi\big[\phi(s_h,a_h)^\top \big(N\cdot\widehat{\Sigma}_{\bar{\pi}^\star_h}\big)^{-1}\phi(s_h,a_h)\big]\\
&\ge \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}} \widehat{\mathbb{E}}_\pi\big[\phi(s_h,a_h)^\top \big(N\cdot\widehat{\Sigma}_{\pi_h}\big)^{-1}\phi(s_h,a_h)\big]\\
&\ge \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}} \mathbb{E}_\pi\big[\phi(s_h,a_h)^\top \big(N\cdot\widehat{\Sigma}_{\pi_h}\big)^{-1}\phi(s_h,a_h)\big] - \frac{\varepsilon^2}{8}\\
&\ge \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}} \mathbb{E}_\pi\big[\phi(s_h,a_h)^\top \big(2\Lambda^h_h\big)^{-1}\phi(s_h,a_h)\big] - \frac{\varepsilon^2}{8}\\
&\ge \frac{1}{2}\Big(\max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}} \mathbb{E}_\pi \|\phi(s_h,a_h)\|_{(\Lambda^h_h)^{-1}}\Big)^2 - \frac{\varepsilon^2}{8},
\end{aligned}$$
where the last step follows from Jensen's inequality. As a result, $\max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}} \mathbb{E}_\pi\|\phi(s_h,a_h)\|_{(\Lambda^h_h)^{-1}} \le \varepsilon$ and the induction holds. Together with the fact that $\Pi^{\mathrm{eval}}_{\epsilon/3}$ is a subset of $\Pi^{\mathrm{exp}}_{\epsilon/3}$, we have
$$\max_{\pi\in\Pi^{\mathrm{eval}}_{\epsilon/3}} \mathbb{E}_\pi\Big[\sum_{h=1}^H \|\phi(s_h,a_h)\|_{\Lambda_h^{-1}}\Big] \le H\varepsilon.$$
We then have the following lemma.

Lemma 6.3. With high probability, for all $\pi \in \Pi^{\mathrm{eval}}_{\epsilon/3}$ and all $r$, $|\widehat{V}^\pi(r) - V^\pi(r)| \le O(H\sqrt{d})\cdot H\varepsilon \le \frac{\epsilon}{3}$.

Finally, since $\Pi^{\mathrm{eval}}_{\epsilon/3}$ contains an $\epsilon/3$-optimal policy, the greedy policy with respect to $\widehat{V}^\pi(r)$ is $\epsilon$-optimal.

7. SOME DISCUSSIONS

In this section, we discuss some interesting extensions of our main results.

7.1. APPLICATION TO TABULAR MDP

Under the special case where the linear MDP is actually a tabular MDP and the feature map is the canonical basis (Jin et al., 2020b), our Algorithms 1 and 2 are still provably efficient. Suppose the tabular MDP has a discrete state-action space with cardinalities $|S| = S$ and $|A| = A$, and let $d_m = \min_{h} \sup_\pi \min_{s,a} d^\pi_h(s,a) > 0$, where $d^\pi_h$ is the occupancy measure. Then the following theorem holds.

Theorem 7.1 (Informal version of Theorem H.2). With minor revisions to Algorithms 1 and 2, when $\epsilon$ is small compared to $d_m$, our algorithms solve reward-free exploration under tabular MDPs within $H$ deployments, and the sample complexity is bounded by $\widetilde{O}(\frac{S^2AH^5}{\epsilon^2})$.

The detailed version and proof of Theorem 7.1 are deferred to Appendix H.1 due to the space limit. We highlight that we recover the best known result from Qiao et al. (2022) under a mild assumption about reachability of all (state, action) pairs. The replacement of one $d$ by $S$ is mainly because under tabular MDPs there are $A^S$ different deterministic policies for each layer, so the log-covering number of $\Pi^{\mathrm{eval}}_h$ can be improved from $O(d)$ to $O(S)$ (note $\log(A^S) = S\log A$, while $d = SA$ under the canonical-basis embedding). In this way, we effectively save a factor of $A$.

7.2. COMPUTATIONAL EFFICIENCY

We admit that solving the optimization problem (1) is inefficient in general, although it can be solved approximately in exponential time by enumerating π̃ over a tight covering set of ∆(Π^exp_{ϵ/3}). Note that the issue of computational tractability arises in many previous works (Zanette et al., 2020a; Wagenmaker & Jamieson, 2022) that focus on information-theoretic results under linear MDPs, and this issue is usually not considered a fundamental barrier. As an efficient surrogate for (1), one possible method is to apply a softmax (or another differentiable) representation of the policy space and use gradient-based optimization techniques to find an approximate solution of (1).
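As a rough illustration of this surrogate, the sketch below builds a toy instance (made-up feature vectors and policies, not our actual algorithm) and minimizes the worst-case design objective max_π E_π[ϕ^⊤ V(μ)^{-1} ϕ] over a softmax-parameterized distribution μ on the policy set, using a numerical subgradient. By the generalized design result (Theorem B.1 in Appendix B), the optimal value here equals d = 2.

```python
import numpy as np

# Toy instance: 3 "policies", each a fixed distribution over 4 feature vectors in R^2.
actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.5, -0.5]])
policies = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
])

def cov(mu):
    w = mu @ policies                     # induced distribution over feature vectors
    return (actions.T * w) @ actions      # V(mu) = E_mu[a a^T]

def g(mu):
    Vinv = np.linalg.inv(cov(mu))
    quad = np.einsum('ij,jk,ik->i', actions, Vinv, actions)  # a^T V^-1 a
    return (policies @ quad).max()        # worst-case policy objective

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

# Subgradient descent on the softmax parameters with a numerical gradient.
theta = np.zeros(3)
for k in range(1500):
    grad = np.zeros(3)
    for i in range(3):
        e = np.zeros(3); e[i] = 1e-5
        grad[i] = (g(softmax(theta + e)) - g(softmax(theta - e))) / 2e-5
    theta -= 0.5 / np.sqrt(k + 1) * grad

print(g(softmax(theta)))  # approaches the design bound d = 2
```

The max objective is non-smooth, so this is a subgradient method with decaying steps; any differentiable smoothing of the max (e.g., log-sum-exp) would also work.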

7.3. POSSIBLE EXTENSIONS TO REGRET MINIMIZATION WITH LOW ADAPTIVITY

In this paper, we tackle the problem of deployment efficient reward-free exploration, while the optimal adaptivity under regret minimization remains open. We remark that deployment complexity is not an ideal measure of adaptivity for this problem, since its definition requires all deployments to have similar sizes, which forces the deployment complexity to be Ω(√T) if we want a regret bound of order O(√T). Therefore, the more reasonable task is to design algorithms with near optimal switching cost or batch complexity. We present the following two lower bounds, whose proofs are deferred to Appendix H.2. Here the number of episodes is K and the number of steps is T := KH. Theorem 7.2. For any algorithm achieving the optimal O(√(poly(d,H)·T)) regret bound, the switching cost is at least Ω(dH log log T). Theorem 7.3. For any algorithm achieving the optimal O(√(poly(d,H)·T)) regret bound, the number of batches is at least Ω(H/log_d T + log log T). To generalize our Algorithm 1 to regret minimization, what remains is to remove Assumption 2.1. If we could do accurate uniform policy evaluation (as in Algorithm 2) with low adaptivity and without any explorability assumption on the policy set, then we could apply iterative policy elimination (i.e., eliminate the policies that cannot be optimal) and explore with the remaining policies. Although Assumption 2.1 is common in the relevant literature, intuitively it should not be necessary: under a linear MDP, if some direction is hard to encounter, we do not need to gather much information in that direction. Under tabular MDPs, Qiao et al. (2022) used an absorbing MDP to ignore those "hard to visit" states, and we leave the generalization of this idea as future work.

8. CONCLUSION

In this work, we studied the well-motivated problem of deployment efficient reward-free RL with linear function approximation. Under the linear MDP model, we designed a novel reward-free exploration algorithm that collects O(d²H⁵/ϵ²) trajectories in only H deployments; both the sample and deployment complexities are near optimal. An interesting future direction is to design algorithms that match our lower bounds for regret minimization with low adaptivity. We believe the techniques we develop (generalized G-optimal design and exploration-preserving policy discretization) can serve as basic building blocks, and we leave their generalization as future work.

Low regret reinforcement learning algorithms. Regret minimization under tabular MDPs has been extensively studied by a long line of works (Brafman & Tennenholtz, 2002; Kearns & Singh, 2002; Jaksch et al., 2010; Osband et al., 2013; Agrawal & Jia, 2017; Jin et al., 2018). Under linear MDPs, regret of order O(√(d²H²T)) has been achieved via computationally efficient algorithms. There are other works studying the linear mixture MDP setting (Ayoub et al., 2020; Zhou et al., 2021; Zhang et al., 2021b) or more general settings such as MDPs with low Bellman Eluder dimension (Jin et al., 2021).

Reward-free exploration. Jin et al. (2020a) first studied the problem of reward-free exploration; their algorithm uses EULER (Zanette & Brunskill, 2019) for exploration and attains a sample complexity of O(S²AH⁵/ϵ²). This was improved by Kaufmann et al. (2021) to O(S²AH⁴/ϵ²) by building upper confidence bounds for any reward function and any policy. Finally, the minimax optimal result O(S²AH³/ϵ²) was derived by Ménard et al. (2021) by constructing a novel exploration bonus. At the same time, a more general optimal result was achieved by Zhang et al. (2020b), who considered MDPs with stationary transition kernels and uniformly bounded rewards. Zhang et al.
(2020a) studied a similar setting named task-agnostic exploration and designed an algorithm that finds ϵ-optimal policies for N arbitrary tasks within at most O(SAH⁵ log N/ϵ²) episodes. For the linear MDP setting, Wang et al. (2020) generalized LSVI-UCB and obtained a sample complexity of O(d³H⁶/ϵ²), which Zanette et al. (2020b) improved to O(d³H⁵/ϵ²) by approximating a G-optimal design. Recently, Wagenmaker et al. (2022b) performed exploration with a first-order regret algorithm (Wagenmaker et al., 2022a) and achieved a sample complexity of O(d²H⁵/ϵ²), which matches their lower bound Ω(d²H²/ϵ²) up to H factors. There are other reward-free works under linear mixture MDPs (Chen et al., 2021; Zhang et al., 2021a). Meanwhile, a recent line of work aims at reward-free exploration under low adaptivity: Huang et al. (2022) and Qiao et al. (2022) designed provably efficient algorithms for linear MDPs and tabular MDPs, respectively.

Low switching algorithms for bandits and RL. There are two notions of switching cost. Global switching cost simply counts the number of policy switches, while local switching cost is defined (only under tabular MDPs) as N^local_switch = Σ_{k=1}^{K-1} |{(h,s) ∈ [H]×S : π^h_k(s) ≠ π^h_{k+1}(s)}|.

Batched bandits and RL. In batched bandit problems, the agent decides a sequence of arms and observes the rewards only after all arms in that sequence are pulled. More formally, at the beginning of each batch, the agent commits to a list of arms to pull; afterwards, the list of (arm, reward) pairs is revealed, and the agent decides the next batch (Esfandiari et al., 2021). The batch sizes can be chosen non-adaptively or adaptively: in a non-adaptive algorithm, the batch sizes must be decided before the algorithm starts, while in an adaptive algorithm, they may depend on previous observations. Under multi-armed bandits with A arms and T episodes, batched algorithms were studied by Cesa-Bianchi et al. (2013), and a later result (2021) improved this by using weaker assumptions. For the batched RL setting, Qiao et al. (2022) showed that their algorithm uses the optimal O(H + log log T) batches to achieve the optimal O(√T) regret. Recently, the regret bound and computational efficiency were improved by Zhang et al. (2022) by incorporating ideas from optimal experimental design. The deployment efficient algorithms for pure exploration by Huang et al. (2022) also satisfy the definition of batched RL.

B GENERALIZATION OF G-OPTIMAL DESIGN

Traditional G-optimal design. We first briefly introduce the G-optimal design problem. Assume there is some (possibly infinite) set A ⊆ R^d, and let π : A → [0,1] be a distribution on A, so that Σ_{a∈A} π(a) = 1. Define V(π) ∈ R^{d×d} and g(π) ∈ R by

V(π) = Σ_{a∈A} π(a) a a^⊤,   g(π) = max_{a∈A} ∥a∥²_{V(π)^{-1}}.

The problem of finding a design π that minimizes g(π) is called the G-optimal design problem. G-optimal design has wide applications in regression problems, and it can be used to solve the linear bandit problem (Lattimore & Szepesvári, 2020). However, traditional G-optimal design cannot tackle our problem under linear MDPs, where we can only choose a policy π instead of choosing the feature vector ϕ directly. In this section, we generalize the well-known G-optimal design for our purpose under linear MDPs. Consider the following problem: under some fixed linear MDP, given a fixed finite policy set Π, we want to select a policy π₀ from ∆(Π) (the set of distributions over the policy set Π) to minimize

max_{π∈Π} E_π[ϕ(s_h,a_h)^⊤ (E_{π₀} ϕ_h ϕ_h^⊤)^{-1} ϕ(s_h,a_h)],   (2)

where (s_h, a_h) follows the distribution induced by π and ϕ_h follows the distribution induced by π₀. We first consider two special cases. Special case 1. If the MDP is deterministic, then given any fixed deterministic policy π, the trajectory generated by π is deterministic, and so is the feature ϕ_h at layer h. Denoting the feature at layer h under policy π by ϕ_{π,h}, problem (2) reduces to

min_{π₀∈∆(Π)} max_{π∈Π} ϕ_{π,h}^⊤ (E_{π₀} ϕ_h ϕ_h^⊤)^{-1} ϕ_{π,h},   (3)

which is characterized by traditional G-optimal design; for more details, please refer to Kiefer & Wolfowitz (1960) and Chapter 21 of Lattimore & Szepesvári (2020). According to Theorem 21.1 of Lattimore & Szepesvári (2020), the minimum of (3) is bounded by d, the dimension of the feature map ϕ.
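For intuition, the Kiefer-Wolfowitz guarantee g(π⋆) = d can be reproduced numerically with Frank-Wolfe iterations, a standard approach for approximating G-optimal designs (the finite action set below is synthetic and purely illustrative):

```python
import numpy as np

# Frank-Wolfe iterations for the classic G-optimal design on a finite set A ⊂ R^d.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))          # 50 candidate vectors, d = 3
pi = np.full(len(A), 1.0 / len(A))    # start from the uniform design

def V(pi):
    return (A.T * pi) @ A             # V(pi) = sum_a pi(a) a a^T

for k in range(500):
    Vinv = np.linalg.inv(V(pi))
    quad = np.einsum('ij,jk,ik->i', A, Vinv, A)   # ||a||^2_{V(pi)^{-1}}
    i = np.argmax(quad)               # most uncertain direction
    gamma = 1.0 / (k + 2)             # standard Frank-Wolfe step size
    pi = (1 - gamma) * pi
    pi[i] += gamma

g = np.max(np.einsum('ij,jk,ik->i', A, np.linalg.inv(V(pi)), A))
print(g)  # Kiefer-Wolfowitz: the optimal design satisfies g = d = 3
```

Note that g(π) ≥ d always holds (the π-average of ∥a∥²_{V(π)^{-1}} equals Tr(I_d) = d), so the iterate can only approach d from above.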
Different from these two special cases, under our general linear MDP setup the feature map can be much more complex than the canonical basis, and running each π leads to a distribution over the feature space rather than a single fixed feature. Next, we formalize the problem setup and present the theorem. We are given a finite policy set Π and a finite action set Φ (we consider only finite action sets; the general case can be proven similarly by passing to the limit (Lattimore & Szepesvári, 2020)), where each π ∈ Π is a distribution over Φ (with π(a) denoting the probability of choosing action a) and each action a ∈ Φ is a vector in R^d. In addition, μ can be any distribution over Π. In the following, we characterize μ as a vector in R^{|Π|}, with μ(π) denoting the probability of choosing policy π.


Let Λ(π) = Σ_{a∈Φ} π(a) a a^⊤ and V(μ) = Σ_{π∈Π} μ(π) Λ(π) = Σ_{π∈Π} μ(π) Σ_{a∈Φ} π(a) a a^⊤. The function we want to minimize is

g(μ) = max_{π∈Π} Σ_{a∈Φ} π(a) a^⊤ V(μ)^{-1} a.

Theorem B.1. Define the set Φ̃ = {a ∈ Φ : ∃ π ∈ Π, π(a) > 0}. If span(Φ̃) = R^d, there exists a distribution μ⋆ over Π such that g(μ⋆) ≤ d.

Proof of Theorem B.1. Define f(μ) = log det V(μ) and take μ⋆ = argmax_μ f(μ). According to Exercise 21.2 of Lattimore & Szepesvári (2020), f is concave. Besides, according to Exercise 21.1 of Lattimore & Szepesvári (2020), we have

d/dt log det(A(t)) = (1/det(A(t))) · Tr(adj(A(t)) · dA(t)/dt) = Tr(A(t)^{-1} · dA(t)/dt).

Plugging in f, we directly have

(∇f(μ))_π = Tr(V(μ)^{-1} Λ(π)) = Σ_{a∈Φ} π(a) a^⊤ V(μ)^{-1} a.

In addition, by direct calculation, for any feasible μ,

Σ_{π∈Π} μ(π) (∇f(μ))_π = Tr( Σ_{π∈Π} μ(π) Σ_{a∈Φ} π(a) a a^⊤ V(μ)^{-1} ) = Tr(I_d) = d.

Since μ⋆ is the maximizer of f, the first-order optimality criterion gives, for any feasible μ,

0 ≥ ⟨∇f(μ⋆), μ − μ⋆⟩ = Σ_{π∈Π} μ(π) Σ_{a∈Φ} π(a) a^⊤ V(μ⋆)^{-1} a − Σ_{π∈Π} μ⋆(π) Σ_{a∈Φ} π(a) a^⊤ V(μ⋆)^{-1} a = Σ_{π∈Π} μ(π) Σ_{a∈Φ} π(a) a^⊤ V(μ⋆)^{-1} a − d.

For any π ∈ Π, choosing μ to be the Dirac distribution at π proves that Σ_{a∈Φ} π(a) a^⊤ V(μ⋆)^{-1} a ≤ d. By the definition of g(μ⋆), we conclude that g(μ⋆) ≤ d.

Remark B.2. By replacing the action set Φ with the set of all feasible features at layer h, Theorem B.1 shows that for any linear MDP and fixed policy set Π,

min_{π₀∈∆(Π)} max_{π∈Π} E_π[ϕ(s_h,a_h)^⊤ (E_{π₀} ϕ_h ϕ_h^⊤)^{-1} ϕ(s_h,a_h)] ≤ d.

This theorem serves as one of the critical theoretical bases for our analysis.

Remark B.3. Although the proof is similar to Theorem 21.1 of Lattimore & Szepesvári (2020), our Theorem B.1 is more general, since it also covers the case where each π generates a distribution over the action space. In contrast, G-optimal design is the special case of our setting where each π generates a single fixed action.
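The proof above is constructive enough to compute. The sketch below (synthetic features and policies) maximizes f(μ) = log det V(μ) using a multiplicative-weights update built from the gradient formula in the proof — a standard multiplicative algorithm for D-optimal design, not something the proof itself requires — and then checks the conclusion g(μ⋆) ≤ d:

```python
import numpy as np

# Maximize f(mu) = log det V(mu) over distributions mu on the policy set,
# using the gradient formula from the proof: (grad f)_pi = sum_a pi(a) a^T V^{-1} a.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(20, 4))                    # action/feature set, d = 4
Pi = rng.dirichlet(np.ones(20), size=10)          # 10 stochastic policies over Phi

def V(mu):
    weights = mu @ Pi                              # induced action distribution
    return (Phi.T * weights) @ Phi

mu = np.full(10, 0.1)
for _ in range(500):
    Vinv = np.linalg.inv(V(mu))
    quad = np.einsum('ij,jk,ik->i', Phi, Vinv, Phi)
    grad = Pi @ quad                               # (grad f)_pi
    mu = mu * grad / 4.0                           # multiplicative update (d = 4)
    mu /= mu.sum()                                 # guard against numeric drift

g = np.max(Pi @ np.einsum('ij,jk,ik->i', Phi, np.linalg.inv(V(mu)), Phi))
print(g)  # Theorem B.1: at the maximizer of log det V, g(mu*) <= d = 4
```

The identity Σ_π μ(π)(∇f(μ))_π = d from the proof also shows g(μ) ≥ d for every μ, so the printed value sits at d up to the optimization error.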
Knowing the existence of such a covering policy, the next lemma provides properties of the solution of (2) under an additional assumption.

Lemma B.4. Let π⋆ = argmin_{π₀∈∆(Π)} max_{π∈Π} E_π[ϕ(s_h,a_h)^⊤ (E_{π₀} ϕ_h ϕ_h^⊤)^{-1} ϕ(s_h,a_h)]. Assume that sup_{π∈∆(Π)} λ_min(E_π ϕ_h ϕ_h^⊤) ≥ λ⋆. Then it holds that

λ_min(E_{π⋆} ϕ_h ϕ_h^⊤) ≥ λ⋆/d,   (6)

where d is the dimension of ϕ and λ_min denotes the minimum eigenvalue.

Before stating the proof, we describe the special case where the MDP is tabular. There, the condition implies that there exists some policy π ∈ ∆(Π) such that d^π_h(s,a) ≥ λ⋆ for all (s,a) ∈ S×A, where d^π_h(·,·) is the occupancy measure. Due to Theorem B.1, π⋆ satisfies

max_{π∈Π} Σ_{(s,a)∈S×A} d^π_h(s,a) / d^{π⋆}_h(s,a) ≤ SA.

For any (s,a) ∈ S×A, choose π_{s,a} = argmax_{π∈Π} d^π_h(s,a); then d^{π_{s,a}}_h(s,a) ≥ d^π_h(s,a) ≥ λ⋆. Therefore, d^{π⋆}_h(s,a) ≥ λ⋆/(SA) for all (s,a), which is equivalent to the conclusion of (6).

Proof of Lemma B.4. If conclusion (6) does not hold, we have λ_min(E_{π⋆} ϕ_h ϕ_h^⊤) < λ⋆/d, which implies λ_max((E_{π⋆} ϕ_h ϕ_h^⊤)^{-1}) > d/λ⋆. Denote the eigenvalues of (E_{π⋆} ϕ_h ϕ_h^⊤)^{-1} by 0 < λ₁ ≤ λ₂ ≤ ··· ≤ λ_d, and let {φ̃_i}_{i∈[d]} be a corresponding set of orthonormal eigenvectors. By the condition, there exists π̃ ∈ ∆(Π) such that λ_min(E_π̃ ϕ_h ϕ_h^⊤) ≥ λ⋆. Therefore, for any ϕ ∈ R^d with ∥ϕ∥₂ = 1, ϕ^⊤(E_π̃ ϕ_h ϕ_h^⊤)ϕ = E_π̃[(ϕ_h^⊤ϕ)²] ≥ λ⋆. Now consider E_π̃[ϕ_h^⊤ (E_{π⋆})^{-1} ϕ_h], where E_{π⋆} is short for E_{π⋆} ϕ_h ϕ_h^⊤. It holds that

E_π̃[ϕ_h^⊤ (E_{π⋆})^{-1} ϕ_h]
= E_π̃[ (Σ_{i=1}^d (ϕ_h^⊤φ̃_i) φ̃_i)^⊤ (E_{π⋆})^{-1} (Σ_{i=1}^d (ϕ_h^⊤φ̃_i) φ̃_i) ]
= E_π̃[ Σ_{i=1}^d (ϕ_h^⊤φ̃_i)² φ̃_i^⊤ (E_{π⋆})^{-1} φ̃_i ]
≥ E_π̃[ (ϕ_h^⊤φ̃_d)² φ̃_d^⊤ (E_{π⋆})^{-1} φ̃_d ]
> λ⋆ × d/λ⋆ = d,

where the first equation is due to the fact that {φ̃_i}_{i∈[d]} forms an orthonormal basis, the second equation results from the definition of eigenvectors, and the last inequality uses our assumption (λ_max((E_{π⋆} ϕ_h ϕ_h^⊤)^{-1}) > d/λ⋆) and the condition (for all ∥ϕ∥₂ = 1, ϕ^⊤(E_π̃ ϕ_h ϕ_h^⊤)ϕ = E_π̃[(ϕ_h^⊤ϕ)²] ≥ λ⋆). Finally, since this contradicts Theorem B.1, the proof is complete.

C CONSTRUCTION OF POLICY SETS

In this section, we construct policy sets given the feature map ϕ(•, •). We begin with several technical lemmas.

C.1 TECHNICAL LEMMAS

Lemma C.1 (Covering Number of the Euclidean Ball (Jin et al., 2020b)). For any ϵ > 0, the ϵ-covering number of the Euclidean ball in R^d with radius R > 0 is upper bounded by (1 + 2R/ϵ)^d.

Lemma C.2 (Lemma B.1 of Jin et al. (2020b)). Let w^π_h denote the weights such that Q^π_h(s,a) = ⟨ϕ(s,a), w^π_h⟩. Then ∥w^π_h∥₂ ≤ 2H√d.

Lemma C.3 (Advantage Decomposition). For any MDP with fixed initial state s₁ and any policy π, it holds that

V⋆₁(s₁) − V^π₁(s₁) = E_π[ Σ_{h=1}^H (V⋆_h(s_h) − Q⋆_h(s_h,a_h)) ],

where the expectation means that (s_h, a_h) follows the distribution generated by π.

Proof of Lemma C.3.

V⋆₁(s₁) − V^π₁(s₁)
= E_π[V⋆₁(s₁) − Q⋆₁(s₁,a₁)] + E_π[Q⋆₁(s₁,a₁) − Q^π₁(s₁,a₁)]
= E_π[V⋆₁(s₁) − Q⋆₁(s₁,a₁)] + E_{s₁,a₁∼π}[ Σ_{s'∈S} P₁(s'|s₁,a₁)(V⋆₂(s') − V^π₂(s')) ]
= E_π[V⋆₁(s₁) − Q⋆₁(s₁,a₁)] + E_{s₂∼π}[V⋆₂(s₂) − V^π₂(s₂)]
= ··· = E_π[ Σ_{h=1}^H (V⋆_h(s_h) − Q⋆_h(s_h,a_h)) ],

where the second equation is due to the Bellman equation and the fourth equation results from applying the decomposition recursively from h = 1 to H.

Lemma C.4. Let {X_t}_{t∈[T]} be a sequence of d×d positive semi-definite matrices with Tr(X_t) ≤ 1, and define M₀ = I and M_t = M_{t−1} + X_t. Then Σ_{t=1}^T Tr(X_t M_{t−1}^{-1}) ≤ 2d log(1 + T/d).
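The elliptical-potential bound of Lemma C.4 can be sanity-checked numerically on a synthetic stream of rank-one updates built from unit-norm feature vectors (so Tr(X_t) ≤ 1 holds):

```python
import numpy as np

# Numerical check of Lemma C.4: with M_0 = I and M_t = M_{t-1} + X_t,
# sum_t Tr(X_t M_{t-1}^{-1}) <= 2 d log(1 + T/d).
rng = np.random.default_rng(2)
d, T = 5, 200
M = np.eye(d)
total = 0.0
for _ in range(T):
    phi = rng.normal(size=d)
    phi /= np.linalg.norm(phi)        # ||phi|| = 1, so Tr(X_t) = 1
    X = np.outer(phi, phi)            # rank-one PSD update
    total += np.trace(X @ np.linalg.inv(M))
    M += X
print(total, 2 * d * np.log(1 + T / d))  # the accumulated potential stays below the bound
```

Rank-one updates are exactly the case used in the analysis, where X_t = E_{π_t}[ϕ_h ϕ_h^⊤] summed over exploration rounds.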

C.2 CONSTRUCTION OF POLICIES TO EVALUATE

We construct the policy set Π^eval given the feature map ϕ(·,·). The policy set Π^eval satisfies that, for any feasible linear MDP with feature map ϕ, Π^eval contains a near-optimal policy of this linear MDP. We begin with the construction. Construction of Π^eval. Given ϵ > 0, let W be an ϵ/(2H)-cover of the Euclidean ball B_d(2H√d) := {x ∈ R^d : ∥x∥₂ ≤ 2H√d}. Next, we construct the Q-function set Q = {Q̃(s,a) = ϕ(s,a)^⊤w : w ∈ W}.

Then the policy set at layer h is defined as

∀ h ∈ [H], Π_h = {π(s) = argmax_{a∈A} Q̃(s,a) : Q̃ ∈ Q}, with ties broken arbitrarily. Finally, the policy set Π^eval_ϵ is Π^eval_ϵ = Π₁ × Π₂ × ··· × Π_H.

Lemma C.5. The policy set Π^eval_ϵ satisfies that for any h ∈ [H], log|Π_h| ≤ d log(1 + 8H²√d/ϵ) = O(d). In addition, for any linear MDP with feature map ϕ(·,·), there exists π = (π₁, π₂, ···, π_H) such that π_h ∈ Π_h for all h ∈ [H] and V^π ≥ V⋆ − ϵ.

Proof of Lemma C.5. Since W is an ϵ/(2H)-cover of the Euclidean ball, by Lemma C.1 we have log|W| ≤ d log(1 + 8H²√d/ϵ). In addition, each w ∈ W corresponds to at most one Q̃ ∈ Q and one π_h ∈ Π_h. Therefore, for any h ∈ [H], log|Π_h| ≤ log|Q| ≤ log|W| ≤ d log(1 + 8H²√d/ϵ). For any linear MDP, according to Lemma C.2, the optimal Q-function can be written as Q⋆_h(s,a) = ⟨ϕ(s,a), w⋆_h⟩ with ∥w⋆_h∥₂ ≤ 2H√d. Since W is an ϵ/(2H)-cover of the Euclidean ball, for any h ∈ [H] there exists w̃_h ∈ W such that ∥w̃_h − w⋆_h∥₂ ≤ ϵ/(2H). Select Q̃_h(s,a) = ϕ(s,a)^⊤w̃_h from Q and π_h(s) = argmax_{a∈A} Q̃_h(s,a) from Π_h. Note that for any (h,s,a) ∈ [H]×S×A,

|Q⋆_h(s,a) − Q̃_h(s,a)| ≤ ∥ϕ(s,a)∥₂ · ∥w⋆_h − w̃_h∥₂ ≤ ϵ/(2H).   (8)

Let π = (π₁, π₂, ···, π_H); we now prove that this π is ϵ-optimal. Denote the optimal policy under this linear MDP by π⋆. Then for any (s,h) ∈ S×[H],

Q⋆_h(s,π⋆_h(s)) − Q⋆_h(s,π_h(s)) = [Q⋆_h(s,π⋆_h(s)) − Q̃_h(s,π⋆_h(s))] + [Q̃_h(s,π⋆_h(s)) − Q̃_h(s,π_h(s))] + [Q̃_h(s,π_h(s)) − Q⋆_h(s,π_h(s))] ≤ ϵ/(2H) + 0 + ϵ/(2H) = ϵ/H,   (9)

where the inequality results from the definition of π_h and (8). Applying the advantage decomposition (Lemma C.3), it holds that

V⋆₁(s₁) − V^π₁(s₁) = E_π[ Σ_{h=1}^H (V⋆_h(s_h) − Q⋆_h(s_h,a_h)) ] ≤ H · ϵ/H = ϵ,

where the inequality comes from (9).

Remark C.6. Our concurrent work Wagenmaker & Jamieson (2022) also applies the idea of policy discretization.
However, to cover the ϵ-optimal policies of all linear MDPs, the size of their policy set satisfies log|Π_ϵ| ≤ O(dH² · log(1/ϵ)) (stated in Corollary 1 of Wagenmaker & Jamieson (2022)). In comparison, our Π^eval_ϵ satisfies log|Π^eval_ϵ| ≤ H log|Π₁| ≤ O(dH · log(1/ϵ)), which improves their result by a factor of H. This improvement comes from applying the advantage decomposition. Finally, plugging our Π^eval_ϵ into Corollary 2 of Wagenmaker & Jamieson (2022) directly improves their worst-case bound by a factor of H.
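To make the discretization step concrete, here is a toy sketch (a grid-based cover and synthetic features, purely illustrative — not the covering used in the analysis): a policy in Π_h is the greedy policy of some Q̃(s,a) = ϕ(s,a)^⊤w with w drawn from an ϵ-cover of the ball ∥w∥₂ ≤ 2H√d, and the greedy policy of the nearest cover point is near-optimal.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_states, n_actions, H = 2, 6, 4, 3
phi = rng.normal(size=(n_states, n_actions, d))
phi /= np.linalg.norm(phi, axis=-1, keepdims=True)   # normalize so ||phi(s,a)|| = 1

eps, R = 0.5, 2 * H * np.sqrt(d)
# Axis-wise grid with spacing eps/sqrt(d) gives an l2 eps-cover of the ball.
grid = np.arange(-R, R + 1e-9, eps / np.sqrt(d))
W = np.array([[x, y] for x in grid for y in grid if x * x + y * y <= R * R])

w_star = rng.normal(size=d)                           # "true" optimal Q-weights
w_hat = W[np.argmin(np.linalg.norm(W - w_star, axis=1))]  # nearest cover point

# The greedy policy of the covered Q is near-greedy for the true Q:
# per-state loss is at most 2 * max_{s,a} |Q* - Qhat| <= 2 ||w* - w_hat||.
pi_hat = np.argmax(phi @ w_hat, axis=1)
gap = (phi @ w_star).max(axis=1) - (phi @ w_star)[np.arange(n_states), pi_hat]
print(gap.max())  # bounded by 2 * eps
```

This is exactly the mechanism behind Lemma C.5: the per-state greedy loss ϵ/H there follows from choosing the cover accuracy ϵ/(2H).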

C.3 CONSTRUCTION OF EXPLORATIVE POLICIES

Given the feature map ϕ(·,·) and the condition that for any h ∈ [H], sup_π λ_min(E_π ϕ_h ϕ_h^⊤) ≥ λ⋆ (where π ranges over all policies), we construct a finite policy set Π^exp that covers explorative policies under any feasible linear MDP. Such exploratory behavior is formalized as: for any linear MDP and any h ∈ [H], there exists some policy π ∈ ∆(Π^exp) such that λ_min(E_π ϕ_h ϕ_h^⊤) is large enough. We begin with the construction. Construction of Π^exp. Given ϵ > 0, consider all reward functions that can be represented as

r(s,a) = ϕ(s,a)^⊤ (I + Σ)^{-1} ϕ(s,a),   (10)

where Σ is positive semi-definite. According to Lemma D.6 of Jin et al. (2020b), we can construct an ϵ/(2H)-cover R_ϵ of all such reward functions, with log|R_ϵ| ≤ d² log(1 + 32H²√d/ϵ²).

For all h ∈ [H], denote Π¹_{h,ϵ} = {π(s) = argmax_{a∈A} r(s,a) : r ∈ R_ϵ}, with ties broken arbitrarily. Meanwhile, denote the policy set Π_h (w.r.t. ϵ) from Section C.2 by Π²_{h,ϵ}. Finally, let Π_{h,ϵ} = Π¹_{h,ϵ} ∪ Π²_{h,ϵ} be the policy set for layer h. The whole policy set is the product of these H per-layer sets: Π^exp_ϵ = Π_{1,ϵ} × ··· × Π_{H,ϵ}.

Lemma C.7. For any ϵ > 0, we have Π^eval_ϵ ⊆ Π^exp_ϵ. In addition, log|Π_{h,ϵ}| ≤ 2d² log(1 + 32H²√d/ϵ²). For any reward r of the form (10) and any h ∈ [H], there exists a policy π ∈ Π^exp_ϵ such that E_π r(s_h,a_h) ≥ sup_π E_π r(s_h,a_h) − ϵ.

Proof of Lemma C.7. The inclusion Π^eval_ϵ ⊆ Π^exp_ϵ holds by construction: Π_{h,ϵ} = Π¹_{h,ϵ} ∪ Π²_{h,ϵ}. In addition, log|Π_{h,ϵ}| ≤ log|Π¹_{h,ϵ}| + log|Π²_{h,ϵ}| ≤ log|R_ϵ| + d log(1 + 8H²√d/ϵ) ≤ 2d² log(1 + 32H²√d/ϵ²). Consider the optimal Q-function under reward function r(s_h,a_h) (the reward is 0 at all other layers). We have Q⋆_h(s,a) = r(s,a), and for i ≤ h−1,

Q⋆_i(s,a) = 0 + Σ_{s'∈S} ⟨ϕ(s,a), μ_i(s')⟩ V⋆_{i+1}(s') = ⟨ϕ(s,a), Σ_{s'∈S} μ_i(s') V⋆_{i+1}(s')⟩ = ⟨ϕ(s,a), w⋆_i⟩,

for some w⋆_i ∈ R^d with ∥w⋆_i∥₂ ≤ 2√d. The first equation follows from the Bellman equation and our design of the reward function. Since Q⋆_h is covered by R_ϵ up to accuracy ϵ/(2H) while Q⋆_i (i ≤ h−1) is covered by Q from Section C.2 up to accuracy ϵ/(2H), the last conclusion follows by a proof identical to that of Lemma C.5.

Lemma C.8. Assume sup_π λ_min(E_π ϕ_h ϕ_h^⊤) ≥ λ⋆. If ϵ ≤ λ⋆/4, then sup_{π∈∆(Π^exp_ϵ)} λ_min(E_π ϕ_h ϕ_h^⊤) ≥ (λ⋆)²/(64d log(1/λ⋆)).

Proof of Lemma C.8. Fix t = 64d log(1/λ⋆)/(λ⋆)². We construct the following policies: π₁ is an arbitrary policy in Π^exp_ϵ. For any i ∈ [t], let Σ_i = Σ_{j=1}^i E_{π_j} ϕ_h ϕ_h^⊤ and r_i(s,a) = ϕ(s,a)^⊤(I + Σ_i)^{-1}ϕ(s,a). Due to Lemma C.7, there exists a policy π_{i+1} ∈ Π^exp_ϵ such that E_{π_{i+1}} r_i(s_h,a_h) ≥ sup_π E_π r_i(s_h,a_h) − ϵ.
The following chain of inequalities holds:

Σ_{i=1}^t E_{π_i}[ϕ_h^⊤(I + Σ_{i−1})^{-1}ϕ_h] = Σ_{i=1}^t Tr( E_{π_i}[ϕ_h ϕ_h^⊤] (I + Σ_{i−1})^{-1} ) ≤ 2d log(1 + t/d),

where the inequality holds due to Lemma C.4. Since (I + Σ_{t−1})^{-1} ≼ (I + Σ_{i−1})^{-1} for all i ≤ t and each π_i is ϵ-greedy with respect to r_{i−1}, we have

sup_π E_π[ϕ_h^⊤(I + Σ_{t−1})^{-1}ϕ_h] ≤ (2d log(1 + t/d))/t + ϵ ≤ λ⋆/2,

because of our choice of ϵ ≤ λ⋆/4 and t = 64d log(1/λ⋆)/(λ⋆)². According to Lemma E.14 of Huang et al. (2022), this implies λ_min(Σ_{t−1}) ≥ 1. Finally, choose π̃ = unif({π_i}_{i∈[t−1]}); then π̃ ∈ ∆(Π^exp_ϵ) and λ_min(E_π̃ ϕ_h ϕ_h^⊤) ≥ (λ⋆)²/(64d log(1/λ⋆)).

C.4

We compare the relationship between the different policy sets in Table 2 above. In summary, given the feature map ϕ(·,·) of a linear MDP and any accuracy ϵ, we can construct a policy set Π^eval which satisfies log|Π^eval_h| = O(d). At the same time, for any linear MDP, the policy set Π^eval is guaranteed to contain a near-optimal policy; therefore, it suffices to estimate the value functions of all policies in Π^eval accurately. Similarly, given the feature map ϕ(·,·) and some ϵ small enough compared to λ⋆, we can construct a policy set Π^exp which satisfies log|Π^exp_h| = O(d²). At the same time, for any linear MDP, the policy set Π^exp is guaranteed to contain explorative policies for all layers, which means it suffices to do exploration using only policies from Π^exp.
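The construction in the proof of Lemma C.8 can be mimicked numerically (synthetic toy instance; the selection below is an exact argmax over the policy set rather than the ϵ-approximate choice the proof allows): repeatedly pick the policy with the largest expected uncertainty E_π[ϕ^⊤(I + Σ_i)^{-1}ϕ], then mix the selected policies uniformly.

```python
import numpy as np

rng = np.random.default_rng(4)
Phi = rng.normal(size=(15, 3))              # candidate feature vectors, d = 3
Pi = rng.dirichlet(np.ones(15), size=8)     # 8 stochastic policies over the features

def second_moment(p):
    return (Phi.T * p) @ Phi                # E_pi[phi phi^T] for action distribution p

t = 40
Sigma = np.zeros((3, 3))
chosen = []
for _ in range(t):
    Vinv = np.linalg.inv(np.eye(3) + Sigma)
    quad = np.einsum('ij,jk,ik->i', Phi, Vinv, Phi)
    i = int(np.argmax(Pi @ quad))           # greedy w.r.t. the current uncertainty
    chosen.append(i)
    Sigma += second_moment(Pi[i])
mix = np.bincount(chosen, minlength=8) / t  # uniform mixture over the selections
lam_min = np.linalg.eigvalsh(second_moment(mix @ Pi))[0]
print(lam_min)  # strictly positive: the mixture explores every direction
```

As in the proof, the elliptical-potential argument guarantees that the greedy uncertainty decays, so the uniform mixture ends up with a lower-bounded minimum eigenvalue.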

D ESTIMATION OF VALUE FUNCTIONS

According to the construction of Π eval in Section C.2 and Lemma C.5, it suffices to estimate the value functions of policies in Π eval . In this section, we design an algorithm to estimate the value functions of any policy in Π eval given any reward function. Recall that for accuracy ϵ 0 , we denote the policy set constructed in Section C.2 by Π eval ϵ0 and the policy set for layer h is denoted by Π eval ϵ0,h .

D.1 THE ALGORITHM

Algorithm 3 Estimation of V^π(r) given exploration data (EstimateV)
1: Input: policy to evaluate π ∈ Π^eval_{ϵ0}; linear reward function r = {r_h}_{h∈[H]} bounded in [0,1]; exploration data {s^n_h, a^n_h}_{(h,n)∈[H]×[N]}; initial state s₁.
2: Initialization: Q_{H+1}(·,·) ← 0, V_{H+1}(·) ← 0.
3: for h = H, H−1, ..., 1 do
4:   Λ_h ← I + Σ_{n=1}^N ϕ(s^n_h, a^n_h) ϕ(s^n_h, a^n_h)^⊤.
5:   ŵ_h ← (Λ_h)^{-1} Σ_{n=1}^N ϕ(s^n_h, a^n_h) V_{h+1}(s^n_{h+1}).
6:   Q_h(·,·) ← (ϕ(·,·)^⊤ ŵ_h + r_h(·,·))_{[0,H]}.
7:   V_h(·) ← Q_h(·, π_h(·)).
8: end for
9: Output: V₁(s₁).

Algorithm 3 takes a policy π from Π^eval_{ϵ0} and a linear reward function r as input, and uses LSVI to estimate the value function of the given policy under the given reward. From layer H down to layer 1, we compute Λ_h and ŵ_h to estimate Q^π_h in line 6. In addition, according to our construction in Section C.2, all policies in Π^eval_{ϵ0} are deterministic, so line 7 approximates V^π_h. Algorithm 3 looks similar to Algorithm 2 of Wang et al. (2020), but there are two key differences. First, Algorithm 2 of Wang et al. (2020) aims to find a near-optimal policy for each reward function, while we perform policy evaluation for each reward and policy. In addition, unlike their approach, we do not use optimism, which means we do not need to cover the bonus term. This is the main reason why we can save a factor of √d.
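A minimal numeric sketch of Algorithm 3, assuming a toy linear MDP with one-hot features over (s,a) (so the linear MDP condition holds trivially); all sizes, seeds, and the uniform exploration policy are illustrative choices, not part of the algorithm:

```python
import numpy as np

rng = np.random.default_rng(5)
S, A, H, N = 3, 2, 4, 4000
d = S * A
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] -> next-state distribution
r = rng.uniform(0, 1.0 / H, size=(H, S, A))     # rewards, so values stay in [0, 1]
pi = rng.integers(A, size=(H, S))               # deterministic policy to evaluate

def feat(s, a):
    x = np.zeros(d); x[s * A + a] = 1.0; return x

# Exact V^pi by backward dynamic programming, for comparison.
V_exact = np.zeros(S)
for h in reversed(range(H)):
    Q = r[h] + P[h] @ V_exact
    V_exact = Q[np.arange(S), pi[h]]

# Collect exploration data with uniformly random actions.
data = [[] for _ in range(H)]                   # data[h] = list of (s, a, s')
for _ in range(N):
    s = 0
    for h in range(H):
        a = int(rng.integers(A))
        s2 = int(rng.choice(S, p=P[h, s, a]))
        data[h].append((s, a, s2)); s = s2

# LSVI backward pass (lines 3-8 of Algorithm 3).
V_next = np.zeros(S)
for h in reversed(range(H)):
    F = np.array([feat(s, a) for s, a, _ in data[h]])
    y = np.array([V_next[s2] for _, _, s2 in data[h]])
    Lam = np.eye(d) + F.T @ F                   # line 4
    w = np.linalg.solve(Lam, F.T @ y)           # line 5
    Q_hat = np.clip(np.array([[feat(s, a) @ w for a in range(A)] for s in range(S)])
                    + r[h], 0, H)               # line 6
    V_next = Q_hat[np.arange(S), pi[h]]         # line 7
print(V_next[0], V_exact[0])                    # the LSVI estimate tracks V^pi(s_1)
```

With one-hot features the regression in line 5 reduces to per-(s,a) empirical averages of V_{h+1}(s'), which is why the estimate converges to the true value as N grows.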

D.2 TECHNICAL LEMMAS

Lemma D.1 (Lemma D.4 of Jin et al. (2020b)). Let {x_τ}_{τ=1}^∞ be a stochastic process on state space S with corresponding filtration {F_τ}_{τ=0}^∞. Let {ϕ_τ}_{τ=1}^∞ be an R^d-valued stochastic process where ϕ_τ ∈ F_{τ−1} and ∥ϕ_τ∥ ≤ 1. Let Λ_k = I + Σ_{τ=1}^k ϕ_τ ϕ_τ^⊤. Then for any δ > 0, with probability at least 1−δ, for all k ≥ 0 and any V ∈ V such that sup_x |V(x)| ≤ H, we have

∥ Σ_{τ=1}^k ϕ_τ { V(x_τ) − E[V(x_τ)|F_{τ−1}] } ∥²_{Λ_k^{-1}} ≤ 4H² [ (d/2) log(k+1) + log(N_ϵ/δ) ] + 8k²ϵ²,

where N_ϵ is the ϵ-covering number of V with respect to the distance dist(V, V') = sup_x |V(x) − V'(x)|.

Lemma D.2. The ŵ_h in line 5 of Algorithm 3 is always bounded: ∥ŵ_h∥₂ ≤ H√(dN).

Proof of Lemma D.2. For any θ ∈ R^d with ∥θ∥₂ = 1, we have

|θ^⊤ŵ_h| = | θ^⊤(Λ_h)^{-1} Σ_{n=1}^N ϕ(s^n_h,a^n_h) V_{h+1}(s^n_{h+1}) |
≤ Σ_{n=1}^N |θ^⊤(Λ_h)^{-1}ϕ(s^n_h,a^n_h)| · H
≤ H · √( Σ_{n=1}^N θ^⊤(Λ_h)^{-1}θ ) · √( Σ_{n=1}^N ϕ(s^n_h,a^n_h)^⊤(Λ_h)^{-1}ϕ(s^n_h,a^n_h) )
≤ H√(dN).

The second inequality is due to the Cauchy-Schwarz inequality. The last inequality holds according to Lemma D.1 of Jin et al. (2020b).
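The boundedness claim of Lemma D.2 is easy to check empirically for ridge regression with bounded targets (synthetic data, illustrative dimensions):

```python
import numpy as np

# Numerical check of the Lemma D.2 bound ||w_hat||_2 <= H sqrt(d N):
# ridge-regression weights against targets bounded by H cannot blow up.
rng = np.random.default_rng(6)
d, N, H = 4, 500, 5
Phi = rng.normal(size=(N, d))
Phi /= np.maximum(np.linalg.norm(Phi, axis=1, keepdims=True), 1.0)  # ||phi|| <= 1
y = rng.uniform(0, H, size=N)                   # targets V_{h+1}(s') in [0, H]
Lam = np.eye(d) + Phi.T @ Phi                   # regularized Gram matrix
w_hat = np.linalg.solve(Lam, Phi.T @ y)
print(np.linalg.norm(w_hat), H * np.sqrt(d * N))  # norm sits far below the bound
```

In practice the bound is very loose; its only role in the analysis is to control the covering radius of the value-function class in (13).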

D.3 UPPER BOUND OF ESTIMATION ERROR

We first consider the covering number of V_h in Algorithm 3. Every V_h can be written as V_h(·) = (ϕ(·,π_h(·))^⊤(ŵ_h + θ_h))_{[0,H]}, where θ_h is the parameter of r_h (i.e., r_h(s,a) = ⟨ϕ(s,a), θ_h⟩). Note that Π^eval_{ϵ0,h} × W_ϵ (where W_ϵ is an ϵ-cover of B_d(2H√(dN))) provides an ϵ-cover of {V_h}. Therefore, the ϵ-covering number N_ϵ of {V_h} is bounded by

log N_ϵ ≤ log|Π^eval_{ϵ0,h}| + log|W_ϵ| ≤ d log(1 + 8H²√d/ϵ0) + d log(1 + 4H√(dN)/ϵ).   (13)

Now we have the following key lemma.

Lemma D.3. With probability 1−δ, for any policy π ∈ Π^eval_{ϵ0} and any linear reward function r that may appear in Algorithm 3, the {V_h}_{h∈[H]} derived by Algorithm 3 satisfy, for any h ∈ [H],

∥ Σ_{n=1}^N ϕ^n_h [ V_{h+1}(s^n_{h+1}) − Σ_{s'∈S} P_h(s'|s^n_h,a^n_h)V_{h+1}(s') ] ∥_{Λ_h^{-1}} ≤ cH√d · √( log(Hd/(ϵ0δ)) + log(N/δ) ),

for some universal constant c > 0.

Proof of Lemma D.3. The proof is by plugging ϵ = H√d/N into Lemma D.1 and using (13).

Remark D.4. Assuming the final goal is to find an ϵ-optimal policy for all reward functions, we can choose ϵ0 ≥ poly(ϵ) and N ≤ poly(d, H, 1/ϵ).

Lemma D.5. With high probability, for all (s,a) ∈ S×A and all h ∈ [H],

| ϕ(s,a)^⊤ŵ_h − Σ_{s'∈S} P_h(s'|s,a)V_{h+1}(s') | ≤ c'H√d · √( log(Hd/(ϵ0δ)) + log(N/δ) ) · ∥ϕ(s,a)∥_{Λ_h^{-1}},

for some universal constant c' > 0. This part of the proof is similar to the proof of Lemma 3.1 in Wang et al. (2020); for completeness, we state it here.

Proof of Lemma D.5. Since P_h(s'|s,a) = ϕ(s,a)^⊤μ_h(s'), we have Σ_{s'∈S} P_h(s'|s,a)V_{h+1}(s') = ϕ(s,a)^⊤w̄_h for some ∥w̄_h∥₂ ≤ H√d. Therefore,

ϕ(s,a)^⊤ŵ_h − Σ_{s'∈S} P_h(s'|s,a)V_{h+1}(s')
= ϕ(s,a)^⊤(Λ_h)^{-1} Σ_{n=1}^N ϕ^n_h V_{h+1}(s^n_{h+1}) − ϕ(s,a)^⊤w̄_h
= ϕ(s,a)^⊤(Λ_h)^{-1} [ Σ_{n=1}^N ϕ^n_h V_{h+1}(s^n_{h+1}) − Λ_h w̄_h ]
= ϕ(s,a)^⊤(Λ_h)^{-1} [ Σ_{n=1}^N ϕ^n_h V_{h+1}(s^n_{h+1}) − w̄_h − Σ_{n=1}^N ϕ^n_h (ϕ^n_h)^⊤ w̄_h ]
= ϕ(s,a)^⊤(Λ_h)^{-1} [ Σ_{n=1}^N ϕ^n_h ( V_{h+1}(s^n_{h+1}) − Σ_{s'} P_h(s'|s^n_h,a^n_h)V_{h+1}(s') ) − w̄_h ].
It holds that

| ϕ(s,a)^⊤(Λ_h)^{-1} Σ_{n=1}^N ϕ^n_h [ V_{h+1}(s^n_{h+1}) − Σ_{s'} P_h(s'|s^n_h,a^n_h)V_{h+1}(s') ] |
≤ ∥ϕ(s,a)∥_{Λ_h^{-1}} · ∥ Σ_{n=1}^N ϕ^n_h [ V_{h+1}(s^n_{h+1}) − Σ_{s'∈S} P_h(s'|s^n_h,a^n_h)V_{h+1}(s') ] ∥_{Λ_h^{-1}}
≤ cH√d · √( log(Hd/(ϵ0δ)) + log(N/δ) ) · ∥ϕ(s,a)∥_{Λ_h^{-1}},

for some constant c, due to Lemma D.3. In addition, we have |ϕ(s,a)^⊤(Λ_h)^{-1}w̄_h| ≤ ∥ϕ(s,a)∥_{Λ_h^{-1}} · ∥w̄_h∥_{Λ_h^{-1}} ≤ H√d · ∥ϕ(s,a)∥_{Λ_h^{-1}}. Combining these two results, we have

| ϕ(s,a)^⊤ŵ_h − Σ_{s'∈S} P_h(s'|s,a)V_{h+1}(s') | ≤ c'H√d · √( log(Hd/(ϵ0δ)) + log(N/δ) ) · ∥ϕ(s,a)∥_{Λ_h^{-1}}.

Finally, the error bounds of our estimates are summarized in the following lemma.

Lemma D.6. For π ∈ Π^eval_{ϵ0} and linear reward function r, let the output of Algorithm 3 be V̂^π(r). Then with probability 1−δ, for any policy π ∈ Π^eval_{ϵ0} and any linear reward function r, it holds that

| V̂^π(r) − V^π(r) | ≤ c'H√d · √( log(Hd/(ϵ0δ)) + log(N/δ) ) · E_π[ Σ_{h=1}^H ∥ϕ(s_h,a_h)∥_{Λ_h^{-1}} ],

for some universal constant c' > 0.

Proof of Lemma D.6. For any policy π ∈ Π^eval_{ϵ0} and any linear reward function r, consider the V_h functions and ŵ_h in Algorithm 3. We have

| V₁(s₁) − V^π₁(s₁) |
≤ E_π| ϕ(s₁,a₁)^⊤ŵ₁ + r₁(s₁,a₁) − Σ_{s'∈S} P₁(s'|s₁,a₁)V^π₂(s') − r₁(s₁,a₁) |
≤ E_π| ϕ(s₁,a₁)^⊤ŵ₁ − Σ_{s'∈S} P₁(s'|s₁,a₁)V₂(s') | + E_π[ Σ_{s'∈S} P₁(s'|s₁,a₁) |V₂(s') − V^π₂(s')| ]
≤ E_π[ c'H√d · √( log(Hd/(ϵ0δ)) + log(N/δ) ) · ∥ϕ(s₁,a₁)∥_{Λ₁^{-1}} ] + E_π| V₂(s₂) − V^π₂(s₂) |
≤ ··· ≤ c'H√d · √( log(Hd/(ϵ0δ)) + log(N/δ) ) · E_π[ Σ_{h=1}^H ∥ϕ(s_h,a_h)∥_{Λ_h^{-1}} ],

where the first inequality results from the fact that V^π₁(s₁) ∈ [0,H], the third inequality comes from Lemma D.5, and the fourth follows from recursive application of the decomposition.

Remark D.7. Compared to the analyses in Wang et al. (2020) and Huang et al. (2022), our analysis saves a factor of √d.
This is achieved by discretizing the policy set and bypassing the need to cover the quadratic bonus term. More specifically, the log-covering number of our Π^eval_h is O(d); combined with the covering set of the Euclidean ball in R^d, the total log-covering number is still O(d). In contrast, both previous works need to cover a bonus of the form ϕ(·,·)^⊤(Λ)^{-1}ϕ(·,·), which requires a log-covering number of O(d²).

E GENERALIZED ALGORITHMS FOR ESTIMATING VALUE FUNCTIONS

Since Π exp we construct in Section C.3 is guaranteed to cover explorative policies under any feasible linear MDP, it suffices to do exploration using only policies from Π exp . In this section, we generalize the algorithm we propose in Section D for our purpose during exploration phase. To be more specific, we design an algorithm to estimate E π r(s h , a h ) for any policy π ∈ Π exp and any reward r. Recall that given accuracy ϵ 1 , the policy set we construct in Section C.3 is Π exp ϵ1 and the policy set for layer h is Π exp ϵ1,h .

E.1 THE ALGORITHM

Algorithm 4 Estimation of E_π r(s_h̄, a_h̄) given exploration data (EstimateER)
1: Input: policy to evaluate π ∈ Π^exp_{ϵ1}; reward function r(s,a) and its uniform upper bound A; layer h̄; exploration data {s^n_h, a^n_h}_{(h,n)∈[H]×[N]}; initial state s₁.
2: Initialization: Q_h̄(·,·) ← r(·,·), V_h̄(·) ← Q_h̄(·, π_h̄(·)).
3: for h = h̄−1, h̄−2, ..., 1 do
4:   Λ_h ← I + Σ_{n=1}^N ϕ(s^n_h, a^n_h) ϕ(s^n_h, a^n_h)^⊤.
5:   w̃_h ← (Λ_h)^{-1} Σ_{n=1}^N ϕ(s^n_h, a^n_h) V_{h+1}(s^n_{h+1}).
6:   Q_h(·,·) ← (ϕ(·,·)^⊤ w̃_h)_{[0,A]}.
7:   V_h(·) ← Q_h(·, π_h(·)).
8: end for
9: Output: V₁(s₁).

Algorithm 4 applies LSVI to estimate E_π r(s_h̄, a_h̄) for any π ∈ Π^exp_{ϵ1} (according to our construction, all possible π are deterministic), any reward function r, and any time step h̄. Note that the algorithm takes as input the uniform upper bound A of all possible reward functions (i.e., for any reward function r that may appear as input, r ∈ [0,A]) and uses the value of A to truncate the Q-function in line 6. Algorithm 4 looks similar to Algorithm 3, but there are two key differences. First, the reward function is non-zero at only one layer in Algorithm 4, while the reward function in Algorithm 3 can be any valid reward function. In addition, Algorithm 4 takes the upper bound of the reward function as input and uses this value to bound the Q-functions, while Algorithm 3 uses H as the upper bound.

E.2 TECHNICAL LEMMAS

Lemma E.1 (Generalization of Lemma D.4 of Jin et al. (2020b)). Let {x_τ}_{τ=1}^∞ be a stochastic process on state space S with corresponding filtration {F_τ}_{τ=0}^∞. Let {ϕ_τ}_{τ=1}^∞ be an R^d-valued stochastic process where ϕ_τ ∈ F_{τ−1} and ∥ϕ_τ∥ ≤ 1. Let Λ_k = I + Σ_{τ=1}^k ϕ_τ ϕ_τ^⊤. Then for any δ > 0, with probability at least 1−δ, for all k ≥ 0 and any V ∈ V such that sup_x |V(x)| ≤ A, we have

∥ Σ_{τ=1}^k ϕ_τ { V(x_τ) − E[V(x_τ)|F_{τ−1}] } ∥²_{Λ_k^{-1}} ≤ 4A² [ (d/2) log(k+1) + log(N_ϵ/δ) ] + 8k²ϵ²,

where N_ϵ is the ϵ-covering number of V with respect to the distance dist(V, V') = sup_x |V(x) − V'(x)|.

Lemma E.2. If A ≤ 1, the w̃_h in line 5 of Algorithm 4 is always bounded: ∥w̃_h∥₂ ≤ √(dN).

Proof of Lemma E.2. The proof is almost identical to that of Lemma D.2; the only difference is that H is replaced by 1.

E.3 UPPER BOUND OF ESTIMATION ERROR

We first consider the covering number of all possible $V_h$ in Algorithm 4. In the remaining part of this section, we assume that the set of all reward functions to be estimated is $\mathcal R$ with uniform upper bound $A_{\mathcal R} \le 1$.

Lemma E.5. For any policy $\pi \in \Pi^{\mathrm{exp}}_{\epsilon_1}$, any reward function $r \in \mathcal R$ and any layer $\bar h$, let the output of Algorithm 4 be $\hat{\mathbb E}_\pi r(s_{\bar h}, a_{\bar h})$. Then with probability $1-\delta$, for any policy $\pi \in \Pi^{\mathrm{exp}}_{\epsilon_1}$, any reward function $r \in \mathcal R$ and any layer $\bar h$, it holds that
$$\big|\hat{\mathbb E}_\pi r(s_{\bar h}, a_{\bar h}) - \mathbb E_\pi r(s_{\bar h}, a_{\bar h})\big| \le c' A_{\mathcal R}\sqrt{d^2\log\big(\tfrac{Hd}{\epsilon_1\delta}\big) + d\log\big(\tfrac N\delta\big) + B_{A_{\mathcal R}/N}}\cdot \mathbb E_\pi\sum_{h=1}^{\bar h-1}\|\phi(s_h,a_h)\|_{\Lambda_h^{-1}} \qquad (23)$$
for some universal constant $c' > 0$.

Proof of Lemma E.5. For any policy $\pi \in \Pi^{\mathrm{exp}}_{\epsilon_1}$, any reward function $r \in \mathcal R$ and any layer $\bar h$, consider the $\{V_h\}_{h\in[\bar h]}$ functions and $\{\hat w_h\}_{h\in[\bar h-1]}$ in Algorithm 4; we have $\hat{\mathbb E}_\pi r(s_{\bar h}, a_{\bar h}) = V_1(s_1)$. Besides, we abuse notation and let $r$ also denote the reward function with $r_{h'}(s,a) = \mathbb 1(h'=\bar h)\,r(s,a)$; letting the value function under this $r$ be $V^\pi_h(s)$, we have $V^\pi_1(s_1) = \mathbb E_\pi r(s_{\bar h}, a_{\bar h})$. It holds that
$$\big|\hat{\mathbb E}_\pi r(s_{\bar h}, a_{\bar h}) - \mathbb E_\pi r(s_{\bar h}, a_{\bar h})\big| = |V_1(s_1) - V^\pi_1(s_1)|$$
$$\le \mathbb E_\pi\Big|\phi(s_1,a_1)^\top \hat w_1 - \sum_{s'\in\mathcal S} P_1(s'|s_1,a_1)V^\pi_2(s')\Big|$$
$$\le \mathbb E_\pi\Big|\phi(s_1,a_1)^\top \hat w_1 - \sum_{s'\in\mathcal S} P_1(s'|s_1,a_1)V_2(s')\Big| + \mathbb E_\pi\sum_{s'\in\mathcal S} P_1(s'|s_1,a_1)\,\big|V_2(s') - V^\pi_2(s')\big|$$
$$\le \mathbb E_\pi\Big[c' A_{\mathcal R}\sqrt{d^2\log\big(\tfrac{Hd}{\epsilon_1\delta}\big) + d\log\big(\tfrac N\delta\big) + B_{A_{\mathcal R}/N}}\cdot\|\phi(s_1,a_1)\|_{\Lambda_1^{-1}}\Big] + \mathbb E_\pi\big|V_2(s_2) - V^\pi_2(s_2)\big|$$
$$\le \cdots \le c' A_{\mathcal R}\sqrt{d^2\log\big(\tfrac{Hd}{\epsilon_1\delta}\big) + d\log\big(\tfrac N\delta\big) + B_{A_{\mathcal R}/N}}\cdot\mathbb E_\pi\sum_{h=1}^{\bar h-1}\|\phi(s_h,a_h)\|_{\Lambda_h^{-1}} + \mathbb E_\pi\big|V_{\bar h}(s_{\bar h}) - V^\pi_{\bar h}(s_{\bar h})\big|$$
$$= c' A_{\mathcal R}\sqrt{d^2\log\big(\tfrac{Hd}{\epsilon_1\delta}\big) + d\log\big(\tfrac N\delta\big) + B_{A_{\mathcal R}/N}}\cdot\mathbb E_\pi\sum_{h=1}^{\bar h-1}\|\phi(s_h,a_h)\|_{\Lambda_h^{-1}},$$
where the first inequality results from the fact that $V^\pi_1(s_1) \in [0, A_{\mathcal R}]$. The third inequality comes from Lemma E.4. The subsequent inequalities follow from recursive application of the same decomposition. The last equation holds since $V_{\bar h}(\cdot) = V^\pi_{\bar h}(\cdot) = r(\cdot, \pi_{\bar h}(\cdot))$.

Remark E.6. From Lemma E.5, we see that the estimation error at layer $\bar h$ can be bounded by the summation of the uncertainty from the previous layers, with an additional factor of $\tilde O(Ad)$.
Therefore, if the uncertainty of all previous layers is small with respect to $\Pi^{\mathrm{exp}}$, we can estimate $\mathbb E_\pi r_{\bar h}$ accurately for any $\pi \in \Pi^{\mathrm{exp}}$ and any reward $r$ from a large set of reward functions.

Remark E.7. Note that we only need to estimate $\mathbb E_\pi r(s_{\bar h}, a_{\bar h})$ accurately for $\pi \in \Pi^{\mathrm{exp}}$. For $\pi \in \Delta(\Pi^{\mathrm{exp}})$, if $\pi$ takes policy $\pi_i \in \Pi^{\mathrm{exp}}$ with probability $p_i$ (for $i \in [k]$), then we define
$$\hat{\mathbb E}_\pi r(s_{\bar h}, a_{\bar h}) := \sum_{i\in[k]} p_i\cdot\hat{\mathbb E}_{\pi_i} r(s_{\bar h}, a_{\bar h}), \qquad (25)$$
where $\hat{\mathbb E}_\pi r(s_{\bar h}, a_{\bar h})$ is the estimate we acquire w.r.t. policy $\pi$ and $\hat{\mathbb E}_{\pi_i} r(s_{\bar h}, a_{\bar h})$ is the output of Algorithm 4 with input $\pi_i \in \Pi^{\mathrm{exp}}$. Assume that for all $\pi \in \Pi^{\mathrm{exp}}$, $|\hat{\mathbb E}_\pi r - \mathbb E_\pi r| \le e$; then for all $\pi \in \Delta(\Pi^{\mathrm{exp}})$, $|\hat{\mathbb E}_\pi r - \mathbb E_\pi r| \le \sum_i p_i|\hat{\mathbb E}_{\pi_i} r - \mathbb E_{\pi_i} r| \le e$. Therefore, the conclusion of Lemma E.5 naturally extends to $\pi \in \Delta(\Pi^{\mathrm{exp}})$.

F PROOF OF THEOREM 5.1

Recall that $\iota = \log(dH/\epsilon\delta)$ and $\varepsilon = \frac{C_1\epsilon}{H^2\sqrt d\,\iota}$. The number of episodes for each deployment is $N = \frac{C_2 d\iota}{\varepsilon^2} = \frac{C_2 d^2H^4\iota^3}{C_1^2\epsilon^2}$. In addition, $\Sigma_\pi$ is short for $\mathbb E_\pi[\phi_{\bar h}\phi_{\bar h}^\top]$ while $\hat\Sigma_\pi$ is short for $\hat{\mathbb E}_\pi[\phi_{\bar h}\phi_{\bar h}^\top]$. For clarity, we restrict our choice to $0 < C_1 < 1$ and $C_2, C_3 > 1$. We begin with a detailed explanation of $\hat\Sigma_\pi$ and $\hat{\mathbb E}_\pi\big[\phi(s_{\bar h},a_{\bar h})^\top(N\cdot\hat\Sigma_\pi)^{-1}\phi(s_{\bar h},a_{\bar h})\big]$ from (1).

F.1 DETAILED EXPLANATION

First of all, as pointed out in Algorithm 1, $\hat\Sigma_\pi$ is short for $\hat{\mathbb E}_\pi[\phi(s_{\bar h},a_{\bar h})\phi(s_{\bar h},a_{\bar h})^\top]$. Assume the feature map is $\phi(s,a) = (\phi_1(s,a), \phi_2(s,a), \ldots, \phi_d(s,a))^\top$, where $\phi_i(s,a) \in \mathbb R$. The estimate of the covariance matrix is computed entrywise. For each coordinate $(i,j) \in [d]\times[d]$, we use Algorithm 4 to estimate $\mathbb E_\pi r(s_{\bar h},a_{\bar h}) = \mathbb E_\pi\big[\frac{\phi_i(s_{\bar h},a_{\bar h})\phi_j(s_{\bar h},a_{\bar h})+1}{2}\big]$. More specifically, for any $\pi \in \Pi^{\mathrm{exp}}_{\epsilon/3}$, $\hat\Sigma_{\pi,(ij)} = 2\hat E_{ij} - 1$; for $\pi \in \Delta(\Pi^{\mathrm{exp}}_{\epsilon/3})$, the estimate is derived via (25) in Remark E.7. In the discussion below, we only need to bound $\|\hat{\mathbb E}_\pi\phi_{\bar h}\phi_{\bar h}^\top - \mathbb E_\pi\phi_{\bar h}\phi_{\bar h}^\top\|_2$ for all $\pi \in \Pi^{\mathrm{exp}}_{\epsilon/3}$, and the same bound applies to all $\pi \in \Delta(\Pi^{\mathrm{exp}}_{\epsilon/3})$.
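The entrywise affine trick of Section F.1 can be sketched directly: map each entry $\phi_i\phi_j \in [-1,1]$ into a valid reward in $[0,1]$, estimate its mean, and invert the map. Here `estimate_mean` is a hypothetical stand-in for the output of Algorithm 4.

```python
# Sketch of the entrywise covariance estimation: r_ij = (phi_i*phi_j + 1)/2 is a
# reward in [0, 1]; the entry is recovered via Sigma_ij = 2*E_ij - 1.
# `estimate_mean` is a hypothetical stand-in for Algorithm 4's output.
import numpy as np

def estimate_covariance(features, estimate_mean):
    """features: (N, d) array of phi(s_h, a_h) samples with ||phi|| <= 1."""
    d = features.shape[1]
    Sigma = np.zeros((d, d))
    for i in range(d):
        for j in range(i, d):
            r = (features[:, i] * features[:, j] + 1.0) / 2.0  # reward in [0,1]
            E_ij = estimate_mean(r)                            # ~ E_pi r(s_h, a_h)
            Sigma[i, j] = Sigma[j, i] = 2.0 * E_ij - 1.0       # invert the map
    return Sigma  # symmetric by construction, as noted in the text
```

Plugging in the exact empirical mean as the estimator recovers the empirical covariance exactly, which confirms the transform is lossless.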
The second estimator is $\hat{\mathbb E}_\pi\big[\phi(s_{\bar h},a_{\bar h})^\top(N\cdot\hat\Sigma_\pi)^{-1}\phi(s_{\bar h},a_{\bar h})\big]$, which is calculated by directly applying Algorithm 4 with input $\pi \in \Pi^{\mathrm{exp}}_{\epsilon/3}$, reward $r(s,a) = \phi(s,a)^\top(N\cdot\hat\Sigma_\pi)^{-1}\phi(s,a)$ and upper bound $A = \frac{\varepsilon}{C_2 d^3H\iota^2} = \frac{C_1\epsilon}{C_2 d^{7/2}H^3\iota^3}$,

F.2 TECHNICAL LEMMAS

In this part, we state some technical lemmas.

Lemma F.1 (Lemma H.4 of Min et al. (2021)). Let $\phi: \mathcal S\times\mathcal A \to \mathbb R^d$ satisfy $\|\phi(s,a)\| \le C$ for all $(s,a) \in \mathcal S\times\mathcal A$. For any $K > 0$, $\lambda > 0$, define $\bar G_K = \sum_{k=1}^K \phi(s_k,a_k)\phi(s_k,a_k)^\top + \lambda I_d$, where the $(s_k,a_k)$'s are i.i.d. samples from some distribution $\nu$. Then with probability $1-\delta$,
$$\Big\|\frac{\bar G_K}{K} - \mathbb E_\nu\Big[\frac{\bar G_K}{K}\Big]\Big\|_2 \le \frac{4\sqrt 2\,C^2}{\sqrt K}\Big(\log\frac{2d}{\delta}\Big)^{1/2}.$$

Lemma F.2 (Corollary of Lemma D.6). There exists a universal constant $c_D > 0$ such that, with our choice of $\epsilon_0 = \epsilon/3$ and $N = \frac{C_2 d^2H^4\iota^3}{C_1^2\epsilon^2}$, the multiplicative factor of (16) satisfies
$$c' H\sqrt d\cdot\sqrt{\log\big(\tfrac{Hd}{\epsilon_0\delta}\big) + \log\big(\tfrac N\delta\big)} \le c_D\, H\sqrt d\cdot\log\big(\tfrac{C_2 dH}{C_1\epsilon\delta}\big).$$

Proof of Lemma F.2. The existence of the universal constant $c_D$ follows by direct calculation, since $c'$ in (16) is a universal constant.

Lemma F.3 (Covering number). Consider the set of possible rewards $\mathcal R = \big\{r(s,a) = \phi(s,a)^\top\Sigma^{-1}\phi(s,a)\,\big|\,\lambda_{\min}(\Sigma) \ge \frac{C_2 d^{7/2}H^3\iota^3}{C_1\epsilon}\big\}$. Let $A_{\mathcal R} = \frac{C_1\epsilon}{C_2 d^{7/2}H^3\iota^3}$ and $N = \frac{C_2 d^2H^4\iota^3}{C_1^2\epsilon^2}$. Then the $\frac{A_{\mathcal R}}{N}$-cover $\widetilde{\mathcal R}_{A_{\mathcal R}/N}$ of $\mathcal R$ satisfies, for some universal constant $c_F > 0$,
$$B_{A_{\mathcal R}/N} = \log\big|\widetilde{\mathcal R}_{A_{\mathcal R}/N}\big| \le c_F\, d^2\log\big(\tfrac{C_2 dH}{C_1\epsilon}\big).$$

Proof of Lemma F.3. The conclusion follows from Lemma D.6 of Jin et al. (2020b) and direct calculation.

Lemma F.4 (Corollary of Lemma E.5). There exists a universal constant $c^1_E > 0$ such that, for the first case in Section F.1 with our choice of $\epsilon_1 = \epsilon/3$, $A = 1$, $B = 2\log d$ and $N = \frac{C_2 d^2H^4\iota^3}{C_1^2\epsilon^2}$, the multiplicative factor of (23) satisfies
$$c'\,A_{\mathcal R}\sqrt{d^2\log\big(\tfrac{Hd}{\epsilon_1\delta}\big) + d\log\big(\tfrac N\delta\big) + B_{A_{\mathcal R}/N}} \le c^1_E\cdot d\log\big(\tfrac{C_2 dH}{C_1\epsilon\delta}\big).$$

Proof of Lemma F.4. The existence of the universal constant $c^1_E$ follows by direct calculation, since $c'$ in (23) is a universal constant.

Lemma F.5 (Corollary of Lemma E.5).
There exists a universal constant $c^2_E > 0$ such that, for the second case in Section F.1 with our choice of $\epsilon_1 = \epsilon/3$, $A = \frac{\varepsilon}{C_2 d^3H\iota^2} = \frac{C_1\epsilon}{C_2 d^{7/2}H^3\iota^3}$, $B = c_F d^2\log(\frac{C_2 dH}{C_1\epsilon})$ and $N = \frac{C_2 d^2H^4\iota^3}{C_1^2\epsilon^2}$, the multiplicative factor of (23) satisfies
$$c'\,A_{\mathcal R}\sqrt{d^2\log\big(\tfrac{Hd}{\epsilon_1\delta}\big) + d\log\big(\tfrac N\delta\big) + B_{A_{\mathcal R}/N}} \le c^2_E\cdot\frac{\varepsilon}{C_2 d^2H\iota}\log\big(\tfrac{C_2 dH}{C_1\epsilon\delta}\big).$$

Proof of Lemma F.5. The existence of the universal constant $c^2_E$ follows by direct calculation, since $c'$ in (23) is a universal constant.

Now that we have the universal constants $c_D, c_F, c^1_E, c^2_E$, for notational simplicity we let $c_E = \max\{c^1_E, c^2_E\}$. The conclusions of Lemmas F.4 and F.5 then hold with $c^i_E$ replaced by $c_E$.

F.3 CHOICE OF UNIVERSAL CONSTANTS

In this section, we determine the choice of the universal constants in Algorithm 1 and Theorem 5.1. First, $C_1, C_2$ satisfy $C_1\cdot C_2 = 1$, $0 < C_1 < 1$, and the following conditions:
$$c_D\, H\sqrt d\cdot\log\big(\tfrac{C_2 dH}{C_1\epsilon\delta}\big) \le \frac{1}{3C_1}\,H\sqrt d\,\log\big(\tfrac{dH}{\epsilon\delta}\big), \qquad (31)$$
$$c_E\cdot\frac{\varepsilon}{C_2 d^2H\iota}\log\big(\tfrac{C_2 dH}{C_1\epsilon\delta}\big) \le \frac{\varepsilon}{2d^2H}. \qquad (32)$$
It is clear that when $C_2$ is larger than some universal threshold and $C_1 = \frac{1}{C_2}$, the constants $C_1, C_2$ satisfy these conditions. Next, we choose $C_3$ such that
$$\frac{C_3}{4}\log\big(\tfrac{dH}{\epsilon\delta}\big) \ge c_E\log\big(\tfrac{C_2 dH}{C_1\epsilon\delta}\big), \qquad (33)$$
and set $C_4 = 80\,C_1 C_3$. Since $c_D, c_E, c_F$ are universal constants, our $C_1, C_2, C_3, C_4$ are also universal constants, independent of the parameters $d, H, \epsilon, \delta$.

F.4 RESTATE THEOREM 5.1 AND OUR INDUCTION

Theorem F.6 (Restatement of Theorem 5.1). We run Algorithm 1 to collect data and let Planning($\cdot$) denote the output of Algorithm 2. For the universal constants $C_1, C_2, C_3, C_4$ we chose, for any $\epsilon > 0$ and $\delta > 0$ with $\epsilon < \frac{H(\lambda^\star)^2}{C_4 d^{7/2}\log(1/\lambda^\star)}$, with probability $1-\delta$, for any feasible linear reward function $r$, Planning($r$) returns a policy that is $\epsilon$-optimal with respect to $r$.

Throughout the proof in this section, we assume that the condition $\epsilon < \frac{H(\lambda^\star)^2}{C_4 d^{7/2}\log(1/\lambda^\star)}$ holds. We now state our induction condition.

Condition F.7 (Induction Condition). Suppose that after $\bar h-1$ deployments (i.e., after the exploration of the first $\bar h-1$ layers), the dataset is $\mathcal D_{\bar h-1} = \{s^n_h, a^n_h\}_{(h,n)\in[H]\times[(\bar h-1)N]}$ and $\Lambda^{\bar h-1}_h = I + \sum_{n=1}^{(\bar h-1)N}\phi^n_h(\phi^n_h)^\top$ for all $h\in[H]$. The induction condition is:
$$\max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \mathbb E_\pi\Big[\sum_{h=1}^{\bar h-1}\|\phi(s_h,a_h)\|_{(\Lambda^{\bar h-1}_h)^{-1}}\Big] \le (\bar h-1)\varepsilon.$$
Suppose that after $\bar h$ deployments, the dataset is $\mathcal D_{\bar h} = \{s^n_h, a^n_h\}_{(h,n)\in[H]\times[\bar h N]}$ and $\Lambda^{\bar h}_h = I + \sum_{n=1}^{\bar h N}\phi^n_h(\phi^n_h)^\top$ for all $h\in[H]$. We will prove that, given Condition F.7, with probability at least $1-\delta$ the following induction step holds:
$$\max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \mathbb E_\pi\,\|\phi(s_{\bar h},a_{\bar h})\|_{(\Lambda^{\bar h}_{\bar h})^{-1}} \le \varepsilon. \qquad (35)$$
Note that the induction step (35) naturally implies
$$\max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \mathbb E_\pi\Big[\sum_{h=1}^{\bar h}\|\phi(s_h,a_h)\|_{(\Lambda^{\bar h}_h)^{-1}}\Big] \le \bar h\varepsilon. \qquad (36)$$
Suppose that after the whole exploration process, the dataset is $\mathcal D = \{s^n_h, a^n_h\}_{(h,n)\in[H]\times[HN]}$ and $\Lambda_h = I + \sum_{n=1}^{HN}\phi^n_h(\phi^n_h)^\top$ for all $h\in[H]$. If the induction holds at every step, then with probability $1-H\delta$,
$$\max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \mathbb E_\pi\sum_{h=1}^{H}\|\phi(s_h,a_h)\|_{\Lambda_h^{-1}} \le H\varepsilon.$$
Next we prove this induction step: we assume Condition F.7 holds and prove (35).

F.5 ERROR BOUND OF ESTIMATION

Recall that the policy we apply to explore the $\bar h$-th layer is
$$\hat\pi_{\bar h} = \mathop{\mathrm{argmin}}_{\pi\in\Delta(\Pi^{\mathrm{exp}}_{\epsilon/3})\ \mathrm{s.t.}\ \lambda_{\min}(\hat\Sigma_\pi)\ge C_3 d^2H\varepsilon\iota}\ \max_{\pi'\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \hat{\mathbb E}_{\pi'}\big[\phi(s_{\bar h},a_{\bar h})^\top(N\cdot\hat\Sigma_\pi)^{-1}\phi(s_{\bar h},a_{\bar h})\big],$$
where the detailed definitions of $\hat\Sigma_\pi$ and $\hat{\mathbb E}_{\pi'}\big[\phi(s_{\bar h},a_{\bar h})^\top(N\cdot\hat\Sigma_\pi)^{-1}\phi(s_{\bar h},a_{\bar h})\big]$ are explained in Section F.1. In addition, we define the optimal policy $\tilde\pi^\star_{\bar h}$ for exploring layer $\bar h$:
$$\tilde\pi^\star_{\bar h} = \mathop{\mathrm{argmin}}_{\pi\in\Delta(\Pi^{\mathrm{exp}}_{\epsilon/3})}\ \max_{\pi'\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \mathbb E_{\pi'}\big[\phi(s_{\bar h},a_{\bar h})^\top(N\cdot\Sigma_\pi)^{-1}\phi(s_{\bar h},a_{\bar h})\big],$$
where $\mathbb E_{\pi'}$ denotes the actual expectation; similarly, $\Sigma_\pi$ is short for $\mathbb E_\pi[\phi(s_{\bar h},a_{\bar h})\phi(s_{\bar h},a_{\bar h})^\top]$. According to Lemma C.8, since $\epsilon \le \frac{H(\lambda^\star)^2}{C_4 d^{7/2}\log(1/\lambda^\star)} \le \frac{(\lambda^\star)^4}{15}$, we have
$$\sup_{\pi\in\Delta(\Pi^{\mathrm{exp}}_{\epsilon/3})}\lambda_{\min}\big(\mathbb E_\pi\phi_{\bar h}\phi_{\bar h}^\top\big) \ge \frac{(\lambda^\star)^2}{64\,d\log(1/\lambda^\star)}.$$
Therefore, together with the conclusion of Lemma B.4 and our definition of $\tilde\pi^\star_{\bar h}$, it holds that
$$\lambda_{\min}\big(\mathbb E_{\tilde\pi^\star_{\bar h}}\phi_{\bar h}\phi_{\bar h}^\top\big) \ge \frac{(\lambda^\star)^2}{64\,d^2\log(1/\lambda^\star)}. \qquad (41)$$

Lemma F.8 (Pointwise error). With probability $1-\delta$, for all $\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}$ and all coordinates $(i,j)\in[d]\times[d]$, it holds that
$$\Big|\hat{\mathbb E}_\pi[\phi(s_{\bar h},a_{\bar h})\phi(s_{\bar h},a_{\bar h})^\top]_{(ij)} - \mathbb E_\pi[\phi(s_{\bar h},a_{\bar h})\phi(s_{\bar h},a_{\bar h})^\top]_{(ij)}\Big| \le \frac{C_3\,dH\varepsilon\iota}{4}.$$

Proof of Lemma F.8. We have
$$\mathrm{LHS} \le c'\sqrt{d^2\log\big(\tfrac{3Hd}{\epsilon\delta}\big) + d\log\big(\tfrac N\delta\big) + 2\log d}\cdot\mathbb E_\pi\sum_{h=1}^{\bar h-1}\|\phi(s_h,a_h)\|_{(\Lambda^{\bar h-1}_h)^{-1}} \le c_E\cdot d\log\big(\tfrac{C_2 dH}{C_1\epsilon\delta}\big)\cdot H\varepsilon \le \frac{C_3\,dH\varepsilon\iota}{4}.$$
The first inequality holds because of Lemma E.5. The second results from Lemma F.4 and the induction condition F.7. The last is due to our choice of $C_3$ in (33).
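The constrained min–max design defining $\hat\pi_{\bar h}$ can be sketched as a coarse search over mixture weights. This is only an illustrative stand-in: the per-policy covariances are synthetic, and a random simplex search replaces whatever solver the paper's experiments would use.

```python
# Coarse sketch of the constrained min-max design: among random mixtures over a
# small policy set, keep those with lambda_min(Sigma_mix) above a threshold and
# minimize the worst-case expected uncertainty max_pi E_pi[phi^T (N Sigma_mix)^{-1} phi],
# using E_pi[phi^T M phi] = trace(M E_pi[phi phi^T]). All inputs are synthetic.
import numpy as np

def mixture_design(Sigmas, N, lam_threshold, n_trials=2000, seed=0):
    """Sigmas: list of per-policy covariances E_pi[phi phi^T] (d x d, PSD)."""
    rng = np.random.default_rng(seed)
    k = len(Sigmas)
    best_p, best_val = None, np.inf
    for _ in range(n_trials):
        p = rng.dirichlet(np.ones(k))                    # random mixture weights
        Sigma_mix = sum(pi * S for pi, S in zip(p, Sigmas))
        if np.linalg.eigvalsh(Sigma_mix)[0] < lam_threshold:
            continue                                     # infeasible mixture
        inv = np.linalg.inv(N * Sigma_mix)
        val = max(np.trace(inv @ S) for S in Sigmas)     # worst-case uncertainty
        if val < best_val:
            best_p, best_val = p, val
    return best_p, best_val
```

The $\lambda_{\min}$ constraint plays the same role as the feasibility condition $\lambda_{\min}(\hat\Sigma_\pi) \ge C_3 d^2 H\varepsilon\iota$ above: it guarantees the inverse covariance, and hence the reward used by the second estimator, stays bounded.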

Now we can bound

$\big\|\hat{\mathbb E}_\pi[\phi(s_{\bar h},a_{\bar h})\phi(s_{\bar h},a_{\bar h})^\top] - \mathbb E_\pi[\phi(s_{\bar h},a_{\bar h})\phi(s_{\bar h},a_{\bar h})^\top]\big\|_2$ by the following lemma.

Lemma F.9 ($\ell_2$-norm bound). With probability $1-\delta$, for all $\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}$, it holds that
$$\big\|\hat{\mathbb E}_\pi[\phi(s_{\bar h},a_{\bar h})\phi(s_{\bar h},a_{\bar h})^\top] - \mathbb E_\pi[\phi(s_{\bar h},a_{\bar h})\phi(s_{\bar h},a_{\bar h})^\top]\big\|_2 \le \frac{C_3\,d^2H\varepsilon\iota}{4}.$$

Proof of Lemma F.9. The inequality results from Lemma F.8 and the fact that $\|X\|_2 \le \|X\|_F$ for any $X\in\mathbb R^{d\times d}$. Note that the conclusion also holds for all $\pi\in\Delta(\Pi^{\mathrm{exp}}_{\epsilon/3})$ by the discussion in Remark E.7.

According to our condition $\epsilon < \frac{H(\lambda^\star)^2}{C_4 d^{7/2}\log(1/\lambda^\star)} = \frac{H(\lambda^\star)^2}{80C_1C_3 d^{7/2}\log(1/\lambda^\star)}$ and (41), we have
$$\lambda_{\min}\big(\mathbb E_{\tilde\pi^\star_{\bar h}}\phi_{\bar h}\phi_{\bar h}^\top\big) \ge \frac{(\lambda^\star)^2}{64\,d^2\log(1/\lambda^\star)} \ge \frac{5C_1C_3\,d^{3/2}\epsilon}{4H} = \frac{5C_3\,d^2H\varepsilon\iota}{4}.$$
Therefore, under the high-probability event of Lemma F.9, by Weyl's inequality,
$$\lambda_{\min}\big(\hat{\mathbb E}_{\tilde\pi^\star_{\bar h}}\phi_{\bar h}\phi_{\bar h}^\top\big) \ge C_3\,d^2H\varepsilon\iota. \qquad (47)$$
Now (47) implies that $\tilde\pi^\star_{\bar h}$ is a feasible solution of the optimization problem (1), and therefore
$$\max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \hat{\mathbb E}_\pi\big[\phi(s_{\bar h},a_{\bar h})^\top(N\cdot\hat\Sigma_{\hat\pi_{\bar h}})^{-1}\phi(s_{\bar h},a_{\bar h})\big] \le \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \hat{\mathbb E}_\pi\big[\phi(s_{\bar h},a_{\bar h})^\top(N\cdot\hat\Sigma_{\tilde\pi^\star_{\bar h}})^{-1}\phi(s_{\bar h},a_{\bar h})\big], \qquad (48)$$
where $\hat\pi_{\bar h}$ is the policy we apply to explore layer $\bar h$ and $\lambda_{\min}(\hat\Sigma_{\hat\pi_{\bar h}}) \ge C_3 d^2H\varepsilon\iota$.
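The transfer of the eigenvalue lower bound from $\Sigma$ to $\hat\Sigma$ above rests on Weyl's inequality, $\lambda_{\min}(\hat\Sigma) \ge \lambda_{\min}(\Sigma) - \|\hat\Sigma - \Sigma\|_2$, which holds for any symmetric perturbation. A small numeric illustration (random matrices, purely for demonstration):

```python
# Numeric illustration of the Weyl's-inequality step:
# lambda_min(Sigma_hat) >= lambda_min(Sigma) - ||Sigma_hat - Sigma||_2.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
Sigma = A @ A.T / 5 + 0.5 * np.eye(5)        # PSD, well-conditioned "true" covariance
E = rng.normal(size=(5, 5))
E = 0.01 * (E + E.T) / 2                      # small symmetric estimation error
Sigma_hat = Sigma + E                         # "estimated" covariance
lam_min = lambda M: np.linalg.eigvalsh(M)[0]  # eigvalsh returns ascending eigenvalues
gap = lam_min(Sigma) - np.linalg.norm(E, 2)
```

Since the perturbation norm is tiny relative to $\lambda_{\min}(\Sigma)$, the estimated covariance keeps most of the eigenvalue floor, exactly as in the step deriving (47).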

F.5.2 ERROR BOUND FOR THE SECOND ESTIMATOR

We consider the upper bound of $\big|\hat{\mathbb E}_\pi[\phi^\top(N\cdot\hat\Sigma_{\pi'})^{-1}\phi] - \mathbb E_\pi[\phi^\top(N\cdot\hat\Sigma_{\pi'})^{-1}\phi]\big|$, where $\phi$ is short for $\phi(s_{\bar h},a_{\bar h})$; recall from Section F.1 that the first term is computed by Algorithm 4.

Lemma F.10. With probability $1-\delta$, for all $\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}$ and all $\pi'\in\Delta(\Pi^{\mathrm{exp}}_{\epsilon/3})$ such that $\lambda_{\min}(\hat\Sigma_{\pi'}) \ge C_3 d^2H\varepsilon\iota$, it holds that
$$\big|\hat{\mathbb E}_\pi[\phi^\top(N\cdot\hat\Sigma_{\pi'})^{-1}\phi] - \mathbb E_\pi[\phi^\top(N\cdot\hat\Sigma_{\pi'})^{-1}\phi]\big| \le \frac{\varepsilon^2}{2d^2}. \qquad (49)$$

Proof of Lemma F.10. We have
$$\mathrm{LHS} \le c_E\cdot\frac{\varepsilon}{C_2 d^2H\iota}\log\big(\tfrac{C_2 dH}{C_1\epsilon\delta}\big)\cdot\mathbb E_\pi\sum_{h=1}^{\bar h-1}\|\phi(s_h,a_h)\|_{(\Lambda^{\bar h-1}_h)^{-1}} \le \frac{\varepsilon}{2d^2H}\cdot H\varepsilon = \frac{\varepsilon^2}{2d^2}.$$
The first inequality results from Lemma E.5 and Lemma F.5. The second holds by our choice of $C_2$ in (32) and the induction condition F.7.

Remark F.11. Under the high-probability event of Lemma F.10, by the property of $\max\{\cdot\}$, for all $\pi'\in\Delta(\Pi^{\mathrm{exp}}_{\epsilon/3})$ such that $\lambda_{\min}(\hat\Sigma_{\pi'}) \ge C_3 d^2H\varepsilon\iota$, it holds that
$$\Big|\max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\hat{\mathbb E}_\pi[\phi^\top(N\cdot\hat\Sigma_{\pi'})^{-1}\phi] - \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\mathbb E_\pi[\phi^\top(N\cdot\hat\Sigma_{\pi'})^{-1}\phi]\Big| \le \frac{\varepsilon^2}{2d^2}.$$

F.6 MAIN PROOF

With all preparations ready, we now prove the main theorem. We assume the high-probability events of Lemma F.8 (which implies Lemma F.9) and Lemma F.10 hold. First of all, we have
$$\max_{\pi}\ \hat{\mathbb E}_\pi\big[\phi^\top(N\cdot\hat\Sigma_{\tilde\pi^\star_{\bar h}})^{-1}\phi\big] \le \max_\pi\ \mathbb E_\pi\big[\phi^\top(N\cdot\hat\Sigma_{\tilde\pi^\star_{\bar h}})^{-1}\phi\big] + \frac{\varepsilon^2}{2d^2} \le \max_\pi\ \mathbb E_\pi\big[\phi^\top(N\cdot\hat\Sigma_{\tilde\pi^\star_{\bar h}})^{-1}\phi\big] + \frac{\varepsilon^2}{8}$$
$$\le \max_\pi\ \mathbb E_\pi\Big[\phi^\top\Big(\frac{4N}{5}\cdot\Sigma_{\tilde\pi^\star_{\bar h}}\Big)^{-1}\phi\Big] + \frac{\varepsilon^2}{8} \le \frac{5d}{4N} + \frac{\varepsilon^2}{8} \le \frac{3\varepsilon^2}{8}, \qquad (52)$$
where all maxima are over $\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}$. The first inequality holds because of Lemma F.10 (and Remark F.11). The second is because, in the meaningful case, $d \ge 2$. The third holds since, under the high-probability event of Lemma F.9,
$$\frac{\Sigma_{\tilde\pi^\star_{\bar h}}}{5} \succeq \frac{C_3\,d^2H\varepsilon\iota}{4}\,I_d \succeq \Sigma_{\tilde\pi^\star_{\bar h}} - \hat\Sigma_{\tilde\pi^\star_{\bar h}},$$
which implies $\hat\Sigma_{\tilde\pi^\star_{\bar h}} \succeq \frac45\Sigma_{\tilde\pi^\star_{\bar h}}$ and thus $(\hat\Sigma_{\tilde\pi^\star_{\bar h}})^{-1} \preceq (\frac45\Sigma_{\tilde\pi^\star_{\bar h}})^{-1}$. The fourth inequality is due to the definition of $\tilde\pi^\star_{\bar h}$ and Theorem B.1. The last holds by our choice of $N$ and $C_2$.
Combining (52) and (48), we have
$$\frac{3\varepsilon^2}{8} \ge \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \hat{\mathbb E}_\pi\big[\phi^\top(N\cdot\hat\Sigma_{\hat\pi_{\bar h}})^{-1}\phi\big].$$
According to Lemma F.10, Remark F.11 and the fact that $\lambda_{\min}(\hat\Sigma_{\hat\pi_{\bar h}}) \ge C_3 d^2H\varepsilon\iota$, it holds that
$$\frac{3\varepsilon^2}{8} \ge \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \mathbb E_\pi\big[\phi^\top(N\cdot\hat\Sigma_{\hat\pi_{\bar h}})^{-1}\phi\big] - \frac{\varepsilon^2}{8},$$
or equivalently,
$$\frac{\varepsilon^2}{2} \ge \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \mathbb E_\pi\big[\phi^\top(N\cdot\hat\Sigma_{\hat\pi_{\bar h}})^{-1}\phi\big]. \qquad (54)$$
Suppose that after applying policy $\hat\pi_{\bar h}$ for $N$ episodes, the data we collect (from layer $\bar h$) is $\{s^i_{\bar h}, a^i_{\bar h}\}_{i\in[N]}$. Let $\hat\Lambda_{\bar h} = I + \sum_{i=1}^N \phi(s^i_{\bar h},a^i_{\bar h})\phi(s^i_{\bar h},a^i_{\bar h})^\top$; we now consider the relationship between $\hat\Lambda_{\bar h}$ and $\hat\Sigma_{\hat\pi_{\bar h}}$. First, according to Lemma F.9, we have
$$N\cdot\hat\Sigma_{\hat\pi_{\bar h}} - N\cdot\Sigma_{\hat\pi_{\bar h}} \preceq \frac{C_3\,N d^2H\varepsilon\iota}{4}\cdot I_d \preceq \frac14\,N\cdot\hat\Sigma_{\hat\pi_{\bar h}}. \qquad (55)$$
Besides, due to Lemma F.1 (with $C=1$), with probability $1-\delta$,
$$N\cdot\Sigma_{\hat\pi_{\bar h}} - \hat\Lambda_{\bar h} \preceq 4\sqrt 2\,\sqrt{N\iota}\cdot I_d \preceq \frac{C_3\,N d^2H\varepsilon\iota}{4}\cdot I_d \preceq \frac14\,N\cdot\hat\Sigma_{\hat\pi_{\bar h}}. \qquad (56)$$
Combining (55) and (56), we have with probability $1-\delta$,
$$N\cdot\hat\Sigma_{\hat\pi_{\bar h}} - \hat\Lambda_{\bar h} \preceq \frac12\,N\cdot\hat\Sigma_{\hat\pi_{\bar h}},$$
or equivalently,
$$(N\cdot\hat\Sigma_{\hat\pi_{\bar h}})^{-1} \succeq (2\hat\Lambda_{\bar h})^{-1}. \qquad (58)$$
Plugging (58) into (54), we have with probability $1-\delta$,
$$\frac{\varepsilon^2}{2} \ge \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \mathbb E_\pi\big[\phi^\top(N\cdot\hat\Sigma_{\hat\pi_{\bar h}})^{-1}\phi\big] \ge \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \mathbb E_\pi\big[\phi^\top(2\hat\Lambda_{\bar h})^{-1}\phi\big] \ge \frac12\Big(\max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \mathbb E_\pi\|\phi\|_{\hat\Lambda_{\bar h}^{-1}}\Big)^2,$$
where the last inequality follows from the Cauchy–Schwarz inequality. Recall that after the exploration of layer $\bar h$, $\Lambda^{\bar h}_{\bar h}$ in (35) uses all previous data up to the $\bar h$-th deployment, which implies $\Lambda^{\bar h}_{\bar h} \succeq \hat\Lambda_{\bar h}$ and $(\Lambda^{\bar h}_{\bar h})^{-1} \preceq \hat\Lambda_{\bar h}^{-1}$. Therefore, with probability $1-\delta$,
$$\varepsilon \ge \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \mathbb E_\pi\|\phi\|_{\hat\Lambda_{\bar h}^{-1}} \ge \max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \mathbb E_\pi\|\phi\|_{(\Lambda^{\bar h}_{\bar h})^{-1}},$$
which establishes the induction step. Recall that after the whole exploration process for all $H$ layers, the dataset is $\mathcal D = \{s^n_h, a^n_h\}_{(h,n)\in[H]\times[HN]}$ and $\Lambda_h = I + \sum_{n=1}^{HN}\phi^n_h(\phi^n_h)^\top$ for all $h\in[H]$. By induction, we have with probability $1-H\delta$,
$$\max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\ \mathbb E_\pi\sum_{h=1}^H\|\phi(s_h,a_h)\|_{\Lambda_h^{-1}} \le H\varepsilon.$$
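The key PSD comparison above — $N\hat\Sigma - \hat\Lambda \preceq \frac12 N\hat\Sigma$ implies $2\hat\Lambda \succeq N\hat\Sigma$, hence $\phi^\top(N\hat\Sigma)^{-1}\phi \ge \phi^\top(2\hat\Lambda)^{-1}\phi$ — can be checked numerically. A toy demonstration with synthetic matrices (not the paper's data):

```python
# Numeric check of the PSD comparison: if N*Sigma - Lam_hat <= (1/2) N*Sigma in
# PSD order, then 2*Lam_hat >= N*Sigma, so quadratic forms under the inverses
# are ordered: phi^T (N*Sigma)^{-1} phi >= phi^T (2*Lam_hat)^{-1} phi.
import numpy as np

rng = np.random.default_rng(3)
d, N = 4, 50
B = rng.normal(size=(d, d))
NSigma = N * (B @ B.T / d + np.eye(d))   # positive definite N * Sigma
Lam_hat = 0.7 * NSigma                   # satisfies Lam_hat >= (1/2) N*Sigma
phi = rng.normal(size=d)
phi /= np.linalg.norm(phi)               # unit feature vector
q_true = phi @ np.linalg.solve(NSigma, phi)
q_surr = phi @ np.linalg.solve(2 * Lam_hat, phi)
```

Here the surrogate quadratic form is smaller, which is exactly why the $\varepsilon^2/2$ bound on the population quantity transfers to the empirical Gram matrix $\hat\Lambda_{\bar h}$.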
Published as a conference paper at ICLR 2023

In addition, since $\Pi^{\mathrm{eval}}_{\epsilon/3} \subseteq \Pi^{\mathrm{exp}}_{\epsilon/3}$ according to Lemma C.7, we have
$$\max_{\pi\in\Pi^{\mathrm{eval}}_{\epsilon/3}}\ \mathbb E_\pi\sum_{h=1}^H\|\phi(s_h,a_h)\|_{\Lambda_h^{-1}} \le H\varepsilon. \qquad (62)$$
Given (62), we are ready to prove the final result. Recall that the output of Algorithm 3 (with input $\pi$ and $r$) is $\hat V^\pi(r)$. With probability $1-\delta$, for all feasible linear reward functions $r$ and all $\pi\in\Pi^{\mathrm{eval}}_{\epsilon/3}$, it holds that
$$|\hat V^\pi(r) - V^\pi(r)| \le c'H\sqrt d\cdot\sqrt{\log\big(\tfrac{3Hd}{\epsilon\delta}\big) + \log\big(\tfrac N\delta\big)}\cdot\mathbb E_\pi\sum_{h=1}^H\|\phi(s_h,a_h)\|_{\Lambda_h^{-1}} \le c_D\,H\sqrt d\cdot\log\big(\tfrac{C_2 dH}{C_1\epsilon\delta}\big)\cdot H\varepsilon \le \frac{1}{3C_1}H\sqrt d\,\iota\cdot H\varepsilon = \frac\epsilon3,$$
where the first inequality holds due to Lemma D.6, the second because of Lemma F.2 and (62), the third by our choice of $C_1$ in (31), and the last equation by our definition $\varepsilon = \frac{C_1\epsilon}{H^2\sqrt d\,\iota}$. Suppose $\tilde\pi(r) = \mathrm{argmax}_{\pi\in\Pi^{\mathrm{eval}}_{\epsilon/3}} V^\pi(r)$. Since our output policy $\hat\pi(r)$ is the greedy policy with respect to $\hat V^\pi(r)$, we have
$$V^{\tilde\pi(r)}(r) - V^{\hat\pi(r)}(r) \le \big(V^{\tilde\pi(r)}(r) - \hat V^{\tilde\pi(r)}(r)\big) + \big(\hat V^{\tilde\pi(r)}(r) - \hat V^{\hat\pi(r)}(r)\big) + \big(\hat V^{\hat\pi(r)}(r) - V^{\hat\pi(r)}(r)\big) \le \frac{2\epsilon}{3}. \qquad (64)$$
In addition, according to Lemma C.5, $V^\star(r) - V^{\tilde\pi(r)}(r) \le \frac\epsilon3$. Combining these two results, we have with probability $1-\delta$, for all feasible linear reward functions $r$,
$$V^\star(r) - V^{\hat\pi(r)}(r) \le \epsilon. \qquad (65)$$
Since the deployment complexity of Algorithm 1 is clearly bounded by $H$, the proof of Theorem 5.1 is complete.

G COMPARISONS ON RESULTS AND TECHNIQUES

In this section, we compare our results with the closest related work (Huang et al., 2022). We begin with a comparison of the conditions.

Comparison of conditions. In Assumption 2.1, we assume that the linear MDP satisfies $\lambda^\star = \min_{h\in[H]}\sup_\pi\lambda_{\min}(\mathbb E_\pi[\phi(s_h,a_h)\phi(s_h,a_h)^\top]) > 0$. In comparison, Huang et al. (2022) assume that $\nu_{\min} = \min_{h\in[H]}\min_{\|\theta\|=1}\max_\pi\mathbb E_\pi[(\phi_h^\top\theta)^2] > 0$. Overall, these two are analogous reachability assumptions, while our assumption is slightly stronger since $\nu^2_{\min}$ is lower bounded by $\lambda^\star$.

Dependence on the reachability coefficient. Our Algorithm 1 only takes $\epsilon$ as input and does not require knowledge of $\lambda^\star$, while the theoretical guarantee in Theorem 5.1 requires the additional condition that $\epsilon$ be small compared to $\lambda^\star$. For $\epsilon$ larger than a problem-dependent threshold, the theoretical guarantee no longer holds. This dependence is similar to the dependence on the reachability coefficient $\nu_{\min}$ in Zanette et al. (2020b), whose algorithm also takes $\epsilon$ as input and requires $\epsilon$ to be small compared to $\nu_{\min}$. In comparison, Algorithm 2 in Huang et al. (2022) takes the reachability coefficient $\nu_{\min}$ itself as input, which is a stronger requirement than requiring $\epsilon$ to be small compared to $\lambda^\star$.

Comparison of sample complexity bounds. Our main improvement over Huang et al. (2022) is in the dependence on $d$ and $\nu_{\min}$, where $\nu_{\min}$ is always upper bounded by $1$ and can be arbitrarily small (see the illustration below). In the large-$\epsilon$ regime, the sample complexity bounds in both works look like $\mathrm{poly}(d, H, \frac{1}{\lambda^\star})$ (or $\mathrm{poly}(d, H, \frac{1}{\nu_{\min}})$), and such a "burn-in" period is common in works based on optimal experiment design (Wagenmaker & Jamieson, 2022).

Illustration of $\nu_{\min}$. In this part, we construct some examples to show what $\nu_{\min}$ looks like. First, consider the following simple example, where linear MDP 1 is defined as: 1. The linear MDP is a tabular MDP with only one action and several states ($A = 1$, $S > 1$). 2.
The features are the canonical basis (Jin et al., 2020b), and thus $d = S$. 3. The transition from any $(s,a)\in\mathcal S\times\mathcal A$ at any time step $h\in[H]$ is uniformly random. Therefore, under linear MDP 1, both $\nu^2_{\min}$ in Huang et al. (2022) and our $\lambda^\star$ equal $\frac1d$, and our improvement in sample complexity is a factor of $d^2$. Generally speaking, this example has a relatively large $\nu_{\min}$, and there are various examples with even smaller $\nu_{\min}$. Next, we construct linear MDP 2, which is similar to linear MDP 1 but does not have a uniform transition kernel: 1. The linear MDP is a tabular MDP with only one action and several states ($A = 1$, $S > 1$). 2. The features are the canonical basis (Jin et al., 2020b), and thus $d = S$. Therefore, under linear MDP 2, both $\nu^2_{\min}$ in Huang et al. (2022) and our $\lambda^\star$ equal $p_{\min}$ (with $p_{\min} \le \frac1d$), and our improvement in sample complexity is a factor of $\frac{d}{p_{\min}}$, which is always at least $d^2$ and can be much larger. In the worst case, according to the condition ($\epsilon < \nu^8_{\min}$) for the asymptotic sample complexity in Huang et al. (2022) to dominate, $p_{\min} = \nu^2_{\min}$ can be as small as $\epsilon^{1/4}$, in which case the sample complexity in Huang et al. (2022) is $\tilde O(\frac{1}{\epsilon^{2.25}})$, which does not have optimal dependence on $\epsilon$. In conclusion, our improvement in sample complexity is at least a factor of $d$ and can be much more significant in various circumstances.

Technique comparison. We discuss why we can get rid of the $\frac{d}{\nu^2_{\min}}$ dependence in Huang et al. (2022). First, instead of minimizing $\max_\pi\mathbb E_\pi\|\phi_h\|_{\Lambda_h^{-1}}$ over all policies, we only minimize the smaller quantity $\max_{\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}}\mathbb E_\pi\|\phi_h\|_{\Lambda_h^{-1}}$, where the maximum is taken over our explorative policy set. Therefore, our approximation of the generalized G-optimal design saves the factor of $1/\nu^2_{\min}$. In addition, note that in Lemma 6.3 the dependence on $d$ is only $\sqrt d$; this is because we estimate the value functions (w.r.t. $\pi$ and $r$) directly instead of adding optimism and using LSVI.
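The claim that linear MDP 1 gives $\lambda^\star = \nu^2_{\min} = \frac1d$ is easy to verify: with canonical-basis features and a uniform occupancy, the feature covariance is $\frac1d I$. A short numeric check (the construction is the one described above, in synthetic form):

```python
# Numeric check: for canonical-basis features phi(s, a) = e_s and a uniform
# occupancy over d = S states (linear MDP 1), E_pi[phi phi^T] = I/d, so the
# reachability coefficient lambda^* equals 1/d.
import numpy as np

d = 6
occupancy = np.ones(d) / d                      # uniform over d states
features = np.eye(d)                            # phi(s, a) = e_s (one action)
Sigma = sum(p * np.outer(f, f) for p, f in zip(occupancy, features))
lam_star = np.linalg.eigvalsh(Sigma)[0]         # smallest eigenvalue
```

Replacing the uniform occupancy with one that puts mass $p_{\min}$ on some state reproduces the linear MDP 2 situation, where $\lambda^\star = p_{\min}$ can be far below $\frac1d$.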
Compared to the log-covering number $O(d^2)$ of the bonus term $\phi_h^\top\Lambda^{-1}\phi_h$, our covering of the pairs (policy $\pi_h\in\Pi^{\mathrm{eval}}_{\epsilon/3,h}$, linear reward $r_h$) has log-covering number $O(d)$.

…where $d^\pi_{\bar h}(s,a)$ is estimated by applying Algorithm 4. Suppose $\epsilon \le \frac{Hd_m}{C_4 SA}$; then with probability $1-\delta$, for any reward function $r$, Algorithm 2 returns a policy that is $\epsilon$-optimal with respect to $r$. In addition, the deployment complexity of Algorithm 1 is $H$, while the number of trajectories is $\tilde O(\frac{S^2AH^5}{\epsilon^2})$.

Proof of Theorem H.2. Since the proof is quite similar to that of Theorem 5.1, we sketch it and highlight the differences from the linear MDP setting while omitting details. Suppose that after the $\bar h$-th deployment, the visitation count of $(h,s,a)$ is $N^{\bar h}_h(s,a)$. Our induction condition becomes: after the $(\bar h-1)$-th deployment,
$$\max_\pi\ \sum_{h=1}^{\bar h-1}\sum_{s,a}\frac{d^\pi_h(s,a)}{\sqrt{N^{\bar h-1}_h(s,a)}} \le (\bar h-1)\varepsilon.$$
Based on this condition, we prove that with high probability,
$$\max_\pi\ \sum_{s,a}\frac{d^\pi_{\bar h}(s,a)}{\sqrt{N^{\bar h}_{\bar h}(s,a)}} \le \varepsilon.$$
First, under a tabular MDP, Algorithm 4 is equivalent to value iteration under the empirical transition kernel. Therefore, by standard methods such as the simulation lemma, we have with high probability, for any $\pi\in\Pi_0$ and any reward $r$ with upper bound $A$ (the $V_h$ functions are those derived in Algorithm 4),
$$\big|\hat{\mathbb E}_\pi r(s_{\bar h},a_{\bar h}) - \mathbb E_\pi r(s_{\bar h},a_{\bar h})\big| \le \mathbb E_\pi\sum_{h=1}^{\bar h-1}\big|\big(\hat P_h - P_h\big)\cdot V_{h+1}\big|(s_h,a_h) \le \mathbb E_\pi\sum_{h=1}^{\bar h-1} A\cdot\big\|\hat P_h(\cdot|s_h,a_h) - P_h(\cdot|s_h,a_h)\big\|_1 \le \tilde O\Big(A\sqrt S\cdot\mathbb E_\pi\sum_{h=1}^{\bar h-1}\frac{1}{\sqrt{N^{\bar h-1}_h(s_h,a_h)}}\Big) \le \tilde O\big(A\sqrt S\cdot H\varepsilon\big). \qquad (67)$$
Using a proof identical to (67), we have with high probability, for all $\pi\in\Pi_0$ and $r$,
$$|\hat V^\pi(r) - V^\pi(r)| \le \tilde O(H\sqrt S\cdot H\varepsilon) \le \frac\epsilon2.$$
Since $\Pi_0$ contains the optimal policy, our output policy is $\epsilon$-optimal.
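The occupancy measures $d^\pi_h(s,a)$ that drive the tabular induction can be computed exactly by a forward recursion over layers. A small sketch on a synthetic tabular MDP with a deterministic policy (all names illustrative):

```python
# Sketch: compute d_pi_h(s, a) = P_pi(s_h = s, a_h = a) by forward recursion in
# a small synthetic tabular MDP under a deterministic policy.
import numpy as np

def occupancy(P, policy, s1, H):
    """P[h][s, a, s'] transition probabilities; policy[h][s] the chosen action.

    Returns d, where d[h][s, a] is the occupancy measure at layer h (0-indexed).
    """
    S, A = P[0].shape[0], P[0].shape[1]
    d = [np.zeros((S, A)) for _ in range(H)]
    mu = np.zeros(S)
    mu[s1] = 1.0                                  # state distribution at layer 1
    for h in range(H):
        for s in range(S):
            d[h][s, policy[h][s]] = mu[s]         # deterministic policy
        mu = np.einsum("sap,sa->p", P[h], d[h])   # push forward one layer
    return d
```

Each layer's occupancy sums to one, which is the normalization used when the induction condition is stated as a weighted sum $\sum_{s,a} d^\pi_h(s,a)/\sqrt{N_h(s,a)}$.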

H.2 PROOF OF LOWER BOUNDS

For regret minimization, we assume the number of episodes is K while the number of steps is T := KH.



We abuse notation so that $r$ also denotes the expected (immediate) reward function.

The general case where the initial distribution is an arbitrary distribution can be recovered from this setting by adding one layer to the MDP.

Also known as reward-aware RL, which aims to identify a near-optimal policy given the reward function.

For more details about explorative policies, please refer to Appendix C.3. For more details about policies to evaluate, please refer to Appendix C.2.

For $\hat\Sigma_\pi$, what we need to handle is the matrix reward $\phi_h\phi_h^\top$ and stochastic policies $\pi\in\Delta(\Pi^{\mathrm{exp}}_{\epsilon/3})$; we apply a generalized version of Algorithm 4 to tackle this, as discussed in Appendix F.1.

$C_1, C_2, C_3$ are the universal constants in Algorithm 1.

$\nu_{\min}$ in Huang et al. (2022) is defined as $\nu_{\min} = \min_{h\in[H]}\min_{\|\theta\|=1}\max_\pi\mathbb E_\pi[(\phi_h^\top\theta)^2]$, which is also a measure of explorability. Note that $\nu_{\min}$ is always upper bounded by $1$ and can be arbitrarily small.

Our condition $\sup_\pi\lambda_{\min}(\mathbb E_\pi\phi_h\phi_h^\top) \ge \lambda^\star$ implies that for any $u\in\mathbb R^d$ with $\|u\|_2 = 1$, $\max_\pi\mathbb E_\pi(\phi_h^\top u)^2 \ge \lambda^\star$. Therefore, the proof of Lemma E.14 of Huang et al. (2022) holds by plugging in $c = 1$.

We will show that all cases we consider in this paper satisfy these two assumptions.

The transformation is to ensure that the reward is nonnegative.

We ignore the extreme case where $H$ is super large, for simplicity. When $H$ is very large, we can simply construct $\Pi^{\mathrm{exp}}_{\epsilon/H}$ instead and the proof is identical.

Note that all matrices here are symmetric and positive definite.

We only consider the data from layer $\bar h$.




$\sum_{k=1}^{K-1}\big|\{(h,s): \pi^h_k(s) \neq \pi^h_{k+1}(s)\}\big|$, where $K$ is the number of episodes. For multi-armed bandits with $A$ arms and $T$ episodes, Cesa-Bianchi et al. (2013) first achieved the optimal $O(\sqrt{AT})$ regret with only $O(A\log\log T)$ policy switches. Simchi-Levi & Xu (2019) generalized the result by showing that to get the optimal $O(\sqrt T)$ regret bound, both the switching-cost upper and lower bounds are of order $A\log\log T$. Under stochastic linear bandits, Abbasi-Yadkori et al. (2011) applied the doubling trick to achieve the optimal regret $\tilde O(d\sqrt T)$ with $O(d\log T)$ policy switches. Under a slightly different setting, Ruan et al. (2021) improved the switching cost to $O(\log\log T)$ without worsening the regret bound. Under tabular MDPs, Bai et al. (2019) applied the doubling trick to Q-learning and reached a regret bound of $\tilde O(\sqrt{H^3SAT})$ with local switching cost $O(H^3SA\log T)$. Zhang et al. (2020c) applied advantage decomposition to improve the regret and local switching cost to $\tilde O(\sqrt{H^2SAT})$ and $O(H^2SA\log T)$, respectively. Recently, Qiao et al. (2022) showed that to achieve the optimal $O(\sqrt T)$ regret, both the global switching-cost upper and lower bounds are of order $HSA\log\log T$. Under linear MDPs, Gao et al. (2021) applied the doubling trick to LSVI-UCB and arrived at a regret bound of $\tilde O(\sqrt{d^3H^3T})$ with global switching cost $O(dH\log T)$. This result was generalized by Wang et al. (2021) to work for an arbitrary switching-cost budget. Huang et al. (2022) managed to do pure exploration under linear MDPs within $O(dH)$ switches.

Under multi-armed bandits, $O(\sqrt{AT})$ regret is achievable using only $O(\log\log T)$ batches. Perchet et al. (2016) proved a regret lower bound of $\Omega\big(T^{\frac{1}{2-2^{1-M}}}\big)$ for algorithms using $M$ batches in the 2-armed bandit setting, which means $\Omega(\log\log T)$ batches are necessary for an $O(\sqrt T)$ regret bound. The result was generalized to $K$-armed bandits by Gao et al. (2019). Under stochastic linear bandits, Han et al. (2020) designed an algorithm with regret bound $O(\sqrt T)$ running in $O(\log\log T)$ batches.

2. When the linear MDP is actually a tabular MDP with finite state set $|\mathcal S| = S$ and finite action set $|\mathcal A| = A$, the feature map reduces to the canonical basis in $\mathbb R^d = \mathbb R^{SA}$ with $\phi(s,a) = e_{(s,a)}$ (Jin et al., 2020b). Let $d^\pi_h(s,a) = \mathbb P_\pi(s_h = s, a_h = a)$ denote the occupancy measure; then the previous optimization problem (2) corresponds to finding a policy $\pi_0$ that can cover all policies from the policy set $\Pi$. According to Lemma 1 in Zhang et al. (2022) (we only use the case $m = 1$), the minimum of (4) can be bounded by $d = SA$.

Elliptical Potential Lemma (Lemma 26 of Agarwal et al. (2020)). Consider a sequence of $d\times d$ positive semi-definite matrices $X_1,\ldots,X_T$ with $\max_t \mathrm{Tr}(X_t) \le 1$, and define

$\frac1\epsilon$). Then the R.H.S. of Lemma D.3 is of order $\tilde O(H\sqrt d)$, which effectively saves a factor of $\sqrt d$ compared to Lemma A.1 of Wang et al. (2020). Now we are ready to prove the following lemma.

Lemma D.5. With probability $1-\delta$, for any policy $\pi\in\Pi^{\mathrm{eval}}_{\epsilon_0}$ and any linear reward function $r$ that may appear in Algorithm 3, the $\{V_h\}_{h\in[H]}$ and $\{\hat w_h\}_{h\in[H]}$ derived by Algorithm 3 satisfy, for all $(h,s,a)\in[H]\times\mathcal S\times\mathcal A$, $|\phi(s,a)^\top\hat w_h -$

$\varepsilon = \frac{C_1\epsilon}{H^2\sqrt d\,\iota}$. The explorative policy set we construct is $\Pi^{\mathrm{exp}}_{\epsilon/3}$, while the set of policies to evaluate is $\Pi^{\mathrm{eval}}_{\epsilon/3}$.

where $\hat E_{ij}$ is the output of Algorithm 4 with input $\pi$, reward $r(s,a) = \frac{\phi_i(s,a)\phi_j(s,a)+1}{2}$, upper bound $A = 1$, layer $\bar h$ and exploration dataset $\mathcal D$. Therefore, the set of all possible rewards is $\mathcal R = \big\{\frac{\phi_i(s,a)\phi_j(s,a)+1}{2} : (i,j)\in[d]\times[d]\big\}$. The set $\mathcal R$ is a covering set of itself, with log-covering number $B_\epsilon = \log|\mathcal R| = 2\log d$. In addition, note that $\hat\Sigma_{\pi,(ij)} = \hat\Sigma_{\pi,(ji)}$ for all $i,j$, which means the estimate $\hat\Sigma_\pi$ is symmetric. The above discussion handles the case $\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}$; for the general case $\pi\in\Delta(\Pi^{\mathrm{exp}}_{\epsilon/3})$, the estimate is given by (25) in Remark E.7.

layer $\bar h$ and exploration dataset $\mathcal D$. Note that the uniform upper bound $A$ is valid because we only consider the case $\lambda_{\min}(\hat\Sigma_\pi) \ge C_3 d^2H\varepsilon\iota$, which implies
$$\lambda_{\min}(N\cdot\hat\Sigma_\pi) \ge C_3\,d^2H\varepsilon\iota\cdot\frac{C_2 d\iota}{\varepsilon^2} \ge \frac{C_2 d^3H\iota^2}{\varepsilon}.$$
Therefore, the set of all possible rewards is a subset of $\mathcal R = \big\{r(s,a) = \phi(s,a)^\top\Sigma^{-1}\phi(s,a)\,\big|\,\lambda_{\min}(\Sigma)\ge\frac{C_2 d^{7/2}H^3\iota^3}{C_1\epsilon}\big\}$, and the $\epsilon$-covering number is characterized by Lemma F.3 below.

Lemma F.3 (Covering number). Consider the set of possible rewards $\mathcal R = \big\{r(s,a) = \phi(s,a)^\top\Sigma^{-1}\phi(s,a)\,\big|\,\lambda_{\min}(\Sigma) \ge \frac{C_2 d^{7/2}H^3\iota^3}{C_1\epsilon}\big\}$.

is calculated by calling Algorithm 4 with $A = \frac{\varepsilon}{C_2 d^3H\iota^2}$. Note that we only need to consider the cases $\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}$ and $\pi'\in\Delta(\Pi^{\mathrm{exp}}_{\epsilon/3})$ with $\lambda_{\min}(\hat\Sigma_{\pi'}) \ge C_3 d^2H\varepsilon\iota$.

Lemma F.10. With probability $1-\delta$, for all $\pi\in\Pi^{\mathrm{exp}}_{\epsilon/3}$ and all $\pi'\in\Delta(\Pi^{\mathrm{exp}}_{\epsilon/3}$

3. The transitions from any $(s,a)\in\mathcal S\times\mathcal A$ at any time step $h\in[H]$ are identical and satisfy $\min_{s'\in\mathcal S} P_h(s'|s,a) = p_{\min}$.

H PROOF FOR SECTION 7

H.1 APPLICATION TO TABULAR MDP

Recall that the tabular MDP has a discrete state-action space with $|\mathcal S| = S$, $|\mathcal A| = A$. We transfer our Assumption 2.1 to its counterpart under the tabular MDP and assume it holds.

Assumption H.1. Define $d^\pi_h(\cdot,\cdot)$ to be the occupancy measure, i.e., $d^\pi_h(s,a) = \mathbb P_\pi(s_h = s, a_h = a)$. Let $d_m = \min_h\sup_\pi\min_{s,a} d^\pi_h(s,a)$; we assume that $d_m > 0$.

Theorem H.2. We select $\varepsilon = \frac{C_1\epsilon}{H^2\sqrt S\,\iota}$, $\Pi^{\mathrm{exp}} = \Pi^{\mathrm{eval}} = \Pi_0 = \{\text{all deterministic policies}\}$ and $N = \frac{C_2 SA\iota}{\varepsilon^2} = \frac{C_2 S^2AH^4\iota^3}{C_1^2\epsilon^2}$,

that our condition on $\epsilon$ suffices. Note that with high probability, for all policies $\pi\in\Pi_0$ and all $(s,a)$, the estimation error of $d^\pi_h(s,a)$ is bounded by $\sqrt S\cdot H\varepsilon$. As a result, the estimation error can be ignored compared to $d^{\hat\pi_{\bar h}}_{\bar h}(s,a)$ or $d^{\tilde\pi^\star_{\bar h}}_{\bar h}(s,a)$. With a proof identical to Section F.6, the induction still holds. From the induction, letting $N_h(s,a)$ be the final visitation count of $(h,s,a)$, we have


Zihan Zhang, Yuhang Jiang, Yuan Zhou, and Xiangyang Ji. Near-optimal regret bounds for multi-batch reinforcement learning. arXiv preprint arXiv:2210.08238, 2022.

Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory, pp. 4532–4576. PMLR, 2021.

$\tilde O(\sqrt{SAT})$ regret under non-stationary MDPs. Dann et al. (2019) provided policy certificates in addition to attaining an optimal regret bound. Different from these minimax-optimal algorithms, Zanette & Brunskill (2019) derived a problem-dependent regret bound, which implies the minimax regret bound. Another line of work studies regret minimization under linear MDPs. Yang & Wang (2019) developed the first efficient algorithm for linear MDPs with a simulator. Jin et al. (2020b) applied LSVI-UCB to achieve a regret bound of $\tilde O(\sqrt{d^3H^3T})$. Later, Zanette et al. (2020a) improved the regret bound to $\tilde O(\sqrt{d^2H^3T})$ at the cost of computation. Recently, Hu et al. (2022) first reached the minimax-optimal regret.

A SUMMARY

Comparison of different policy sets.

F.5.1 ERROR BOUND FOR THE FIRST ESTIMATOR

We first consider the upper bound of $\big\|\hat{\mathbb E}_\pi[\phi(s_{\bar h},a_{\bar h})\phi(s_{\bar h},a_{\bar h})^\top] - \mathbb E_\pi[\phi(s_{\bar h},a_{\bar h})\phi(s_{\bar h},a_{\bar h})^\top]\big\|$. As stated in the first half of Section F.1, $\hat{\mathbb E}_\pi[\phi(s_{\bar h},a_{\bar h})\phi(s_{\bar h},a_{\bar h})^\top]$ is estimated by calling Algorithm 4 for each coordinate $(i,j)\in[d]\times[d]$. Therefore, we first bound the pointwise error (Lemma F.8).

is on the sample complexity bound in the small-$\epsilon$ regime: compare our asymptotic sample complexity bound $\tilde O\big(\frac{d^2H^5}{\epsilon^2}\big)$ with the $\tilde O\big(\frac{d^3H^5}{\nu^2_{\min}\epsilon^2}\big)$ bound of Huang et al. (2022).

ACKNOWLEDGMENTS

The research is partially supported by NSF Awards #2007117. The authors would like to thank Jiawei Huang and Nan Jiang for explaining the result of their paper.


upper bound A_R ≤ 1. In addition, assume there exists an ϵ-covering R̃_ϵ of R with covering number log(|R̃_ϵ|) = B_ϵ. For fixed h ∈ [H], in the case where the layer to estimate is exactly h, V_h can be written as in equation (18), and the set Π^exp_{ϵ1,h} × R̃_ϵ provides an ϵ-covering of V_h. Thus the covering number in this case is |Π^exp_{ϵ1,h}| · |R̃_ϵ|. In addition, if the layer to estimate is some h′ > h, then V_h can be written as in equation (19), with its own function set and covering. Since every possible V_h falls into either case (18) (the layer to estimate is exactly h) or case (19) (the layer to estimate is larger than h), for any h ∈ [H] the ϵ-covering number N_ϵ of all possible V_h satisfies the bound obtained by combining the two cases.

Now we have the following key lemma. The proof is almost identical to that of Lemma D.3, so we omit it here.

Lemma E.3. With probability 1 − δ, for any policy π ∈ Π^exp_{ϵ1}, any reward function r ∈ R that may appear in Algorithm 4 (with the input A = A_R), and any layer h, the {V_h}_{h∈[H]} derived by Algorithm 4 satisfy the stated error bound for some universal constant c > 0.

Now we can provide the following Lemma E.4, whose proof is almost identical to that of Lemma D.5; the only difference is that H is replaced by A_R.

Lemma E.4. With probability 1 − δ, for any policy π ∈ Π^exp_{ϵ1}, any reward function r ∈ R that may appear in Algorithm 4 (with the input A = A_R), and any layer h, the corresponding error bound holds for some universal constant c′ > 0.

Finally, the error bounds of our estimators are summarized in the following lemma.

Lemma E.5. For any policy π ∈ Π^exp_{ϵ1}, any reward function r ∈ R that may appear in Algorithm 4 (with the input A = A_R), and any layer h, let the output of Algorithm 4 be Ê_π[r(s_h, a_h)]. Then with probability 1 − δ, the stated estimation error bound holds.

Theorem H.3 (Restatement of Theorem 7.2). For any algorithm with the optimal Õ(√(poly(d, H) · T)) regret bound, the switching cost is at least Ω(dH log log T).

Proof of Theorem H.3. We first construct a linear MDP with two states: the initial state s_1 and the absorbing state s_2.
For the absorbing state s_2, the only available action is a_0, while for the initial state s_1 the available actions are {a_1, a_2, ..., a_{d−1}}. We then define the feature map so that for (s_1, a_i) with i ∈ [d − 1], the (i + 1)-th coordinate is 1 while all other coordinates are 0. We next define the measure µ_h and the reward vector θ_h so that the rewards r_{h,i} are unknown non-zero values.

Combining these definitions, for any deterministic policy the only possible behavior is the following: the agent takes action a_1 and stays at s_1 for the first h − 1 steps; at step h it takes an action a_i (i ≥ 2) and transitions to s_2 with reward r_{h,i}; afterwards it stays at s_2 and collects no more reward. For this trajectory, the total reward is r_{h,i}. Moreover, for any deterministic policy the trajectory is fixed, just like pulling an "arm" in the multi-armed bandit setting. Note that the total number of such "arms" with unknown non-zero reward is at least (d − 2)H. Hence, even if the transition kernel is known to the agent, this linear MDP is still as difficult as a multi-armed bandit problem with Ω(dH) arms. Together with Lemma H.4 below, the proof is complete.

Lemma H.4 (Theorem 2 in Simchi-Levi & Xu (2019)). In the K-armed bandit problem, there exists an absolute constant C > 0 such that for all K > 1, S ≥ 0, T ≥ 2K, and for every policy π with switching budget S, the regret is lower bounded in terms of q(S, K) = ⌊(S − 1)/(K − 1)⌋. This further implies that Ω(K log log T) switches are necessary for achieving the O(√T) regret bound.

Theorem H.5 (Restatement of Theorem 7.3). For any algorithm with the optimal Õ(√(poly(d, H) · T)) regret bound, the number of batches is at least Ω(H / log_d T + log log T).

Proof of Theorem H.5. Corollary 2 of Gao et al. (2019) proved that, in the multi-armed bandit problem, any algorithm with the optimal O(√T) regret bound requires at least Ω(log log T) batches.
In the proof of Theorem H.3, we showed that a linear MDP can be at least as difficult as a multi-armed bandit problem, which means the Ω(log log T) lower bound on the number of batches also applies to linear MDPs. In addition, Theorem B.3 in Huang et al. (2022) states an Ω(H / log_d(NH)) lower bound on the deployment complexity of any algorithm with a PAC guarantee. Note that one deployment of an arbitrary policy is equivalent to one batch. Suppose we could design an algorithm achieving O(√T) regret within K episodes and M batches; then we could identify a near-optimal policy in M deployments, where each deployment is allowed to collect K trajectories. Therefore, M ≥ Ω(H / log_d T). Combining these two results completes the proof.
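The "arms" reduction in the proof of Theorem H.3 can be sanity-checked numerically. The sketch below is hypothetical: the concrete values of d and H, and the reward values r_{h,i}, are made up for illustration; it only verifies that each deterministic switch-at-step-h policy collects exactly one fixed reward, so the instance behaves like a bandit with (d − 2)H arms.

```python
# Hypothetical instantiation of the two-state hard instance from the proof of
# Theorem H.3; d, H and the rewards r_{h,i} are made-up example values.
d, H = 5, 4
r = {(h, i): 0.1 * h + 0.01 * i for h in range(1, H + 1) for i in range(2, d)}

def rollout(h_switch, i):
    """Deterministic policy: play a_1 at s_1 until step h_switch, then a_i."""
    state, total = "s1", 0.0
    for h in range(1, H + 1):
        if state == "s2":            # absorbing state: action a_0, no reward
            continue
        if h < h_switch:             # action a_1 keeps the agent at s_1
            continue
        total += r[(h_switch, i)]    # action a_i: collect r_{h,i}, move to s_2
        state = "s2"
    return total

# Every such deterministic policy behaves like pulling one fixed "arm",
# and there are (d - 2) * H distinct arms with unknown rewards.
arms = [(h, i) for h in range(1, H + 1) for i in range(2, d)]
for (h, i) in arms:
    assert rollout(h, i) == r[(h, i)]
print(len(arms))  # prints 12, i.e. (d - 2) * H
```

Each rollout is fully deterministic and returns exactly r_{h,i}, matching the claim that, with the transition kernel known, the learner faces a bandit with Ω(dH) arms.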

