IMPROVED SAMPLE COMPLEXITY FOR REWARD-FREE REINFORCEMENT LEARNING UNDER LOW-RANK MDPS

Abstract

In reward-free reinforcement learning (RL), an agent first explores the environment without any reward information, in order to achieve certain learning goals afterwards for any given reward. In this paper we focus on reward-free RL under low-rank MDP models, in which both the representation and the linear weight vectors are unknown. Although various algorithms have been proposed for reward-free low-rank MDPs, the corresponding sample complexity is still far from satisfactory. In this work, we first provide the first known sample complexity lower bound that holds for any algorithm under low-rank MDPs. This lower bound implies that it is strictly harder to find a near-optimal policy under low-rank MDPs than under linear MDPs. We then propose a novel model-based algorithm, coined RAFFLE, and show that it can both find an ϵ-optimal policy and achieve ϵ-accurate system identification via reward-free exploration, with a sample complexity significantly improving the previous results. Such a sample complexity matches our lower bound in the dependence on ϵ, as well as on K in the large-d regime, where d and K respectively denote the representation dimension and the action space cardinality. Finally, we provide a planning algorithm (requiring no further interaction with the true environment) for RAFFLE to learn a near-accurate representation, which is the first known representation learning guarantee under the same setting.

*Equal contribution

1. INTRODUCTION

Reward-free reinforcement learning, recently formalized by Jin et al. (2020b), arises as a powerful framework to accommodate diverse demands in sequential learning applications. Under the reward-free RL framework, an agent first explores the environment without reward information during the exploration phase, with the objective of achieving certain learning goals later on for any given reward function during the planning phase. Such a learning goal can be to find an ϵ-optimal policy, to achieve ϵ-accurate system identification, etc. The reward-free RL paradigm may find broad application in many real-world engineering problems. For instance, reward-free exploration can be efficient when various reward functions are taken into consideration over a single environment, as in safe RL (Miryoosefi & Jin, 2021; Huang et al., 2022), multi-objective RL (Wu et al., 2021), multi-task RL (Agarwal et al., 2022; Cheng et al., 2022), etc. Theoretical studies of reward-free RL have largely focused on characterizing the sample complexity needed to achieve a learning goal under various MDP models. Specifically, reward-free tabular RL has been studied in Jin et al. (2020a); Ménard et al. (2021); Kaufmann et al. (2021); Zhang et al. (2020). For reward-free RL with function approximation, Wang et al. (2020) studied linear MDPs introduced by Jin et al. (2020b), where both the transition and the reward are linear functions of a given feature extractor; Zhang et al. (2021b) studied linear mixture MDPs introduced by Ayoub et al. (2020); and Zanette et al. (2020b) considered a class of MDPs with low inherent Bellman error introduced by Zanette et al. (2020a). In this paper, we focus on reward-free RL under low-rank MDPs, where the transition kernel admits a decomposition into two embedding functions that map to low-dimensional spaces.
Compared with linear MDPs, the feature functions (i.e., the representation) under low-rank MDPs are unknown, hence the design further requires representation learning and becomes more challenging. Reward-free RL under low-rank MDPs was first studied by Agarwal et al. (2020), who introduced a provably efficient algorithm, FLAMBE, which achieves the learning goal of system identification with a sample complexity of Õ(H^22 K^9 d^7 / ϵ^10). Here d, H and K respectively denote the representation dimension, episode horizon, and action space cardinality. Later on, Modi et al. (2021) proposed a model-free algorithm MOFFLE for reward-free RL under low-nonnegative-rank MDPs (where the feature functions are non-negative), for which the sample complexity for finding an ϵ-optimal policy scales as Õ(H^5 K^5 d_LV^3 / (ϵ^2 η)). Here, d_LV denotes the non-negative rank of the transition kernel, which may be exponentially larger than d as shown in Agarwal et al. (2020), and η denotes the positive reachability probability to all states, where 1/η can be as large as √d_LV as shown in Uehara et al. (2022b). Recently, a reward-free algorithm called RFOLIVE has been proposed under non-linear MDPs with low Bellman Eluder dimension (Chen et al., 2022b), which can be specialized to low-rank MDPs. However, RFOLIVE is computationally more costly and considers a special reward function class, making its complexity result not directly comparable to other studies on reward-free low-rank MDPs. This paper investigates reward-free RL under low-rank MDPs to address the following important open questions:
• For low-rank MDPs, none of the previous studies establishes a lower bound on the sample complexity, i.e., a necessary sample requirement for finding a near-optimal policy.
• The sample complexity of the previous algorithms in Agarwal et al. (2020); Modi et al. (2021) for reward-free low-rank MDPs is polynomial in the involved parameters, but still much higher than desirable. It is vital to improve the algorithms to further reduce the sample complexity.
• Previous studies on low-rank MDPs did not provide an estimation accuracy guarantee on the learned representation (only on the transition kernels). However, such a representation learning guarantee can be very beneficial for reusing the learned representation in other RL environments.

1.1. MAIN CONTRIBUTIONS

We summarize our main contributions in this work below.
• Lower bound: We provide the first known lower bound Ω(HdK/ϵ^2) on the sample complexity that holds for any algorithm under the same low-rank MDP setting. Our proof relies on a novel construction of hard MDP instances that captures the necessity of the action space cardinality in the sample complexity. Interestingly, comparing this lower bound for low-rank MDPs with the upper bound for linear MDPs in Wang et al. (2020) further implies that it is strictly more challenging to find a near-optimal policy under low-rank MDPs than under linear MDPs.
• Algorithm: We propose a new model-based reward-free RL algorithm, RAFFLE, for low-rank MDPs. The central idea of RAFFLE lies in the construction of a novel exploration-driven reward, whose corresponding value function serves as an upper bound on the model estimation error. Hence, such a pseudo-reward encourages exploration to collect samples over those parts of the state-action space where the model estimation error is large, so that later stages of the algorithm can further reduce this error based on those samples. Such a reward construction is new for low-rank MDPs and serves as the key reason for our improved sample complexity.
• Sample complexity: We show that our algorithm can both find an ϵ-optimal policy and achieve ϵ-accurate system identification via reward-free exploration, with a sample complexity of Õ(H^3 d^2 K(d^2 + K) / ϵ^2), which matches our lower bound in the dependence on ϵ as well as on K in the large-d regime. Our result significantly improves the Õ(H^22 K^9 d^7 / ϵ^10) of Agarwal et al. (2020) for the same goal. It also improves the sample complexity of Õ(H^5 K^5 d_LV^3 / (ϵ^2 η)) in Modi et al. (2021) in three aspects: the order on K is reduced; d can be exponentially smaller than d_LV as shown in Agarwal et al. (2020); and there is no factor 1/η, which can be as large as √d_LV. Further, our result on reward-free RL naturally achieves the goal of reward-known RL, which improves the Õ(H^5 d^4 K^2 / ϵ^2) of Uehara et al. (2022b) by Θ(H^2).
• Near-accurate representation learning: We design a planning algorithm that exploits the exploration phase of RAFFLE to further learn a provably near-accurate representation of the transition kernel without requiring further interaction with the environment. To the best of our knowledge, this is the first theoretical guarantee on representation learning for low-rank MDPs.

2. PRELIMINARIES AND PROBLEM FORMULATION

Notation. For any H ∈ N, we denote [H] := {1, . . . , H}. For any vector x and symmetric matrix A, we denote by ∥x∥_2 the ℓ_2 norm of x and define ∥x∥_A := √(x^⊤ A x). For any matrix A, we denote by ∥A∥_F its Frobenius norm and let σ_i(A) be its i-th largest singular value. For two probability measures P, Q over a space Ω, we use ∥P − Q∥_TV to denote their total variation distance.
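To make the notation concrete, here is a minimal Python sketch of two quantities used repeatedly below, the weighted norm ∥x∥_A and the total variation distance between finite distributions (the function names are ours, not the paper's):

```python
import numpy as np

def weighted_norm(x, A):
    """||x||_A = sqrt(x^T A x) for a symmetric positive semidefinite A."""
    return float(np.sqrt(x @ A @ x))

def tv_distance(p, q):
    """Total variation distance between two finite distributions:
    ||P - Q||_TV = (1/2) * sum_i |p_i - q_i|."""
    return 0.5 * float(np.abs(p - q).sum())
```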

2.1. EPISODIC MDPS

We consider an episodic Markov decision process (MDP) M = (S, A, H, P, r), where S can be an arbitrarily large state space; A is a finite action space with cardinality K; H is the number of steps in each episode; P : S × A × S → [0, 1] is the time-dependent transition kernel, where P_h(s_{h+1}|s_h, a_h) denotes the transition probability from the state-action pair (s_h, a_h) at step h to state s_{h+1} at the next step; and r_h : S × A → [0, 1] denotes the deterministic reward function at step h. We further normalize the rewards so that Σ_{h=1}^H r_h ≤ 1. A policy π is a set of mappings {π_h : S → ∆(A)}_{h∈[H]}, where ∆(A) is the set of all probability distributions over the action space A. Further, a ∼ U(A) indicates the uniform selection of an action a from A. In each episode of the MDP, we assume that a fixed initial state s_1 is drawn. Then, at each step h ∈ [H], the agent observes state s_h ∈ S, takes an action a_h ∈ A under a policy π_h, receives a reward r_h(s_h, a_h) (in the reward-known setting), and the system transits to the next state s_{h+1} with probability P_h(s_{h+1}|s_h, a_h). The episode ends after H steps. As is standard in the literature, we use s_h ∼ (P, π) to denote a state sampled by executing the policy π under the transition kernel P for h − 1 steps. If the previous state-action pair (s_{h−1}, a_{h−1}) is given, we use s_h ∼ P to denote that s_h follows the distribution P_h(·|s_{h−1}, a_{h−1}). We use the notation E_{(s_h,a_h)∼(P,π)}[·] to denote the expectation over states s_h ∼ (P, π) and actions a_h ∼ π. For a given policy π and an MDP M = (S, A, H, P, r), we denote the value function starting from state s_h at step h as V^π_{h,P,r}(s_h) := E_{(s_{h′},a_{h′})∼(P,π)}[ Σ_{h′=h}^H r_{h′}(s_{h′}, a_{h′}) | s_h ]. We write V^π_{P,r} for V^π_{1,P,r}(s_1) for simplicity.
Similarly, we denote the action-value function starting from the state-action pair (s_h, a_h) at step h as Q^π_{h,P,r}(s_h, a_h) := r_h(s_h, a_h) + E_{s_{h+1}∼P}[ V^π_{h+1,P,r}(s_{h+1}) | s_h, a_h ]. We use P⋆ to denote the transition kernel of the true environment and, for simplicity, write E⋆_π[·] for E_{(s_h,a_h)∼(P⋆,π)}[·]. Given a reward function r, there always exists an optimal policy π⋆ that yields the optimal value V⋆_{P⋆,r} = sup_π V^π_{P⋆,r}.
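As a concrete illustration of these definitions, the following sketch evaluates V^π and Q^π by backward induction on a toy finite MDP (the arrays and names are our own illustration; the paper allows S to be arbitrarily large):

```python
import numpy as np

def policy_value(P, r, pi):
    """Evaluate V^pi by backward induction on a finite-horizon, finite MDP.

    P  : (H, S, A, S) array, P[h, s, a, s'] = P_h(s'|s, a)
    r  : (H, S, A) array of deterministic rewards r_h(s, a)
    pi : (H, S, A) array, pi[h, s, a] = pi_h(a|s)
    Returns V of shape (H+1, S), with V[H] = 0 by convention.
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        # Q_h(s,a) = r_h(s,a) + E_{s' ~ P_h(.|s,a)}[V_{h+1}(s')]
        Q = r[h] + P[h] @ V[h + 1]
        # V_h(s) = E_{a ~ pi_h(.|s)}[Q_h(s,a)]
        V[h] = (pi[h] * Q).sum(axis=1)
    return V
```

With rewards normalized so that Σ_h r_h ≤ 1, the returned values stay in [0, 1], matching the normalization above.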

2.2. LOW-RANK MDPS

This paper focuses on low-rank MDPs (Agarwal et al., 2020), defined as follows.

Definition 1 (Low-rank MDPs). A transition probability P⋆_h : S × A → ∆(S) admits a low-rank decomposition with dimension d ∈ N if there exist two embedding functions ϕ⋆_h : S × A → R^d and µ⋆_h : S → R^d such that P⋆_h(s′|s, a) = ⟨ϕ⋆_h(s, a), µ⋆_h(s′)⟩ for all s, s′ ∈ S, a ∈ A. For normalization, we assume ∥ϕ⋆_h(s, a)∥_2 ≤ 1 for all (s, a), and for any function g : S → [0, 1], ∥∫ µ⋆_h(s)g(s)ds∥_2 ≤ √d. An MDP M is a low-rank MDP with dimension d if for each h ∈ [H], P⋆_h admits a low-rank decomposition with dimension d. We use ϕ⋆ = {ϕ⋆_h}_{h∈[H]} and µ⋆ = {µ⋆_h}_{h∈[H]} to denote the embeddings for P⋆.

We remark that when ϕ⋆_h is revealed to the agent, low-rank MDPs specialize to linear MDPs (Wang et al., 2020; Jin et al., 2020b). Essentially, low-rank MDPs do not assume that the features {ϕ_h}_h are known a priori. The lack of knowledge of the features in fact invokes a nonlinear structure, which makes the model strictly harder than linear MDPs or tabular models. Since it is impossible to learn a model in polynomial time without any assumption on the features ϕ_h and µ_h, we adopt the following conventional assumption from recent studies on low-rank MDPs.

Assumption 1 (Realizability). The learning agent has access to a model class (Φ, Ψ) that contains the true model, i.e., ϕ⋆ ∈ Φ and µ⋆ ∈ Ψ. While we assume the cardinality of the function classes to be finite for simplicity, extensions to infinite classes with bounded statistical complexity (such as bounded covering number) are not difficult (Sun et al., 2019; Agarwal et al., 2020).
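Definition 1 can be illustrated numerically on a small finite instance. The sketch below builds a transition matrix from made-up rank-d factors; in the actual model, S may be infinite and ϕ⋆, µ⋆ are unknown to the agent:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d = 5, 3, 2  # a small finite instance, purely for illustration

# Rank-d factors: rows of `phi` index (s, a) pairs, columns of `mu` index s'.
phi = rng.random((S * A, d))   # plays the role of phi*_h(s, a) in R^d
mu = rng.random((d, S))        # plays the role of mu*_h(s') in R^d

# P(s'|s,a) = <phi(s,a), mu(s')>.  Normalizing each row into a probability
# distribution only rescales phi row-wise, so the rank-<=d structure survives.
P = phi @ mu
P /= P.sum(axis=1, keepdims=True)
```

Every row of `P` is a valid distribution over next states, yet the whole (S·A) × S matrix has rank at most d, which is the defining property of the model.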

2.3. REWARD-FREE RL AND LEARNING OBJECTIVES

Reward-free RL typically has two phases: exploration and planning. In the exploration phase, an agent explores the state space via interaction with the true environment and can collect samples over multiple episodes, but without access to reward information. In the planning phase, the agent is no longer allowed to interact with the environment and, for any given reward function, is required to achieve certain learning goals (elaborated below) based on the outcome of the exploration phase. The planning phase may require the agent to achieve different learning goals; in this paper, we focus on three such goals. The most popular goal in reward-free RL is to find a near-optimal policy that achieves the best value function under the true environment with ϵ-accuracy, as defined below.

Definition 2 (ϵ-optimal policy). Fix ϵ > 0. For any given reward function r, a learned policy π̂ is ϵ-optimal if it satisfies V^{π⋆}_{P⋆,r} − V^{π̂}_{P⋆,r} ≤ ϵ.

For model-based learning, Agarwal et al. (2020) proposed system identification as another useful learning goal, defined as follows.

Definition 3 (ϵ-accurate system identification). Fix ϵ > 0. Given a model class (Φ, Ψ), a learned model (φ̂, μ̂) is said to achieve ϵ-accurate system identification if it uniformly approximates the true model P⋆, i.e., for all π and h ∈ [H], E⋆_π[ ∥⟨φ̂_h(s_h, a_h), μ̂_h(·)⟩ − P⋆_h(·|s_h, a_h)∥_TV ] ≤ ϵ.

Besides those two common learning goals, we also propose an additional goal of near-accurate representation learning. Towards that end, we introduce the following divergence-based metric to quantify the distance between two representations, which has been used in supervised learning (Du et al., 2021b).

Definition 4 (Divergence between two representations). Given a distribution q over S × A and two representations ϕ, ϕ′ ∈ Φ, define the covariance between ϕ and ϕ′ w.r.t. q as Σ_{(s,a)∼q}(ϕ, ϕ′) = E_{(s,a)∼q}[ ϕ(s, a) ϕ′(s, a)^⊤ ].
Then, the divergence between ϕ and ϕ′ with respect to q is defined as D_q(ϕ, ϕ′) = Σ_{(s,a)∼q}(ϕ′, ϕ′) − Σ_{(s,a)∼q}(ϕ′, ϕ) Σ_{(s,a)∼q}(ϕ, ϕ)^† Σ_{(s,a)∼q}(ϕ, ϕ′). It can be verified that D_q(ϕ, ϕ′) ⪰ 0 (i.e., it is positive semidefinite) and that D_q(ϕ, ϕ) = 0 for any ϕ, ϕ′ ∈ Φ.
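A small numerical sketch of Definition 4, taking q to be an empirical distribution over N sampled state-action pairs (names and shapes are our own choices):

```python
import numpy as np

def divergence(Phi1, Phi2, w):
    """D_q(phi, phi') from Definition 4 under an empirical distribution q.

    Phi1, Phi2 : (N, d) matrices whose rows are phi(s_i, a_i), phi'(s_i, a_i)
    w          : (N,) probabilities q(s_i, a_i), summing to 1
    """
    def cov(A, B):  # Sigma_q(a, b) = E_q[a(s,a) b(s,a)^T]
        return (A * w[:, None]).T @ B

    S11 = cov(Phi1, Phi1)   # Sigma(phi, phi)
    S21 = cov(Phi2, Phi1)   # Sigma(phi', phi)
    S22 = cov(Phi2, Phi2)   # Sigma(phi', phi')
    # Sigma(phi',phi') - Sigma(phi',phi) Sigma(phi,phi)^+ Sigma(phi,phi')
    return S22 - S21 @ np.linalg.pinv(S11) @ S21.T
```

The two stated properties, D_q(ϕ, ϕ) = 0 and positive semidefiniteness, can be checked directly on random data; the divergence is exactly the Schur complement of the joint second-moment matrix of (ϕ, ϕ′).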

3. LOWER BOUND ON SAMPLE COMPLEXITY

In this section, we provide a lower bound on the sample complexity that any reward-free RL algorithm must satisfy under low-rank MDPs. The detailed proof can be found in Appendix C.

Theorem 1 (Lower bound). For any algorithm that can output an ϵ-optimal policy (as in Definition 2), if H > max(24ϵ, 4), S ≥ 6, K ≥ 3 and δ < 1/16, then there exists a low-rank MDP model M such that the number of trajectories sampled by the algorithm is at least Ω(HdK/ϵ^2).

To the best of our knowledge, Theorem 1 establishes the first lower bound for learning low-rank MDPs in the reward-free setting. More importantly, Theorem 1 shows that finding near-optimal policies is strictly more costly, in terms of sample complexity, under low-rank MDPs (which have unknown representations) than under linear MDPs (which have known representations), by at least a factor of Ω(K). This can be seen from the fact that the lower bound in Theorem 1 for low-rank MDPs carries an additional factor K compared to the upper bound Õ(d^3 H^4 / ϵ^2) provided in Wang et al. (2020) for linear MDPs. Intuitively, this can be explained as follows. In linear models, all the representations ϕ : S × A → R^d are known. Then, it requires at most O(d) actions with linearly independent features to realize all transitions ⟨ϕ, µ⟩. However, learning low-rank MDPs requires the agent to further select O(K) actions to access the unknown features, leading to a dependence on K. Our proof of the new lower bound features the following two novel ingredients in the construction of hard MDP instances. a) We divide the actions into two types. The first type of actions is mainly used to form a large state space through a tree structure. The second type of actions is mainly used to distinguish different MDPs. Such a construction allows us to treat the state space and the action space separately, so that both can be arbitrarily large. b) We explicitly define the feature vectors for all state-action pairs and, more importantly, the dimension is less than or equal to the number of states. These two ingredients together guarantee that the number of actions K can be arbitrarily large and independent of the other parameters d and S, which indicates that the dependence on the number of actions K is unavoidable.

4. THE RAFFLE ALGORITHM

In this section, we propose RAFFLE (see Algorithm 1) for reward-free RL under low-rank MDPs.

Summary of design novelty:

The central idea of RAFFLE lies in the construction of a novel exploration-driven reward, which is desirable because its corresponding value function serves as an upper bound on the model estimation error during the exploration phase. Hence, such a pseudo-reward encourages exploration to collect samples over those parts of the state-action space where the model estimation error is large, so that later stages of the algorithm can further reduce this error based on those samples. Such a reward construction is new for low-rank MDPs and serves as the key enabler for our improved sample complexity. It also necessitates various new ingredients in other steps of the algorithm, as elaborated below.

Exploration and MLE model estimation. In each iteration n of the exploration phase, for each h ∈ [H], the agent executes the exploration policy π_{n−1} (defined in the previous iteration) up to step h − 1, after which it takes two uniformly selected actions and stops after step h + 1. Different from FLAMBE (Agarwal et al., 2020), which collects a large number of samples for each episode, our algorithm uses each exploration policy to collect only one sample trajectory per episode, indexed by (n, h). Hence, the sample complexity of RAFFLE is much smaller than that of FLAMBE. In fact, such efficient sampling, together with our new termination criterion introduced later, benefits the sample complexity.

Design of exploration reward. The agent updates the empirical covariance matrix Û^(n)_h as
Û^(n)_h = λ_n I + Σ_{τ=1}^n φ̂^(n)_h(s^(τ,h+1)_h, a^(τ,h+1)_h) φ̂^(n)_h(s^(τ,h+1)_h, a^(τ,h+1)_h)^⊤, (1)
where (s^(τ,h+1)_h, a^(τ,h+1)_h, s^(τ,h+1)_{h+1}) is the triple collected at iteration τ, episode h + 1, and step h. Next, the agent uses both φ̂^(n)_h and Û^(n)_h to construct an exploration-driven reward function
b̂^(n)_h(s, a) = min{ α_n ∥φ̂^(n)_h(s, a)∥_{(Û^(n)_h)^{−1}}, 1 }, (2)
where α_n is a pre-determined parameter.
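The bonus in Equation (2) can be sketched numerically as follows, assuming the learned features are available as arrays (the function name and array shapes are our own choices for illustration):

```python
import numpy as np

def exploration_bonus(phi_data, phi_query, alpha=1.0, lam=1.0):
    """Elliptical bonus b(s,a) = min{alpha * ||phi(s,a)||_{U^{-1}}, 1},
    as in Equation (2), with U the regularized empirical covariance.

    phi_data  : (n, d) learned features of previously visited (s, a) pairs
    phi_query : (m, d) features at which to evaluate the bonus
    """
    d = phi_data.shape[1]
    # Equation (1): U = lam * I + sum_tau phi phi^T
    U = lam * np.eye(d) + phi_data.T @ phi_data
    U_inv = np.linalg.inv(U)
    # ||phi||_{U^{-1}} = sqrt(phi^T U^{-1} phi), computed row-wise
    quad = np.einsum("md,de,me->m", phi_query, U_inv, phi_query)
    return np.minimum(alpha * np.sqrt(quad), 1.0)
```

Directions of feature space that have been visited often shrink the bonus, while unexplored directions keep it near its cap of 1, which is exactly the incentive the pseudo-reward is designed to create.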
We note that although the individual reward b̂^(n)_h(s, a) at each step may not represent point-wise uncertainty, as indicated in Uehara et al. (2022b), we find that its cumulative version V^π_{P̂^(n), b̂^(n)} can serve as a trajectory-wise uncertainty measure for selecting the exploration policy. To see this, it can be shown that for any π and h,
E⋆_π[ ∥⟨φ̂^(n)_h(s_h, a_h), μ̂^(n)_h(·)⟩ − P⋆_h(·|s_h, a_h)∥_TV ] ≤ c′ V^{π_n}_{P̂^(n), b̂^(n)} + c_n/n,
where c′ is a constant and c_n = O(log n). As the iteration number n grows, the second term diminishes to zero, which indicates that V^{π_n}_{P̂^(n), b̂^(n)} (the value under the reward b̂^(n)) serves as a good upper bound on the estimation error of the true transition kernel. Hence, exploration guided by maximizing V^π_{P̂^(n), b̂^(n)} collects more trajectories over which the learned transition kernel is not yet estimated well, which helps reduce the model estimation error in the future.

Algorithm 1 RAFFLE (RewArd-Free Feature LEarning)
1: Input: α_n, ζ_n, ϵ > 0, δ ∈ (0, 1), regularizer λ_n, model classes {(µ, ϕ) : µ ∈ Ψ, ϕ ∈ Φ}.
2: Phase I: Exploration Phase. For iterations n = 1, 2, . . .:
3: for h = 1, . . . , H do
4:   Use π_{n−1}: roll into s_{h−1}, uniformly choose a_{h−1}, a_h, enter s_h, s_{h+1}.
5:   Collect data s^(n,h)_1, a^(n,h)_1, . . . , s^(n,h)_h, a^(n,h)_h, s^(n,h)_{h+1}.
6:   Add the triple (s^(n,h)_h, a^(n,h)_h, s^(n,h)_{h+1}) to the dataset D^n_h = D^{n−1}_h ∪ {(s^(n,h)_h, a^(n,h)_h, s^(n,h)_{h+1})}.
7:   Learn (φ̂^(n)_h, μ̂^(n)_h) = MLE(D^n_h).
8:   Update the transition dynamics P̂^(n) as P̂^(n)_h(s′|s, a) = ⟨φ̂^(n)_h(s, a), μ̂^(n)_h(s′)⟩.
9: end for
10: Update the empirical covariance matrix Û^(n)_h as in Equation (1).
11: Define the exploration-driven reward function b̂^(n)_h as in Equation (2).
12: Define the estimated value function V̂^π_{P̂^(n), b̂^(n)} based on P̂^(n) and b̂^(n) as in Equation (3).
13: Find the exploration policy π_n = arg max_π V̂^π_{P̂^(n), b̂^(n)}.
14: if 2V̂^{π_n}_{P̂^(n), b̂^(n)} + 2√K ζ_n ≤ ϵ then
15:   Terminate Phase I (Exploration Phase) and set P̂_ϵ = P̂^(n), b̂_ϵ = b̂^(n), π_ϵ = π_n, n_ϵ = n.
16: end if
17: Phase II: Planning Phase
18: Option 1 (learn a near-optimal policy): receive a reward function r = {r_h}^H_{h=1} and compute the policy π̂ = arg max_π V^π_{P̂_ϵ, r}.
19: Option 2 (system identification): let P̂ = P̂_ϵ.
20: Option 3 (learn a near-accurate representation): call RepLearn (Algorithm 2) and obtain φ̂.
21: Output: policy π̂, learned transition dynamics P̂_ϵ, learned representation φ̂.

Design of exploration policy. The agent defines a truncated value function iteratively, using the estimated transition kernel and the exploration-driven reward, as follows:
Q̂^π_{h, P̂^(n), b̂^(n)}(s_h, a_h) = min{ 1, b̂^(n)_h(s_h, a_h) + [P̂^(n)_h V̂^π_{h+1, P̂^(n), b̂^(n)}](s_h, a_h) }, V̂^π_{h, P̂^(n), b̂^(n)}(s_h) = E_π[ Q̂^π_{h, P̂^(n), b̂^(n)}(s_h, a_h) ]. (3)
The truncation here is important for improving the dependence of the sample complexity on H. The agent then finds a policy maximizing V̂^π_{P̂^(n), b̂^(n)} and uses it as the exploration policy for the next iteration.

Novel termination criterion. RAFFLE does not require a pre-determined maximum number of iterations as input. Instead, it terminates and outputs the current estimated model once the optimal value function V̂^π_{P̂^(n), b̂^(n)}, plus a minor term, falls below a threshold. Such a termination criterion essentially guarantees that the value functions under the estimated and true models are close to each other under any reward and policy, hence the exploration terminates after finitely many iterations. This criterion enables our algorithm to identify an accurate model, and later find a near-optimal policy, with fewer sample collections than FLAMBE (Agarwal et al., 2020), as discussed above. Additionally, our termination criterion provides strong performance guarantees on the output policy and estimator from the last iteration. In contrast, Uehara et al. (2022b) can only provide guarantees on a random mixture of the policies obtained over all iterations.
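For intuition, the truncated backup of Equation (3) can be sketched on a toy finite-state instance. Here we take the maximizing policy to be greedy, which attains the maximum over policies in the finite case; the actual algorithm works with learned models and function classes rather than explicit arrays:

```python
import numpy as np

def truncated_backup(P_hat, b):
    """Truncated optimistic planning in the spirit of Equation (3):
    Q_h(s,a) = min{1, b_h(s,a) + [P_hat_h V_{h+1}](s,a)},
    V_h(s)   = max_a Q_h(s,a)  (greedy policy attains the max over policies).

    P_hat : (H, S, A, S) estimated transition kernel
    b     : (H, S, A)    exploration-driven reward
    Returns V of shape (H+1, S) and a greedy policy of shape (H, S).
    """
    H, S, A, _ = P_hat.shape
    V = np.zeros((H + 1, S))
    greedy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        # truncation at 1 keeps every value in [0, 1]
        Q = np.minimum(1.0, b[h] + P_hat[h] @ V[h + 1])
        greedy[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return V, greedy
```

The truncation is what keeps V̂ bounded by 1 regardless of H, which is the source of the improved H-dependence mentioned above.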
Planning phase. Given any reward function r, the agent finds a near-optimal policy by planning with the learned transition dynamics P ϵ and the given reward r. Note that such planning with a known low-rank MDP is computationally efficient by assumption.

5. UPPER BOUNDS ON SAMPLE COMPLEXITY

In this section, we first show that the policy returned by RAFFLE is an ϵ-optimal policy with respect to any given reward r in the planning phase. The detailed proof can be found in Appendix A.

Theorem 2 (ϵ-optimal policy). Assume that M is a low-rank MDP with dimension d and that Assumption 1 holds. Given any ϵ, δ ∈ (0, 1) and any reward function r, let π̂ and P̂_ϵ be the outputs of RAFFLE, and let π⋆ := arg max_π V^π_{P⋆,r} be the optimal policy under the true model P⋆. Set α_n = Õ(√K + d^2) and λ_n = Õ(d). Then, with probability at least 1 − δ, we have V^{π⋆}_{P⋆,r} − V^{π̂}_{P⋆,r} ≤ ϵ, and the total number of trajectories collected by RAFFLE is upper bounded by Õ(H^3 d^2 K(d^2 + K)/ϵ^2).

We note that the upper bound in Theorem 2 matches the lower bound in Theorem 1 in the dependence on ϵ, as well as on K in the large-d regime. Compared with MOFFLE (Modi et al., 2021), which also finds an ϵ-optimal policy in reward-free RL, our result improves their sample complexity of Õ(H^5 K^5 d_LV^3/(ϵ^2 η)) in three aspects. First, the order on K is reduced. Second, the dimension d_LV of the underlying function class in MOFFLE can be exponentially larger than d, as shown in Agarwal et al. (2020). Finally, MOFFLE requires a reachability assumption, leading to a factor 1/η in the sample complexity, which can be as large as √d_LV. Further, Theorem 2 naturally achieves the goal of reward-known RL with the same sample complexity, which improves the Õ(H^5 d^4 K^2/ϵ^2) of Uehara et al. (2022b) by a factor of O(H^2). Proceeding to the learning objective of system identification, the learned transition kernel output by Algorithm 1 achieves ϵ-accurate system identification with the same sample complexity, as follows. The detailed proof can be found in Appendix B.

Theorem 3 (ϵ-accurate system identification). Under the same conditions as Theorem 2, let P̂_ϵ = {φ̂^ϵ_h, μ̂^ϵ_h} be the output of RAFFLE. Then, with probability at least 1 − δ, P̂_ϵ achieves ϵ-accurate system identification, i.e., for any π and h, E⋆_π[ ∥⟨φ̂^ϵ_h(s_h, a_h), μ̂^ϵ_h(·)⟩ − P⋆_h(·|s_h, a_h)∥_TV ] ≤ ϵ, and the number of trajectories collected by RAFFLE is upper bounded by Õ(H^3 d^2 K(d^2 + K)/ϵ^2).

Theorem 3 significantly improves the sample complexity of Õ(H^22 K^9 d^7/ϵ^10) in Agarwal et al. (2020) in the dependence on all involved parameters for achieving ϵ-accurate system identification.

6. NEAR-ACCURATE REPRESENTATION LEARNING

In low-rank MDPs, it is of great interest to learn the representation ϕ accurately, because other similar RL environments are likely to share the same representation (Rusu et al., 2016; Zhu et al., 2020; Dayan, 1993), and hence the learned representation can be directly reused in those environments. Thus, the third objective of RAFFLE in the planning phase is to provide an accurate estimate of ϕ. We note that although RAFFLE produces an estimate of ϕ during its execution, such an estimate does not come with an accuracy guarantee. Moreover, Theorem 3 on system identification provides a guarantee only on the entire transition kernel P̂, not on the representation φ̂. Further, none of the previous studies of reward-free RL under low-rank MDPs (Agarwal et al., 2020; Modi et al., 2021; Uehara et al., 2022b) established a guarantee on φ̂.

6.1. THE REPLEARN ALGORITHM

In this section, we present the RepLearn algorithm, which exploits the learned transition kernel from RAFFLE to learn a near-accurate representation without additional interaction with the environment. The formal version of RepLearn, Algorithm 2, is deferred to Appendix D. We explain its main idea as follows. First, for each h ∈ [H] and t ∈ [T], where T is the number of rewards, N_f state-action pairs (s_h, a_h) are generated from the distribution q_h. Note that the agent does not interact with the true environment during this data generation. Then, for any h, if we set the reward r at step h to be zero, Q^π_{P⋆,h,r}(s_h, a_h) has a linear structure in terms of the true representation ϕ⋆_h(s_h, a_h); namely, there exists a w_h, determined by r, P⋆ and π, such that Q^π_{P⋆,h,r}(s_h, a_h) = ⟨ϕ⋆_h(s_h, a_h), w_h⟩. Then, with the estimated transition kernel P̂ provided by RAFFLE and the well-designed rewards r_{h,t}, the values Q^{π_t}_{P̂,h,r_{h,t}}(s_h, a_h) can be computed efficiently and serve as targets for learning the representation via the following regression problem:
arg min_{ϕ_h∈Φ, w^t_h∈R^d} Σ_{t∈[T]} Σ_{(s_h,a_h)∈D^t_h} ( Q^{π_t}_{P̂,h,r_{h,t}}(s_h, a_h) − ⟨ϕ_h(s_h, a_h), w^t_h⟩ )^2. (4)
The main difference between our algorithm and that of Lu et al. (2021) for representation learning is that our data generation is based on P̂ from RAFFLE, which carries a natural estimation error but requires no interaction with the environment, whereas their algorithm assumes a generative model that collects data from the ground-truth transition kernel.
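For a finite representation class, the regression in Equation (4) reduces to a per-candidate least-squares fit over the weights, followed by picking the candidate with the smallest total loss. The sketch below illustrates this simplification of Algorithm 2 (all names are ours):

```python
import numpy as np

def rep_learn(candidates, targets):
    """Solve the regression of Equation (4) over a finite candidate class.

    For each candidate feature matrix Phi (rows phi(s_i, a_i)), the inner
    minimization over w_t is ordinary least squares, solved for all T
    rewards at once; the outer minimization picks the candidate with the
    smallest total squared loss.

    candidates : list of (N, d) feature matrices, one per phi in the class
    targets    : (N, T) matrix of target values Q^{pi_t}_{P_hat}(s_i, a_i)
    Returns the index of the selected representation.
    """
    best_idx, best_loss = -1, np.inf
    for idx, Phi in enumerate(candidates):
        W, *_ = np.linalg.lstsq(Phi, targets, rcond=None)  # all w_t at once
        loss = np.sum((targets - Phi @ W) ** 2)
        if loss < best_loss:
            best_idx, best_loss = idx, loss
    return best_idx
```

When the targets are exactly realizable by some candidate (as Assumption 1 guarantees up to the estimation error of P̂), that candidate attains near-zero loss and is selected.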

6.2. GUARANTEE ON ACCURACY

To guarantee that the representation learned by Algorithm 2 is sufficiently close to the ground truth, we need the following two assumptions. First, for the distributions {q_h}^H_{h=1} of the state-action pairs in Algorithm 2, it is desirable that P̂(·|s, a) approximates the true P⋆(·|s, a) well over those distributions, so that Q^π_{P̂,h,r}(s_h, a_h) approximates the ground truth well. Intuitively, if some state-action pairs can hardly be visited under any policy, the output P̂(·|s, a) of Algorithm 1 cannot approximate the true P⋆(·|s, a) well over those pairs. Hence, we adopt the following reachability-type assumption, which ensures that every state is likely to be visited under some policy; such assumptions are common in the relevant literature (Modi et al., 2021; Agarwal et al., 2020).

Assumption 2 (Reachability). For the true transition kernel P⋆, there exists a policy π_0 such that min_{s∈S} P^{π_0}_h(s) ≥ η_min, where P^{π_0}_h(·) : S → R is the density over S of rolling into state s at step h using policy π_0.

We further assume that the input distributions {q_h}^H_{h=1} are bounded by a constant C_B. Then, together with Assumption 2, for any (s, a) ∈ S × A, we have q_h(s, a) ≤ C_min P^{π_0}_h(s, a), where C_min = C_B/η_min. Next, we assume that the rewards chosen for generating the target Q-functions are sufficiently diverse, so that the targets Q^π_{P⋆,h,r}(s_h, a_h) span the entire representation space and thus guarantee accurate representation learning via Equation (4). Such an assumption is commonly adopted in the multi-task representation learning literature (Du et al., 2021b; Yang et al., 2021; Lu et al., 2021). To formally state the assumption, for any h ∈ [H], let {r_{h,t}}_{t∈[T]} be a set of T rewards (where T ≥ d) whose step-h component is zero. As a result, for each t, given π_t, there exists w^{t⋆}_h ∈ R^d such that Q^{π_t}_{P⋆,h,r_{h,t}}(s_h, a_h) = ⟨ϕ⋆_h(s_h, a_h), w^{t⋆}_h⟩. Let W⋆_h = [w^{1⋆}_h, . . . , w^{T⋆}_h] ∈ R^{d×T}.

Assumption 3 (Diverse rewards). The smallest singular value σ_d(W⋆_h) of the matrix W⋆_h defined above satisfies σ^2_d(W⋆_h) ≥ Ω(T/d), i.e., there exists a constant C_D > 0 such that σ^2_d(W⋆_h) ≥ C_D T/d.

We next characterize the accuracy of the representation output by Algorithm 2 in terms of the divergence of Definition 4 between the learned and ground-truth representations; the proof is deferred to Appendix D.

Theorem 4 (Guarantee for representation learning). Under Assumptions 1 and 3, for any ϵ, δ ∈ (0, 1), any h ∈ [H], and sufficiently large N_f, let P̂ be the output transition kernel of RAFFLE satisfying Theorem 3. Then, with probability at least 1 − δ, the output φ̂ of RAFFLE satisfies
∥D_{q_h}(ϕ⋆_h, φ̂_h)^{1/2}∥^2_F = O( ϵ d C_min / C_D + (d/C_D) √( log(2/δ) / (T N_f) ) ). (5)

We explain the bound in Equation (5) as follows. The first term arises from the ϵ system identification error of the RAFFLE output, and can be made as small as desired by choosing an appropriate ϵ. The second term reflects the randomness caused by sampling state-action pairs from the input distributions {q_h}_{h∈[H]}, and vanishes as N_f becomes large. Note that N_f is the number of simulated samples in Algorithm 2, which requires no interaction with the true environment; hence it can be made sufficiently large to guarantee a small error.
Theorem 4 shows that RAFFLE can learn a near-accurate representation, which can then be reused in other RL environments sharing the same representation, similar in spirit to how representation learning has been exploited in supervised learning (Du et al., 2021b).

7. RELATED WORK

Reward-free RL. While various studies (Oudeyer et al., 2007; Bellemare et al., 2016; Burda et al., 2018; Colas et al., 2018; Nair et al., 2018; Eysenbach et al., 2018; Co-Reyes et al., 2018; Hazan et al., 2019; Du et al., 2019; Pong et al., 2019; Misra et al., 2020) proposed exploration algorithms that achieve good coverage of the state space without using explicit reward signals, the paradigm of reward-free RL was first formalized theoretically by Jin et al. (2020a), who provided both upper and lower bounds on the sample complexity. For the tabular case, several follow-up studies (Kaufmann et al., 2021; Ménard et al., 2021) further improved the sample complexity; a reward-free approach was also proposed for solving constrained RL problems with any given reward-free RL oracle under both tabular and linear MDPs. Reward-free RL under low-rank MDPs. As discussed in Section 1, reward-free RL under low-rank MDPs has been studied recently (Agarwal et al., 2020; Modi et al., 2021), and our result significantly improves the sample complexity therein. When a finite latent state space is assumed, low-rank MDPs specialize to block MDPs, under which the algorithms PCID (Du et al., 2019) and HOMER (Misra et al., 2020) achieve sample complexities of $\tilde O\big(\frac{d^4 H^2 K^4}{\min(\eta^4\gamma^2_n, \epsilon^2)}\big)$ and $\tilde O\big(\frac{d^8 H^4 K^4}{\min(\eta^3, \epsilon^2)}\big)$, respectively. Our result on general low-rank MDPs can be used to further improve those results for block MDPs. Reward-known RL under low-rank MDPs. For reward-known RL, Uehara et al. (2022b) proposed a computationally efficient algorithm, REP-UCB, under low-rank MDPs. Our design of the reward-free algorithm for low-rank MDPs is inspired by their algorithm with several new ingredients, as discussed in Section 4, and improves their sample complexity in the dependence on $H$.
Meanwhile, algorithms have been proposed for MDP models with low Bellman rank (Jiang et al., 2017), low witness rank (Sun et al., 2019), bilinear classes (Du et al., 2021a) and low Bellman eluder dimension (Jin et al., 2021), all of which can be specialized to low-rank MDPs. However, those algorithms are computationally more costly, as remarked in Uehara et al. (2022b), although their sample complexity may have sharper dependence on $d$, $K$ or $H$. Specializing to block MDPs, Zhang et al. (2022) proposed an algorithm called BRIEE, which empirically achieves the state-of-the-art sample complexity for block MDP models. Besides, Zhang et al. (2021a) proposed an algorithm coined ReLEX for a slightly different low-rank model, and obtained a problem-dependent regret upper bound.

8. CONCLUSIONS

In this paper, we investigate reward-free reinforcement learning in which the underlying model admits a low-rank structure. Without further assumptions, we propose an algorithm called RAFFLE, which significantly improves the state-of-the-art sample complexity for accurate model estimation and near-optimal policy identification. We further design an algorithm that exploits the model learned by RAFFLE for accurate representation learning without further interaction with the environment. Although $\epsilon$-accurate system identification easily induces an $H\epsilon$-optimal policy, the relationship between these two learning goals under the general reward-free exploration setting remains under-explored, and is an interesting topic for future investigation.

A PROOF OF THEOREM 2

We first provide a proof outline to highlight our key ideas in the analysis of Theorem 2, and then provide the detailed proof. To simplify notation, we denote the total variation distance $\|\hat P^{(n)}_h(\cdot|s_h,a_h) - P^\star_h(\cdot|s_h,a_h)\|_{TV}$ by $f^{(n)}_h(s_h,a_h)$. Proof Outline for Theorem 2. Step 1 provides an upper bound on the difference of value functions under the estimated model $\hat P^{(n)}$ and the true model $P^\star$ for any given policy $\pi$ and reward $r$, as given in the following proposition (see Proposition 4 in Appendix A.2). Proposition 1 (Informal). There exist constants $c_n = O(\log n)$ and $c' = O(1)$ such that for any policy $\pi$ and reward $r$, with high probability, we have $V^\pi_{P^\star,r} - V^\pi_{\hat P^{(n)},r} \le V^{\pi_n}_{\hat P^{(n)},\hat b^{(n)}} + \sqrt{\frac{c_n}{n}}$. This proposition is inspired by Uehara et al. (2022b), but generalizes their result for infinite-horizon stationary MDPs with a fixed reward to the non-stationary, arbitrary-reward setting. The main proof idea is to first notice that for any reward, $V^\pi_{P^\star,r} - V^\pi_{\hat P^{(n)},r} \le V^\pi_{\hat P^{(n)},f^{(n)}}$. Then, we show that $V^\pi_{\hat P^{(n)},f^{(n)}} \le V^\pi_{\hat P^{(n)},\hat b^{(n)}} + \sqrt{\frac{c_n}{n}} \le V^{\pi_n}_{\hat P^{(n)},\hat b^{(n)}} + \sqrt{\frac{c_n}{n}}$, due to the optimality of the exploration policy $\pi_n$. Step 2 shows the sublinearity of the summation of $V^{\pi_n}_{\hat P^{(n)},\hat b^{(n)}}$: $\sum_{n\in[N]} V^{\pi_n}_{\hat P^{(n)},\hat b^{(n)}} \le \tilde O\big( Hd\sqrt{K(K+d^2)N} \big)$. Step 3 combines Steps 1 and 2 to conclude that RAFFLE terminates with polynomial sample complexity such that the value function difference between $P^\star$ and the returned model $\hat P_\epsilon$ is at most $\epsilon$. Proposition 3. With high probability, RAFFLE terminates after at most $\tilde O\big( H^2 d^2 K(d^2+K)/\epsilon^2 \big)$ iterations, and the output model $\hat P_\epsilon$ satisfies $V^\pi_{P^\star,r} - V^\pi_{\hat P_\epsilon,r} \le V^{\pi_{n_\epsilon}}_{\hat P_\epsilon,\hat b^{(n_\epsilon)}} + \sqrt{\frac{c_{n_\epsilon}}{n_\epsilon}} \le \epsilon/2$, where $n_\epsilon$ is the iteration at which RAFFLE terminates, i.e., $\hat P_\epsilon = \hat P^{(n_\epsilon)}$. Finally, with some algebraic operations, Proposition 3 concludes the proof of Theorem 2.
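The truncated exploration bonus that appears throughout the outline has a simple closed form. Below is a minimal sketch, with a hypothetical feature matrix and arbitrary constants standing in for the quantities $\hat\phi$, $\hat U^{(n)}$ and $\hat\alpha_n$ defined later in the appendix:

```python
import numpy as np

# Sketch of the truncated exploration bonus driving RAFFLE:
#   b(s, a) = min(alpha * ||phi(s, a)||_{U^{-1}}, 1),
# where U = Phi^T Phi + lam * I is an empirical feature covariance.
# The feature map, dimensions, and constants are hypothetical stand-ins.
rng = np.random.default_rng(1)
d, n, lam, alpha = 5, 200, 1.0, 2.0
Phi = rng.standard_normal((n, d))        # features of visited (s, a) pairs
U = Phi.T @ Phi + lam * np.eye(d)

def bonus(phi):
    # ||phi||_{U^{-1}} = sqrt(phi^T U^{-1} phi), truncated at 1.
    width = float(np.sqrt(phi @ np.linalg.solve(U, phi)))
    return min(alpha * width, 1.0)

# Directions well covered by the data receive a small bonus; a poorly
# covered (large, fresh) direction hits the truncation at 1.
covered = Phi.mean(axis=0)
print(bonus(covered) < 1.0, bonus(1000 * covered) == 1.0)  # True True
```

The truncation at 1 matches the fact that all value functions in the analysis are bounded by 1.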

A.1 SUPPORTING LEMMAS

We first present the high probability event. Lemma 1. We define Π n to be a uniform mixture of previous n -1 exploration policies: Π n = U(π 0 , π 1 , ..., π n-1 ). Denote the total variation of P (n) and P ⋆ , and the expected matrix of Û (n) h as follows. f (n) h (s h , a h ) = P ⋆ h (•|s h , a h ) - P (n) h (•|s h , a h ) T V , U (n) h,ϕ = n E s h ∼(P ⋆ ,Πn ) a h ∼U (A) ϕ(s h , a h )(ϕ(s h , a h )) ⊤ + λ n I, W (n) h,ϕ = n E (s h ,a h )∼(P ⋆ ,Πn) ϕ(s h , a h )(ϕ(s h , a h )) ⊤ + λ n I. ( ) where λ n = β 3 d log(2nH|Φ|/δ)) and β 3 = O(1) is a constant coefficient. Suppose Algorithm 1 runs N iterations and let events E 0 and E 1 be defined as follows. E 0 = ∀n ∈ [N ], h ∈ [H], s h ∈ S, a h ∈ A, E s h ∼(P ⋆ ,Πn) a h ∼U f (n) h (s h , a h ) 2 ≤ ζ n , E 1 = ∀n ∈ [N ], h ∈ [H], s h ∈ S, a h ∈ A, 1 5 φ(n) h-1 (s, a) (U (n) h-1, φ ) -1 ≤ φ(n) h-1 (s, a) ( Û (n) h-1 ) -1 ≤ 3 φ(n) h-1 (s, a) (U (n) h-1, φ ) -1 . where  P[E 1 ] ≥ 1 -δ/2. Therefore, P[E] ≥ 1 -δ. Based on Lemma 1, we can bound the exlporation-driven reward in RAFFLE as follows. Corollary 1. Given that the event E occurs, the following inequality holds for any n ∈ [N ], h ∈ [H], s h ∈ S, a h ∈ A: min αn 5 φ(n) h (s h , a h ) (U (n) h, φ ) -1 , 1 ≤ b(n) h (s h , a h ) ≤ 3 αn φ(n) h (s h , a h ) (U (n) h, φ ) -1 , where αn = 5 2β 3 nζ n (K + d 2 ). Proof. Recall b(n) h (s h , a h ) = min αn φ(n) h (s, a) ( Û (n) h ) -1 , 1 . Applying Lemma 1, we can immediately obtain the result. The following lemma extends the Lemmas 12 and 13 under infinite discount MDPs in Uehara et al. (2022b) to episodic MDPs. We provide the proof for completeness. Lemma 2. Let P h-1 = ⟨ϕ h-1 , µ h-1 ⟩ be a generic MDP model, and Π be an arbitrary and possibly mixture policy. Define an expected Gram matrix as follows M h-1,ϕ = λ n I + n E s h-1 ∼(P ⋆ ,Π) a h-1 ∼Π ϕ h-1 (s h-1 , a h-1 ) (ϕ h-1 (s h-1 , a h-1 )) ⊤ . Further, let f h-1 (s h-1 , a h-1 ) be the total variation between P ⋆ h-1 and P h-1 at time step h -1. 
Suppose g ∈ S × A → R is bounded by B ∈ (0, ∞), i.e., ∥g∥ ∞ ≤ B. Then, ∀h ≥ 2, ∀ policy π h , E s h ∼P h-1 a h ∼π h [g(s h , a h )|s h-1 , a h-1 ] ≤ ∥ϕ h-1 (s h-1 , a h-1 )∥ (M h-1,ϕ ) -1 × nK E s h ∼(P ⋆ ,Π) a h ∼U [g 2 (s h , a h )] + λ n dB 2 + nB 2 E s h-1 ∼(P ⋆ ,Π) a h-1 ∼Π [f h-1 (s h-1 , a h-1 ) 2 ]. Proof. We first derive the following bound: E s h ∼P h-1 a h ∼π h [g(s h , a h )|s h-1 , a h-1 ] = s h a h g(s h , a h )π(a h |s h )⟨ϕ h-1 (s h-1 , a h-1 ), µ h-1 (s h )⟩ds h ≤ ∥ϕ h-1 (s h-1 , a h-1 )∥ (M h-1,ϕ ) -1 a h g(s h , a h )π(a h |s h )µ h-1 (s h )ds h M h-1,ϕ , where the inequality follows from Cauchy-Schwarz inequality. We further expand the second term in the RHS of the above inequality as follows. a h g(s h , a h )π(a h |s h )µ h-1 (s h )ds h 2 M h-1,ϕ (i) ≤ n E s h-1 ∼(P ⋆ ,Π) a h-1 ∼Π   s h a h g(s h , a h )π h (a h |s h )µ(s h ) ⊤ ϕ(s h-1 , a h-1 )ds h 2   + λ n dB 2 = n E s h-1 ∼(P ⋆ ,Π) a h-1 ∼Π      E s h ∼P h-1 a h ∼π h g(s h , a h ) s h-1 , a h-1   2    + λ n dB 2 (ii) ≤ 2n E s h-1 ∼(P ⋆ ,Π) a h-1 ∼Π   E s h ∼P ⋆ h-1 a h ∼π h g(s h , a h ) s h-1 , a h-1 2   + λ n dB 2 + 2nB 2 E s h-1 ∼(P ⋆ ,Π) a h-1 ∼Π [f h-1 (s h-1 , a h-1 )] 2 (iii) ≤ 2n E s h-1 ∼(P ⋆ ,Π) a h-1 ∼Π   E s h ∼P ⋆ h-1 a h ∼π h g(s h , a h ) 2 s h-1 , a h-1   + λ n dB 2 + 2nB 2 E s h-1 ∼(P ⋆ ,Π) a h-1 ∼Π f h-1 (s h-1 , a h-1 ) 2 (iv) ≤ 2nK E s h ∼(P ⋆ ,Π) a h ∼U g(s h , a h ) 2 + λ n dB 2 + 2nB 2 E s h-1 ∼(P ⋆ ,Π) a h-1 ∼Π f h-1 (s h-1 , a h-1 ) 2 , where (i) follows from the assumption that ∥g∥ ∞ ≤ B, (ii) is due to that f h-1 (s h-1 , a h-1 ) is the total variation between P ⋆ h-1 and P h-1 at time step h -1 and the fact that (a + b) 2 ≤ 2a 2 + 2b 2 , (iii) follows from Jensen's inequality, and (iv) is due to importance sampling. This finishes the proof. Based on Lemma 2, we summarize three useful inequalities which bridges the total variation f  (n) h . Lemma 3. 
Define W (n) h,ϕ = n E s h ∼(P ⋆ ,Πn ) a h ∼Πn ϕ(s h , a h )(ϕ(s h , a h )) ⊤ + λ n I, where λ n = β 3 d log(2nH|Φ|/δ). Given that the event E occurs, the following inequalities hold. For any n, when h ≥ 2, E s h ∼ P (n) h-1 a h ∼π f (n) h (s h , a h ) s h-1 , a h-1 ≤ α n φ(n) h-1 (s h-1 , a h-1 ) (U (n) h-1, φ ) -1 , E s h ∼P * h-1 a h∼π f (n) h (s h , a h ) s h-1 , a h-1 ≤ α n ϕ * h-1 (s h-1 , a h-1 ) (U (n) h-1,ϕ ⋆ ) -1 , E s h ∼P * h-1 a h ∼π b(n) h (s h , a h ) s h-1 , a h-1 ≤ γ n ϕ * h-1 (s h-1 , a h-1 ) (W (n) h-1,ϕ ⋆ ) -1 , where α n = 2β 3 nζ n (K + d 2 ), γ n = 45β 3 nζ n Kd(K + d 2 ). Specially, when h = 1, E a1∼π f (n) 1 (s 1 , a 1 ) ≤ Kζ n , E a1∼π b(s 1 , a 1 ) ≤ 15α n dK n . ( ) Proof. We start by developing Equation (11) as follows. Given that the event E occurs, for h ≥ 2 we have E s h ∼ P (n) h-1 a h ∼π f (n) h (s h , a h ) s h-1 , a h-1 (i) ≤ φ(n) h-1 (s h-1 , a h-1 ) (U (n) h-1, φ ) -1 × nK E s h-1 ∼(P ⋆ ,Πn ) a h-1 ,a h ∼U s h ∼P ⋆ h (•|s h-1 ,a h-1 ) [f (n) h (s h , a h ) 2 ] + λ n d + n E s h-1 ∼(P ⋆ ,Πn) a h-1 ∼U f (n) h-1 (s h-1 , a h-1 ) 2 (ii) ≤ φ(n) h-1 (s h-1 , a h-1 ) (U (n) h-1, φ ) -1 × nK E s h-1 ∼(P ⋆ ,Πn ) a h-1 ,a h ∼U s h ∼P ⋆ h (•|s h-1 ,a h-1 ) [f (n) h (s h , a h ) 2 ] + λ n d + nK E s h-2 ∼(P ⋆ ,Πn) a h-2 ,a h-1 ∼U s h-1 ∼P ⋆ h-1 (•|s h-2 ,a h-2 ) f (n) h-1 (s h-1 , a h-1 ) 2 (iii) ≤ φ(n) h-1 (s h-1 , a h-1 ) (U (n) h-1, φ ) -1 2nζ n K + β 3 nζ n d 2 ≤ α n φ(n) h-1 (s h-1 , a h-1 ) (U (n) h-1, φ ) -1 , where (i) follows from Lemma 2 and the fact that f (n) h (s h , a h ) ≤ 1, (ii) follows from importance sampling at time step h -2, and (iii) follows from Lemma 1. Equation ( 12) follows from the arguments similar to the above. To obtain Equation ( 13), we first apply Lemma 2 and obtain E s h ∼P ⋆ h-1 a h ∼πn b(n) h (s h , a h ) s h-1 , a h-1 ≤ ϕ ⋆ h-1 (s h-1 , a h-1 ) (W (n) h-1,ϕ ⋆ ) -1 nK E s h ∼(P ⋆ ,Πn ) a h ∼U [{ b(n) h (s h , a h )} 2 ] + λ n d, where we use the fact that b (n) h (s h , a h ) ≤ 1. 
We further bound the term n E s h ∼(P ⋆ ,Πn ) a h ∼U [( b(n) h (s h , a h )) 2 ] as follows: n E s h ∼(P ⋆ ,Πn) a h ∼U b(n) h (s h , a h ) 2 ≤ n E s h ∼(P ⋆ ,Πn) a h ∼U α2 n φ(n) h (s h , a h ) 2 ( Û (n) h, φ ) -1 (i) ≤ n E s h ∼(P ⋆ ,Πn ) a h ∼U 9 α2 n φ(n) h (s h , a h ) 2 (U (n) h, φ ) -1 = 9α 2 n tr      n E s h ∼(P ⋆ ,Πn ) a h ∼U    φ(n) h (s h , a h ) φ(n) h (s h , a h ) ⊤   n E s h ∼(P ⋆ ,Πn ) a h ∼U φh (s h , a h ) φ(n) h (s h , a h ) ⊤ + λ n I   -1         ≤ 9α 2 n tr(I) = 9 α2 n d , where (i) follows from Lemma 1, and we use tr(A) to denote the trace of any matrix A.

Hence,

E s h ∼P * h-1 a h ∼π b(n) h (s h , a h ) s h-1 , a h-1 ≤ ϕ * h-1 (s h-1 , a h-1 ) (W (n) h-1,ϕ ⋆ ) -1 9K α2 n d + λ n d ≤ γ n ϕ * h-1 (s h-1 , a h-1 ) W (n) h-1,ϕ ⋆ ) -1 , where the last inequality follows from that αn = 5α n and the definition of γ n In addition, for h = 1, we have E a1∼πn f (n) 1 (s 1 , a 1 ) (i) ≤ K E a1∼U f (n) 1 (s 1 , a 1 ) 2 ≤ Kζ n , E a1∼πn b(s 1 , a 1 ) (ii) ≤ αn K E a1∼U ∥ φ1 (s 1 , a 1 )∥ 2 ( Û (n) 1, φ ) -1 ≤ 3α n K E a1∼U ∥ φ1 (s 1 , a 1 )∥ 2 (U (n) 1, φ ) -1 ≤ 3 25Kα 2 n d n = 15α n dK n ,
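The factor $d$ in the bound $n\,\mathbb E[\hat b^2] \le 9\hat\alpha_n^2 d$ above comes from a trace identity for PSD matrices, which can be checked directly (sizes below are arbitrary):

```python
import numpy as np

# The proof bounds n * E[b^2] using the trace identity
#   tr( A (A + lam * I)^{-1} ) <= d   for any PSD A in R^{d x d},
# since each eigenvalue mu of A contributes mu / (mu + lam) <= 1.
rng = np.random.default_rng(2)
d, lam = 6, 0.5
B = rng.standard_normal((d, 20))
A = B @ B.T                              # random PSD Gram matrix

trace_val = float(np.trace(A @ np.linalg.inv(A + lam * np.eye(d))))
print(0.0 <= trace_val <= d)  # True
```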

A.2 PROOF OF PROPOSITION 1

Equipped with Lemma 3, the following proposition provides an upper bound on the difference of value functions under the estimated model P (n) and the true model P ⋆ for any given policy π and reward r. Proposition 4 (Restatement of Proposition 1). For all n ∈ [N ], policy π and reward r, given that the event E occurs, we have V π P ⋆ ,r -V π P (n) ,r ≤ V π P (n) , b(n) + Kζ n . Proof. Step 1. We first show that V π P ⋆ ,r -V π P (n) ,r ≤ V π P (n) ,f (n) . Recall the definition of estimated value functions Vh, P (n) ,r (s h ) and Qh, P (n) ,r (s h , a h ): Qπ h, P (n) ,r (s h , a h ) = min 1, r h (s h , a h ) + P (n) h V π h+1, P (n) ,r (s h , a h ) , V π h, P (n) ,r (s h ) = E π Qπ h, P (n) ,r (s h , a h ) . We develop the proof by induction. For the base case h = H + 1, we have V π H+1, P (n) ,r (s H+1 ) -V π H+1,P ⋆ ,r (s H+1 ) = 0 = V π H+1, P (n) ,f (n) (s H+1 ). Assume that V π h+1, P (n) ,r (s h+1 ) -V π h+1,P ⋆ ,r (s h+1 ) ≤ V π h+1, P (n) ,f (n) (s h+1 ) holds for any s h+1 . Then, from Bellman equation, we have, Q π h, P (n) ,r (s h , a h ) -Q π h,P ⋆ ,r (s h , a h ) = P (n) h V π h, P (n) ,r (s h , a h ) -P ⋆ h V π h+1,P ⋆ ,r (s h , a h ) = P (n) h V π h+1, P (n) ,r -V π h+1,P ⋆ ,r (s h , a h ) + P (n) h -P ⋆ h V π h,P ⋆ ,r (s h , a h ) (i) ≤ min 1, f (n) h (s h , a h ) + P (n) h V π h+1, P (n) ,r -V π h+1,P ⋆ ,r (s h , a h ) (ii) ≤ min 1, f (n) h (s h , a h ) + P (n) h V π h+1, P (n) ,f (n) (s h , a h ) = Qπ h, P (n) ,f (n) (s h , a h ), where (i) follows from the (action) value function is at most 1, and (ii) follows from the induction hypothesis. Then, by the definition of V π h, P (n) ,r (s h ), we have V π h, P (n) ,r (s h ) -V π h,P ⋆ ,r (s h ) = E π Q π h, P (n) ,r (s h , a h ) -E π Q π h,P ⋆ ,r (s h , a h ) ≤ E π Q π h, P (n) ,r (s h , a h ) -Q π h,P ⋆ ,r (s h , a h ) (i) ≤ E π Qπ h, P (n) ,f (n) (s h , a h ) = V π h, P (n) ,f (n) (s h ) , where (i) follows from Equation ( 15). 
Therefore, by induction, we have V π P ⋆ ,r -V π P (n) ,r ≤ V π P (n) ,f (n) . Step 2. Then, we show that V π P (n) ,f (n) ≤ V π P (n) , b(n) + √ Kζ n . By Equation ( 11) and the fact that the total variation distance is upper bounded by 1, with probability at least 1 -δ/2, we have E P (n) ,π f (n) h (s h , a h ) s h-1 ≤ E a h-1 ∼π min α n φ(n) h-1 (s h-1 , a h-1 ) (U (n) h-1, φ ) -1 , 1 , ∀h ≥ 2. ( ) Similarly, when h = 1, E a1∼π f (n) 1 (s 1 , a 1 ) ≤ K E a∼U f (n) 1 (s 1 , a 1 ) 2 ≤ Kζ n . ( ) Based on Corollary 1, Equation ( 16) and α n = 5α n , we have E π b(n) h (s h , a h ) s h ≥ E π min α n φ(n) h (s h , a h ) (U (n) h, φ ) -1 , 1 ≥ E P (n) ,π f (n) h+1 (s h+1 , a h+1 ) s h . ( ) For the base case h = H, we have E P (n) ,π V π H, P (n) ,f (n) (s H ) s H-1 = E P (n) ,π f (n) H (s H , a H ) s H-1 ≤ E π b (n) H-1 (s H-1 , a H-1 )|s H-1 ≤ min 1, E π Qπ H-1, P (n) , b(n) (s H-1 , a H-1 ) s H-1 = V π H-1, P (n) , b(n) (s H-1 ). Assume that E P (n) ,π V π h+1, P (n) ,f (n) (s h+1 ) s h ≤ V π h, P (n) , b(n) (s h ) holds for step h + 1. Then, by Jensen's inequality, we obtain E P (n) ,π V π h, P (n) ,f (n) (s h ) s h-1 ≤ min 1, E P (n) ,π f (n) h (s h , a h ) + P (n) h V π h+1, P (n) ,f (n) (s h , a h ) s h-1 (i) ≤ min 1, E π b(n) h-1 (s h-1 , a h-1 ) + E P (n) ,π E P (n) ,π V π h+1, P (n) ,f (n) (s h+1 ) s h s h-1 (ii) ≤ min 1, E π b (n) h-1 (s h-1 , a h-1 ) + E P (n) ,π V π h, P (n) , b(n) (s h ) s h-1 = min 1, E π Qπ h-1, P (n) , b(n) (s h-1 , a h-1 ) = V π h-1, P (n) , b(n) (s h-1 ), where (i) follows from Equation ( 18), and (ii) is due to the induction hypothesis. By induction, we conclude that V π P (n) ,f (n) = E π f (s) 1 (s 1 , a 1 ) + E P (n) ,π V π 2, P (n) ,f (n) (s 2 ) s 1 ≤ Kζ n + V π P (n) , b(n) . Combining Step 1 and Step 2, we conclude that V π P ⋆ ,r -V π P (n) ,r ≤ Kζ n + V π P (n) , b(n) .
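Step 1's inequality can be checked on a tiny tabular example. Everything below (state/action counts, kernels, rewards, policy) is an arbitrary stand-in, with per-step rewards scaled so values stay in $[0,1]$ as in the proof, and with the value under the reward $f$ computed with the truncated Q-function:

```python
import random

# Finite-state check of Step 1 of Proposition 4:
#   V^pi_{P*, r} - V^pi_{Phat, r} <= V^pi_{Phat, f},
# where f_h(s, a) is the total-variation distance between Phat_h and P*_h.
random.seed(3)
S, A, H = 2, 2, 3

def rand_dist(k):
    p = [random.random() for _ in range(k)]
    z = sum(p)
    return [x / z for x in p]

P_star = [[[rand_dist(S) for _ in range(A)] for _ in range(S)] for _ in range(H)]
P_hat  = [[[rand_dist(S) for _ in range(A)] for _ in range(S)] for _ in range(H)]
r  = [[[random.random() / H for _ in range(A)] for _ in range(S)] for _ in range(H)]
pi = [[rand_dist(A) for _ in range(S)] for _ in range(H)]       # pi[h][s][a]

tv = [[[0.5 * sum(abs(P_hat[h][s][a][t] - P_star[h][s][a][t]) for t in range(S))
        for a in range(A)] for s in range(S)] for h in range(H)]

def values(P, rew, truncate):
    V = [0.0] * S
    for h in reversed(range(H)):
        Q = [[rew[h][s][a] + sum(P[h][s][a][t] * V[t] for t in range(S))
              for a in range(A)] for s in range(S)]
        if truncate:
            Q = [[min(1.0, q) for q in row] for row in Q]
        V = [sum(pi[h][s][a] * Q[s][a] for a in range(A)) for s in range(S)]
    return V

V_true = values(P_star, r, truncate=False)
V_hat  = values(P_hat, r, truncate=False)
V_f    = values(P_hat, tv, truncate=True)   # value of the TV "reward" under Phat
print(all(V_true[s] - V_hat[s] <= V_f[s] + 1e-9 for s in range(S)))  # True
```

This mirrors the induction in the proof: the one-step model error contributes at most the TV distance per step, accumulated along rollouts of the estimated model.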

A.3 PROOF OF PROPOSITION 2

The following lemma is key to ensure that RAFFLE terminates in finite episodes. Proposition 5 (Restatement of Proposition 2). Given that the event E occurs, ζ = log (2|Φ||Ψ|N H/δ) the summation of the truncated value functions V πn P (n) , b(n) under exploration policies {π n } n∈[N ] is sublinear, i.e., the following bound holds: n∈[N ] V πn P (n) , b(n) + Kζ n ≤ 32ζHd β 3 K(d 2 + K)N . Proof. Note that V π h, P (n) , b(n) ≤ 1 holds for any policy π and h ∈ [H]. We first have n) . Applying the Equation (13) and Equation ( 14), we obtain the following bound on the value function V πn P (n) , b(n) -V πn P ⋆ , b(n) ≤ E πn P (n) 1 V πn 2, P (n) , b(n) (s 1 , a 1 ) -P ⋆ 1 V πn 2,P ⋆ , b(n) (s 1 , a 1 ) = E πn P (n) 1 -P ⋆ 1 V πn 2, P (n) , b(n) (s 1 , a 1 ) + P ⋆ 1 V πn 2, P (n) , b(n) -V πn 2,P ⋆ , b(n) (s 1 , a 1 ) ≤ E πn f (n) 1 (s 1 , a 1 ) + P ⋆ 1 V πn 2, P (n) , b(n) -V πn 2,P ⋆ , b(n) ≤ . . . ≤ E (s h ,a h )∼(P ⋆ ,πn) H h=1 f (n) (s h , a h ) = V πn P ⋆ ,f (n) , which implies V πn P (n) , b(n) ≤ V πn P ⋆ , b(n) + V πn P ⋆ ,f V πn P ⋆ , b(n) : V πn P ⋆ , b(n) = H h=1 E s h ∼(P ⋆ ,πn ) a h ∼πn bn (s h , a h ) ≤ H h=2 E s h-1 ∼(P ⋆ ,πn ) a h-1 ∼πn γ n ϕ ⋆ h-1 (s h-1 , a h-1 ) (W (n) h-1,ϕ ⋆ ) -1 + 15α n dK n ≤ H h=1 E s h ∼(P ⋆ ,πn) a h ∼πn γ n ∥ϕ ⋆ h (s h , a h )∥ (W (n) h,ϕ ⋆ ) -1 + 15α n dK n . Similarly, we obtain V πn P ⋆ ,f (n) = H h=1 E s h ∼(P ⋆ ,πn ) a h ∼πn f (n) h (s h , a h ) ≤ H h=2 E s h-1 ∼(P ⋆ ,πn) a h-1 ∼πn α n ϕ ⋆ h-1 (s h-1 , a h-1 ) (U (n) h-1,ϕ ⋆ ) -1 + Kζ n ≤ H h=1 E s h ∼(P ⋆ ,πn ) a h ∼πn α n ∥ϕ ⋆ h (s h , a h )∥ (U (n) h,ϕ ⋆ ) -1 + Kζ n . 
Then, taking the summation of V πn P ⋆ , b(n) +f (n) over n ∈ [N ], we have n∈[N ] V πn P ⋆ ,f (n) + b(n) + Kζ n ≤ n∈[N ] 15α n dK n + 2 n∈[N ] Kζ n + n∈[N ] H h=1 E s h ∼(P ⋆ ,πn) a h ∼πn γ n ∥ϕ ⋆ h (s h , a h )∥ (W (n) h,ϕ ⋆ ) -1 + n∈[N ] H h=1 E s h ∼(P ⋆ ,πn ) a h ∼πn α n ∥ϕ ⋆ h (s h , a h )∥ (U (n) h,ϕ ⋆ ) -1 (i) ≤ 17α N √ dKN + γ N H h=1 N n∈[N ] E s h ∼(P ⋆ ,πn ) a h ∼πn ∥ϕ ⋆ h (s h , a h )∥ 2 (W (n) h,ϕ ⋆ ) -1 + α N H h=1 KN n∈[N ] E s h ∼(P ⋆ ,πn) a h ∼U ∥ϕ ⋆ h (s h , a h )∥ 2 (U (n) h,ϕ ⋆ ) -1 (ii) ≤ 17 ζ 2β 3 dK(K + d 2 )N + H 45β 3 ζdK(K + d 2 ) dN ζ + H β 3 ζ(K + d 2 ) dKN ζ ≤ 32ζHd β 3 K(d 2 + K)N , where (i) follows from Cauchy-Schwarz inequality and importance sampling, and (ii) follows from Lemma 10. Hence, the statement of Proposition 5 is verified.
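The sublinearity established in Proposition 5 ultimately rests on an elliptical-potential argument (Lemma 10 referenced in step (ii)). A minimal numerical sketch with random unit-norm features (all sizes arbitrary):

```python
import numpy as np

# Elliptical potential: with Lambda_n = lam*I + sum_{i<n} x_i x_i^T,
#   sum_n min(x_n^T Lambda_n^{-1} x_n, 1) <= 2 * d * log(1 + N / (d * lam)),
# using min(w, 1) <= 2*log(1 + w) and a log-det telescoping argument.
rng = np.random.default_rng(4)
d, N, lam = 5, 300, 1.0
X = rng.standard_normal((N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||x_n|| = 1

Lam = lam * np.eye(d)
total = 0.0
for x in X:
    total += min(float(x @ np.linalg.solve(Lam, x)), 1.0)
    Lam += np.outer(x, x)

bound = 2 * d * np.log(1 + N / (d * lam))
print(total <= bound)  # True: the potential grows only logarithmically
```

Dividing such a sum by $N$ is what yields the $\sqrt{N}$ (hence sublinear) growth of the cumulative bonuses.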

A.4 PROOF OF PROPOSITION 3

Based on Proposition 5, we argue that with enough number of iterations, RAFFLE can find P ϵ satisfying the condition in line 15 of Algorithm 1. Proposition 6 (Restatement of Proposition 3). Fix any δ ∈ (0, 1), ϵ > 0. Suppose the algorithm runs for N = 2 14 β3H 2 d 2 K(d 2 +K) log 2 (2|Φ||Ψ|H 3 d 2 K(d 2 +K)/(δϵ 2 )) ϵ 2 iterations, with probability at least 1 -δ, RAFFLE can find an n ϵ ≤ N in the exploration phase such that 2 V πn ϵ P (nϵ) , b(nϵ) + 2 Kζ nϵ ≤ ϵ. In other words, Algorithm 1 can output P ϵ = P (nϵ) satisfying the condition in line 15. In addition V π P ⋆ ,r -V π P (nϵ) ,r ≤ ϵ/2. Proof. We show that the algorithm terminates by contradiction. If it does not stop, applying Proposition 5, we have ϵN/2 < n∈[N ] V πn P (n) , b(n) + Kζ n ≤ 32ζHd β 3 K(d 2 + K)N . Therefore, N < 2 12 β 3 H 2 d 2 K(d 2 + K)ζ 2 ϵ 2 Recall ζ = N ζ N = log (2|Φ||Ψ|N H/δ). Using the fact that n ≤ c log 2 (α n n) ⇒ n ≤ 4c log 2 (α n c), ∀c ≥ e 2 , n ≥ 1, α n ∈ R + , it can be concluded that N < 2 14 β 3 H 2 d 2 K(d 2 + K) log 2 (2|Φ||Ψ|H 3 d 2 K(d 2 + K)/(δϵ 2 )) ϵ 2 , which is a contradiction. Therefore, there exists an n ϵ = O( H 2 d 2 K(d 2 +K) log 2 (|Φ||Ψ|H 3 d 2 K(d 2 +K)/(δϵ 2 )) ϵ 2 ) such that P ϵ = P (nϵ) satisfies 2 V πn ϵ P (nϵ ) , b(nϵ) + 2 Kζ nϵ ≤ ϵ. Combining Proposition 4, we finish the proof. Proof of Theorem 2. Recall that P ϵ is the output of RAFFLE in the n ϵ -iteration. Then, by Proposition 6 V ⋆ P ⋆ ,r -V π P ⋆ ,r ≤ V π ⋆ P ϵ ,r -V π P ⋆ ,r + ϵ/2 (i) ≤ V π P ϵ ,r -V π P ⋆ ,r + ϵ/2 ≤ ϵ/2 + ϵ/2 = ϵ, where (i) follows from the definition of π. The number of trajectories n ϵ H is at at most O H 3 d 2 K(d 2 + K) log 2 (|Φ||Ψ|H 3 d 2 K(d 2 + K)/(δϵ 2 )) ϵ 2 B PROOF OF THEOREM 3 In this section, we adopt the same notations as in Appendix A. The following lemma provides an upper bound for the estimation error of any learned model from the true model. Lemma 4. 
Fix δ ∈ (0, 1), for any h ∈ [H], n ∈ N + , any policy π, with probability at least 1 -δ/2, E s h ∼(P ⋆ ,π) s h ∼π f (n) h (s h , a h ) ≤ 2 Kζ n + 2 V πn P (n) , b(n) . Proof. Recall that f (n) h (s, a) = P (n) h (•|s, a) -P ⋆ h (•|s, a) T V . Fix any policy π, for any h ≥ 2, we have E s h ∼( P (n) ,π) a h ∼π Qπ h, P (n) , b(n) (s h , a h ) = E s h-1 ∼( P (n) ,π) a h-1 ∼π P (n) h V π h, P (n) , b(n) (s h-1 , a h-1 ) ≤ E s h-1 ∼( P (n) ,π) a h-1 ∼π min 1, b(n) h-1 (s h-1 , a h-1 ) + P (n) h-1 V π h, P (n) , b(n) (s h-1 , a h-1 ) = E s h-1 ∼( P (n) ,π) a h-1 ∼π Qπ h-1, P (n) , b(n) (s h-1 , a h-1 ) ≤ . . . ≤ E a1∼π Qπ 1, P (n) , b(n) (s 1 , a 1 ) = V π P (n) , b(n) . Hence, for h ≥ 2, we have E s h ∼( P (n) ,π) a h ∼π f (n) h (s h , a h ) (i) ≤ E s h-1 ∼( P (n) ,π) a h-1 ∼π b(n) h-1 (s h-1 , a h-1 ) (ii) ≤ E s h-1 ∼( P (n) ,π) a h-1 ∼π Qπ h-1, P (n) , b(n) (s h-1 , a h-1 ) (iii) ≤ V π P (n) , b(n) , where (i) follows from Equation ( 18), (ii) follows from the definition of Qπ h-1, P (n) , b(n) (s h-1 , a h-1 ) and (iii) follows from Equation ( 19). E ∼(P ⋆ ,π) s h ∼π f (n) h (s h ,a h ) ≤ E s h ∼( P (n) ,π) a h ∼π f (n) h (s h , a h ) + E s h ∼(P ⋆ ,π) a h ∼π f (n) h (s h , a h ) - E s h ∼( P (n) ,π) a h ∼π f (n) h (s h , a h ) (i) ≤ ( V π P (n) , b(n) + Kζ n ) + Kζ n + V π P (n) , b(n) (ii) ≤ 2 Kζ n + 2 V πn P (n) , b(n) , where the first term in (i) is due to Equation (20) and Equation ( 14), the second term in (i) is due to Proposition 4 and (ii) follows from the definition of π n . Proof of Theorem 3. By Proposition 6, let n ϵ = O H 2 d 2 K(d 2 +K) log 2 (|Φ||Ψ|N H 3 d 2 K(K+d 2 )/(δϵ 2 )) ϵ 2 , with no more than n ϵ H trajectories, RAFFLE can learn a model P ϵ , bonus bϵ and policy π ϵ at the n ϵ -th iteration satisfying 2V 2020) requires S ≥ K, where S, K denote the cardinality of state and action space respectively. Our hard MDP instances remove the assumption that S ≥ K by constructing the action set with two types of actions. 
The first type of actions is mainly used to form a large state space through a tree structure. The second type of actions is mainly used to distinguish different MDPs. Such a construction allows us to separately treat the state space and the action space, so that both state and action spaces can be arbitrarily large. We then explicitly define the feature vectors for all state-action pairs and show our hard MDP instances have a lowrank structure with dimension d = S. In a nutshell, we construct a family of HdK MDPs that are hard to distinguish in KL divergence, while the corresponding optimal policies are very different as shown in Figure 1 . First, we define a reference MDP M 0 as follows. We start with the construction of the state space S and the action space A. • Let A = {a w , a 1 , a 2 } A 0 , where a w denotes 'waiting action', a 1 , a 2 are two unique actions that form the binary tree, and |A 0 | = K -3. Then, the transition probabilities of M 0 are specified through the following rules. • The initial state is the waiting state s w . • If the agent takes the waiting action a w before time step H, waiting state s w stays on itself. Otherwise, s w transits to the root state s 1,1 of the binary tree. Mathematically, P h [s w |s w , a w ] = 1 h≤ H , and P h [s 1,1 |s w , a] = 1 a̸ =aw or h> H . • When i < D, for states s i,j in the binary tree, we have the following transition rules: -If the agent takes actions a 1 or a 2 , s i,j deterministically transits to its children s i+1,2j-1 or s i+1,2j , respectively. Mathematically, P h [s i+1,2j-1 |s i,j , a 1 ] = 1, and P[s i+1,2j |s i,j , a 2 ] = 1. -If the agent takes any action other than a 1 or a 2 , the agent will reach the outlier state s o . Mathematically, P h [s o |s i,j , a] = 1, ∀a ̸ = a 1 , a 2 . • Leaf state s D,j uniformly transits to good state s g and bad state s b no matter what action the agent takes. Mathematically, P h [s g |s D,j , a] = P h [s b |s D,j , a] = 1 2 , ∀a ∈ A. 
• Good state s g , bad state s b , and outlier state s o are absorbing states. Now, we define the features as follows, which are S-dimensional vectors. From Equation ( 22), the event is equal to the event {V * M (h * ,ℓ * ,a * ) -V πτ M (h * ,ℓ * ,a * ) ≤ ϵ}. As a result, P (h * ,ℓ * ,a * ) ε τ (h * ,ℓ * ,a * ) = P (h * ,ℓ * ,a * ) V * M (h * ,ℓ * ,a * ) -V πτ M (h * ,ℓ * ,a * ) ≤ ϵ ≥ 1 -δ. Recall that N τ (h * ,ℓ * ,a * ) = τ n=1 1 {(s n h * ,s n h * )=(s ℓ * ,a * )} such that (h * ,ℓ * ,a * ) N τ (h * ,ℓ * ,a * ) ≤ τ . This inequality holds because the agent is likely to fall into the outlier state s o . We denote P 0 and E 0 to be with respect to M 0 . Now, we invoke an intermediate result in the proof of Theorem 7 in Domingues et al. (2020) to conclude that E 0 N τ (h * ,ℓ * ,a * ) ≥ 1 16ϵ 2 1 -P 0 {ε τ (h * ,ℓ * ,a * ) } log 1 δ -log(2) . Summing over all (h * , ℓ * , a * ), we have E 0 [τ ] ≥ (h * ,ℓ * ,a * ) E 0 N τ (h * ,ℓ * ,a * ) ≥ 1 16ϵ 2     HLK - (h * ,ℓ * ,a * ) P 0 ε τ (h * ,ℓ * ,a * )   log( 1 δ ) -HLK log 2   . ( ) Notice that (h * ,ℓ * ,a * ) P 0 ε τ (h * ,ℓ * ,a * ) = E 0   (h * ,ℓ * ,a * ) 1 {P πτ M (h * ,ℓ * ,a * ) [s h * =s ℓ * ,a h * =a * ]≥ 1 2 }   ≤ 1. Substituting Equation ( 24) into Equation ( 23) yields E 0 [τ ] ≥ 1 16ϵ 2 HLK -1 log( 1 δ ) -HLK log 2 ≥ 1 32ϵ 2 HLK log( 1 δ ), where we use the fact that δ < 1/16. With the assumption of K ≥ 3, S ≥ 6, we have d = S. Taking H = H 3 and with the assumption of D ≤ H/3, we have E 0 [τ ] = Ω HdK ϵ 2 log( 1 δ ) . Then following the analysis similar to that for Corollary 8 in Domingues et al. ( 2020), with probability at least 1 -δ, the number of iterarion is at least Ω HdK ϵ 2 log( 1 δ ) .
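The transition rules of the reference MDP $M_0$ above can be written down directly. The sketch below follows the stated rules (waiting state, binary tree, outlier/good/bad absorbing states); the concrete depth $D$ and waiting length are illustrative, not the values used in the proof:

```python
# Sketch of the reference MDP M_0 from the lower-bound construction.
D, Hbar = 3, 2                       # illustrative tree depth and waiting phase
WAIT, OUT, GOOD, BAD = "s_w", "s_o", "s_g", "s_b"

def step(h, state, action):
    """Return {next_state: prob} under M_0 at timestep h."""
    if state == WAIT:
        if action == "a_w" and h <= Hbar:
            return {WAIT: 1.0}                   # keep waiting
        return {(1, 1): 1.0}                     # move to the root s_{1,1}
    if state in (GOOD, BAD, OUT):
        return {state: 1.0}                      # absorbing states
    i, j = state                                 # tree node s_{i,j}
    if i == D:
        return {GOOD: 0.5, BAD: 0.5}             # leaves split uniformly
    if action == "a_1":
        return {(i + 1, 2 * j - 1): 1.0}         # left child
    if action == "a_2":
        return {(i + 1, 2 * j): 1.0}             # right child
    return {OUT: 1.0}                            # any other action fails

# Every root-to-leaf path is selected by a unique a_1/a_2 sequence,
# which is what makes the 2^D leaves (hence a large state space) reachable.
print(step(1, WAIT, "a_w"), step(2, (1, 1), "a_2"))
```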

D ALGORITHM 2: REPLEARN AND PROOF OF THEOREM 4

We first present the full algorithm in Section 6 below as Algorithm 2.

D.1 SUPPORTING LEMMAS

We first show that Q π P ,h,r can approximate Q π P ⋆ ,h,r well over distribution {q h } h∈[H] by following two lemmas. Lemma 5. Given any δ ∈ (0, 1). Let P be the output of Algorithm 1, for any policy π and rewards r, with probability at least 1 -δ, we have  E (s h ,a h )∼( P ,π) Q π P ⋆ ,h,r (s h , a h ) -Q π P ,h,r (s h , a h ) ≤ϵ E (s h ,a h )∼(P ⋆ ,π) Q π P ⋆ ,h,r (s h , a h ) -Q π P ,h, Output: φ = { φh } h∈[H] . Proof. Define fh (s h , a h ) = Ph (•|s h , a h ) -P ⋆ (•|s h , a h ) T V and f is a collection of all , i.e. fh h∈[H] . E (s h ,a h )∼( P ,π) Q π P ⋆ ,h,r (s h , a h ) -Q π P ,h,r (s h , a h ) (i) ≤ E (s h ,a h )∼( P ,π) Qπ h, P , f (s h , a h ) (ii) ≤ V π P , f (iii) ≤ ϵ/2, where (i) follows from Equation ( 15), (ii) follows from the definition of V and Q, and (iii) follows from the proof of Theorem 2. Then for the second inequality, E (s h ,a h )∼(P ⋆ ,π) Q π P ⋆ ,h,r (s h , a h ) -Q π P ,h,r (s h , a h ) ≤ E (s h ,a h )∼( P ,π) Q π P ⋆ ,h,r (s h , a h ) -Q π P ,h,r (s h , a h ) + E(s h ,a h )∼(P ⋆ ,π) Q π P ⋆ ,h,r (s h ,a h )-Q π P ,h,r (s h ,a h ) -E (s h ,a h )∼( P ,π) Q π P ⋆ ,h,r (s h ,a h )-Q π P ,h,r (s h ,a h ) (i) ≤ ϵ/2 + V π P , f (ii) ≤ ϵ, where the first term in (i) is due to Equation (25) and Equation ( 14), the second term in (i) is due to Step 1 in Proposition 4 and (ii) follows from the proof of Theorem 2. Lemma 6. Given any δ, ϵ ∈ (0, 1) and the output of Algorithm 1, under Assumption 2, for any policy π and rewards r, let the input distributions {q h } H h=1 are bounded with constant C B . Denote C min = C B ηmin . Then with probability at least 1 -δ, for each h ∈ [H], E (s h ,a h )∼q h Q π P ⋆ ,h,r (s h , a h ) -Q π P ,h,r (s h , a h ) ≤ ϵC min . Proof. 
First, together with Assumption 2, for any (s, a) ∈ S × A, we have q h (s, a) ≤ C min P π 0 h (s, a), then E (s h ,a h )∼q h Q π P ⋆ ,h,r (s h , a h ) -Q π P ,h,r (s h , a h ) = q h (s h , a h ) Q π P ⋆ ,h,r (s h , a h ) -Q π P ,h,r (s h , a h ) ds h da h (i) = C min P π 0 h (s h ) Q π P ⋆ ,h,r (s h , a h ) -Q π P ,h,r (s h , a h ) ds h da h = C min E (s h ,a h )∼(P ⋆ ,π 0 ) Q π P ⋆ ,h,r (s h , a h ) -Q π P , h,r (s h , a h ) (ii) ≤ ϵC min , where (i) follows Assumption 2 and (ii) follows Lemma 5. The lemma above shows that for any reward r, Q π P ,h,r (s h , a h ) can be the target of Q π P ⋆ ,h,r (s h , a h ) when (s h , a h ) are chosen from given distribution q h . We then show that for any h, when reward r is set to be zero at step h, Q π P ⋆ ,h,r has a linear structure w.r.t ϕ ⋆ h . Lemma 7. For any h ∈ [H], policy π and given (s h , a h ) ∈ S × A, given any r such that r is set to be zero at step h, i.e. r h = 0, then Q π P ⋆ ,h,r (s h , a h ) is linear with respect to ϕ ⋆ (s h , a h ), i.e. there exist a w ⋆ h such that Q π P ⋆ ,h,r (s h , a h ) = ⟨ϕ ⋆ h (s h , a h ), w ⋆ h ⟩. Proof. Q π P ⋆ ,h,r (s h , a h ) = r h (s h , a h ) + E s h+1 ∼P ⋆ (•|s h ,a h ) V π P ⋆ ,h+1,r (s h+1 ) s h , a h = P ⋆ (s h+1 |s h , a h )V π P ⋆ ,h+1,r (s h+1 )ds h+1 = ϕ ⋆ h (s h , a h ), µ ⋆ h (s h+1 )V π P ⋆ ,h+1,r (s h+1 )ds h+1 = ⟨ϕ ⋆ h (s h , a h ), w ⋆ h ⟩ , where w ⋆ h = µ ⋆ h (s h+1 )V π P ⋆ ,h+1,r (s h+1 )ds h+1 .
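Lemma 7 admits a direct finite-state sanity check: for a synthetic low-rank kernel $P(s'|s,a) = \langle \phi(s,a), \mu(s')\rangle$ and zero reward at step $h$, the Q-function is exactly linear in $\phi$. All matrices below are made-up stand-ins for $\phi^\star$, $\mu^\star$ and the next-step value:

```python
import numpy as np

# Finite-state check of Lemma 7: with P(s'|s,a) = <phi(s,a), mu(s')> and
# r_h = 0,  Q(s,a) = E_{s'~P}[V(s')] = <phi(s,a), w>,  w = sum_{s'} mu(s')V(s').
rng = np.random.default_rng(5)
S, A, d = 6, 3, 2
Phi = rng.random((S * A, d))                   # phi(s, a) as rows
Mu = rng.random((d, S))                        # mu(s') as columns
Phi /= (Phi @ Mu).sum(axis=1, keepdims=True)   # make each P(.|s,a) sum to 1
P = Phi @ Mu                                   # exactly low-rank and stochastic

V_next = rng.random(S)                         # any next-step value function
Q = P @ V_next                                 # r_h = 0, so Q = E[V(s')]
w = Mu @ V_next                                # w*_h from the lemma
print(np.allclose(Q, Phi @ w))  # True: Q is linear in phi
```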

D.2 PROOF OF THEOREM 4

Proof of Theorem 4. We allow ϕ and Q π t P ,h,r h,t to apply to all the samples in a dataset matrix simultaneously, i.e. ϕ h (D N f ,t h ) = (ϕ h (s 1,t h ), . . . , ϕ h (s N f ,t h )) ⊤ ∈ R N f ×d and Q π t P ,h,r h,t (D N f ,t h ) = (Q π t P ,h,r h,t (s 1,t h ), . . . , Q π t P ,h,r h,t (s N f ,t )) ⊤ ∈ R N f . After we estimating φh and then taking it as known, from the property of linear regression in Equation ( 4), for any reward function r h,t , any policy π t we got w t h = ( φh (D N f ,t h ) ⊤ φh (D N f ,t h )) † φh (D N f ,t h ) ⊤ Q π P ,h,r h,t (D N f ,t h ) φh (D N f ,t h ) w t h = P φh (D N f ,t h ) Q π t P ,h,r h,t (D N f ,t h ), where P φh (D N f ,t h ) = φh (D N f ,t h )( φh (D N f ,t h ) ⊤ φh (D N f ,t h )) † φh (D N f ,t h ) ⊤ , represents the projection operator to the column spaces of φh (D N f ,t h ). Then t∈[T ] P ⊥ φh (D N f ,t h ) Q π t P ,h,r h,t (D N f ,t h ) 2 = t∈[T ] φh (D N f ,t h ) w t h -Q π t P ,h,r h,t (D N f ,t h ) 2 (i) ≤ t∈[T ] ϕ ⋆ h (D N f ,t h )w t h ⋆ -Q π t P ,h,r h,t (D N f ,t h ) 2 = t∈[T ] N f n=1 Q π t P ⋆ ,h,r h,t (s n,t h , a n,t h ) -Q π t P ,h,r h,t (s n,t h , a n,t h ) 2 (ii) ≤ N f t∈[T ] E (s h ,a h )∼q h Q π t P ⋆ ,h,r h,t (s n,t h , a n,t h ) -Q π t P ,h,r h,t (s n,t h , a n,t h ) 2 + T N f log 2 δ 2 (iii) ≤ N f t∈[T ] E (s h ,a h )∼q h Q π t P ⋆ ,h,r h,t (s n,t h , a n,t h ) -Q π t P ,h,r h,t (s n,t h , a n,t h ) + T N f log 2 δ 2 (iv) ≤ ϵC min N f T + T N f log 2 δ 2 , where (i) follows from minimality of { φt h } t∈[T ] and { w t h } t∈[T ] , (ii) follows Hoeffding's inequality, (iii) follows that Q π t P ⋆ ,h,r h,t (s n,t h , a n,t h ) -Q π t P ,h,r h,t (s n,t h , a n,t h ) ≤ 1 and (iv) follows Lemma 6. 
As a result: t∈[T ] P ⊥ φh (D N f ,t h ) ϕ ⋆ h (D N f ,t h )w t h ⋆ 2 = t∈[T ] P ⊥ φh (D N f ,t h ) Q π t P ⋆ ,h,r h,t (D N f ,t h ) 2 ≤ t∈[T ] P ⊥ φh (D N f ,t h ) Q π t P ,h,r h,t (D N f ,t h ) 2 + P ⊥ φh (D N f ,t h ) Q π t P ⋆ ,h,r h,t (D N f ,t h ) -Q π P ,h,r h,t (D N f ,t h ) 2 (i) ≤ t∈[T ] ϵC min N f T + T N f log 2 δ 2 + t∈[T ] σ 2 1 (P ⊥ φh (D N f ,t h ) )∥Q π t P ⋆ ,h,r h,t (D N f ,t h )) -Q π t P ,h,r h,t (D N f ,t h )∥ 2 (ii) ≤ 2ϵC min N f T + 2T N f log 2 δ , where (i) follows from ∥Av∥ 2 ≤ σ 1 (A) ∥v∥ 2 and (ii) follows from that σ 1 (P ⊥ φh (D N f ,t h ) ) ≤ 1 and the process to derive Equation (26).  H 5 d 3 LV K 5 ϵ 2 η ) HOMER (MISRA ET AL., 2020) BLOCK MDP Õ(d 8 H 4 K 4 ( 1 ϵ 2 + 1 η 3 )) RAFFLE (OURS) LOW-RANK MDP Õ( H 3 d 2 K(d 2 +K) ϵ 2 ) 1 We do not include reward-known RL under low-rank MDPs in this table and only focus reward-free RL. The detailed discussion of reward-known RL under low-rank MDPs can be found in Section 7. We finally use the technique in Du et al. (2021b) to derive the super population guarantee. 2ϵC min N f T + 2T N f log 2 δ ≥ t∈[T ] P ⊥ φh (D N f ,t h ) ϕ ⋆ h (D N f ,t h )w t h ⋆ 2 F = I -φh (D N f ,t h ) φh (D N f ,t h ) ⊤ φh (D N f ,t h ) † φh (D N f ,t h ) ⊤ ϕ ⋆ h (D N f ,t h )w t h ⋆ 2 F = t∈[T ] (w t h ⋆ ) ⊤ ϕ ⋆ h (D N f ,t h ) ⊤ I -φh (D N f ,t h )( φh (D N f ,t h ) ⊤ φh (D N f ,t h )) † φh (D N f ,t h ) ⊤ ϕ ⋆ h (D N f ,t h )w t h ⋆ = t∈[T ] N f (w t h ⋆ ) ⊤ D D N f ,t h (ϕ ⋆ h , φh )w t h ⋆ (i) ≥ 0.9 There is some concurrent work also using an optimistic MLE-based approach for different settings (POMDP) (Liu et al., 2022; Chen et al., 2022a) . We elaborate the key differences between our paper and Liu et al. (2022) ; Chen et al. (2022a) as follows. 
$$\begin{aligned}
&= 0.9\, N_f \sum_{t\in[T]} \big\| D_{q_h}(\phi^\star_h, \hat\phi_h)^{1/2}\, w^{t\star}_h \big\|^2
\;\ge\; 0.9\, N_f \big\| D_{q_h}(\phi^\star_h, \hat\phi_h)^{1/2}\, W^\star_h \big\|_F^2 \\
&\overset{(ii)}{\ge} 0.9\, N_f \big\| D_{q_h}(\phi^\star_h, \hat\phi_h)^{1/2} \big\|_F^2\, \sigma_d^2(W^\star_h)
\;\ge\; 0.9\, N_f \big\| D_{q_h}(\phi^\star_h, \hat\phi_h) \big\|\, \sigma_d^2(W^\star_h),
\end{aligned}$$
• The two POMDP papers mentioned above study the reward-known setting, whereas our focus here is on the reward-free setting.
• Although an optimistic MLE-based approach is used in both settings, the designs of the exploration policy in the two settings are different. Our algorithm identifies which estimated model is used for the exploration policy design and then calculates the exploration policy based on bonus terms designed for the value function. In contrast, the POMDP papers construct a confidence set of the true model and solve an optimization problem within this generic model set. Such an oracle may not be easy to realize computationally.
• Due to the hardness of POMDPs, the MLE approach there only guarantees that the estimated distribution over trajectories is close to the true one. In contrast, for the low-rank MDPs we study here, we show that the estimation error of the transition probabilities is controlled at each time step.
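Step (ii) of the chain above rests on the matrix inequality $\|D^{1/2}W\|_F^2 \ge \sigma_d^2(W)\,\|D^{1/2}\|_F^2$, which holds because $\mathrm{tr}(W^\top D W) = \mathrm{tr}(D\,WW^\top) \ge \lambda_{\min}(WW^\top)\,\mathrm{tr}(D)$. A toy NumPy check, with random illustrative stand-ins for the PSD matrix $D_{q_h}$ and the stacked weight matrix $W$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 4, 10                      # T >= d so that sigma_d(W) > 0

# D stands in for the PSD matrix D_{q_h}; W stacks the weight vectors
# w^t as its columns (both are random toy objects, for illustration).
A = rng.normal(size=(d, d))
D = A @ A.T
W = rng.normal(size=(d, T))

# Symmetric PSD square root of D via the eigendecomposition.
evals, evecs = np.linalg.eigh(D)
D_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

sigma_d = np.linalg.svd(W, compute_uv=False)[-1]   # smallest singular value
lhs = np.linalg.norm(D_half @ W, 'fro') ** 2
rhs = sigma_d ** 2 * np.linalg.norm(D_half, 'fro') ** 2
assert lhs >= rhs - 1e-9
```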

F AUXILIARY LEMMAS

Recall that $f^{(n)}_h(s,a) = \|\hat P^{(n)}_h(\cdot|s,a) - P^\star_h(\cdot|s,a)\|_{TV}$ represents the estimation error at the $n$-th iteration and step $h$, given state $s$ and action $a$, in terms of the total variation distance. By Theorem 21 in Agarwal et al. (2020), we can guarantee that, under all exploration policies, the estimation error is bounded with high probability.

Lemma 8 (MLE guarantee). Given $\delta \in (0,1)$, the following inequality holds for all $n$ and all $h \ge 2$ with probability at least $1 - \delta/2$:
$$\sum_{\tau=0}^{n-1} \mathbb{E}_{\substack{s_{h-1} \sim (P^\star,\pi_\tau),\ (a_{h-1},a_h) \sim U(\mathcal{A}),\\ s_h \sim P^\star(\cdot|s_{h-1},a_{h-1})}} \big[ f^{(n)}_h(s_h, a_h)^2 \big] \le n\zeta_n, \qquad \text{where } \zeta_n := \frac{\log(2|\Phi||\Psi| n H/\delta)}{n}.$$
In addition, for $h = 1$, $\sum_{\tau=0}^{n-1} \mathbb{E}_{a_1 \sim U(\mathcal{A})} \big[ f^{(n)}_1(s_1, a_1)^2 \big] \le n\zeta_n$.

Dividing both sides of the result of Lemma 8 by $n$ and defining $\Pi_n = U(\pi_1, \ldots, \pi_{n-1})$, we obtain the following corollary, which will be used intensively in the analysis.

Corollary 2. Given $\delta \in (0,1)$, the following inequality holds for all $n$ and all $h \ge 2$ with probability at least $1 - \delta/2$:
$$\mathbb{E}_{\substack{s_{h-1} \sim (P^\star,\Pi_n),\ (a_{h-1},a_h) \sim U(\mathcal{A}),\\ s_h \sim P^\star(\cdot|s_{h-1},a_{h-1})}} \big[ f^{(n)}_h(s_h, a_h)^2 \big] \le \zeta_n.$$
In addition, for $h = 1$, $\mathbb{E}_{a_1 \sim U(\mathcal{A})} \big[ f^{(n)}_1(s_1, a_1)^2 \big] \le \zeta_n$.

If we choose any subset of the set $\{X_n M^{-1}_{n-1}\}_{n=1}^N$, we can still get a sublinear summation.

The following lemma (Dann et al., 2017) is useful for measuring the difference between two value functions under two MDPs and two reward functions. We write $P_h V_{h+1}(s_h, a_h) = \mathbb{E}_{s \sim P_h(\cdot|s_h, a_h)}[V_{h+1}(s)]$ for shorthand.

Lemma 9 (Simulation lemma). Suppose $P_1$ and $P_2$ are the transition kernels of two MDPs and $r_1$, $r_2$ are the corresponding reward functions. Given a policy $\pi$, we have



$\bigcup_{i\in[D],\, j\in[2^{i-1}]} \{s_{i,j}\}$ form a binary tree, where $s_{i,j}$ denotes the $j$-th branch node of layer $i$.



2: Initialize $\pi_0(\cdot|s)$ to be uniform; set $\mathcal{D}^0_h = \emptyset$.
3: Phase I: Exploration Phase
4: for $n = 1, \ldots$ do
5:

which is an upper bound on the value function difference between the true model and the estimated model (see Proposition 5 in Appendix A.3). Proposition 2 (Informal). Under the same setting as Proposition 1, with high probability, the summation of the value functions $V^{\pi_n}_{\hat P^{(n)}, \hat b^{(n)}}$ under the exploration policies $\{\pi_n\}_{n\in[N]}$ with exploration-driven reward functions $\hat b^{(n)}$ is sublinear, as given by $\sum_{n\in[N]}$

the exploration-driven reward b

Figure 1: Hard MDP instances.

(2020) and Lemma 10 in Uehara et al. (2021)).

Lemma 10 (Elliptical potential lemma). Consider a sequence of $d \times d$ positive semidefinite matrices $X_1, \ldots, X_N$ with $\mathrm{tr}(X_n) \le 1$ for all $n \in [N]$. Define $M_0 = \lambda_0 I$ and $M_n = M_{n-1} + X_n$. Then
$$\sum_{n=1}^{N} \mathrm{tr}\big( X_n M_{n-1}^{-1} \big) \le 2\log\det(M_N) - 2\log\det(M_0) \le 2d \log\Big( 1 + \frac{N}{d\lambda_0} \Big).$$
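Both inequalities of the elliptical potential lemma can be verified numerically. The sketch below uses random rank-one increments of illustrative sizes with $\lambda_0 = 1$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, lam0 = 5, 200, 1.0          # illustrative sizes

M = lam0 * np.eye(d)
log_det_M0 = np.linalg.slogdet(M)[1]

total = 0.0
for _ in range(N):
    v = rng.normal(size=d)
    X = np.outer(v, v)
    X /= max(np.trace(X), 1.0)    # enforce tr(X_n) <= 1
    total += np.trace(X @ np.linalg.inv(M))
    M += X                        # M_n = M_{n-1} + X_n

log_det_MN = np.linalg.slogdet(M)[1]
# First inequality of the lemma:
assert total <= 2 * (log_det_MN - log_det_M0) + 1e-9
# Second inequality of the lemma:
assert 2 * (log_det_MN - log_det_M0) <= 2 * d * np.log(1 + N / (d * lam0)) + 1e-9
```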

further improved the sample complexity, and Zhang et al. (2020) established the minimax optimality guarantee. Reward-free RL was also studied with function approximation. Wang et al. (2020) studied linear MDPs, and Zhang et al. (2021b) studied linear mixture MDPs. Further, Zanette et al. (2020b) considered a class of MDPs with low inherent Bellman error introduced by Zanette et al. (2020a). Chen et al. (2022b) proposed a reward-free algorithm called RFOLIVE under non-linear MDPs with low Bellman Eluder dimension. In addition, Miryoosefi & Jin (

$V^{\hat\pi_\epsilon}_{\hat P_\epsilon, \hat b_\epsilon} + 2\sqrt{K\zeta_{n_\epsilon}} \le \epsilon$. Then, following Lemma 4, we have
$$\big\| \hat P_h(\cdot\,|\,s_h, a_h) - P^\star_h(\cdot\,|\,s_h, a_h) \big\|_{TV} \le 2\sqrt{K\zeta_{n_\epsilon}} + 2 V^{\hat\pi_\epsilon}_{\hat P_\epsilon, \hat b_\epsilon} \le \epsilon.$$
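The quantity bounded above is a total variation distance between transition kernels; for discrete next-state distributions it equals half the $\ell_1$ distance. A minimal illustration with toy numbers:

```python
# Total variation distance between two discrete next-state distributions,
# i.e. || P_hat(. | s, a) - P_star(. | s, a) ||_TV for a fixed (s, a).
# The probability values below are toy numbers for illustration.
p_star = [0.5, 0.3, 0.2]
p_hat = [0.4, 0.4, 0.2]

tv = 0.5 * sum(abs(p - q) for p, q in zip(p_hat, p_star))
assert abs(tv - 0.1) < 1e-12
```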

• Let $\mathcal{S} = \{s_w, s_o, s_g, s_b\} \cup \bigcup_{i\in[D],\, j\in[2^{i-1}]} \{s_{i,j}\}$, where $s_w$, $s_o$, $s_g$ and $s_b$ denote the 'waiting state', 'outlier state', 'good state', and 'bad state', respectively. The states in

$r(s_h, a_h) \le \epsilon$.

Algorithm 2 RepLearn: Representation Learning in Planning Phase of RAFFLE
1: Input: sample size $N_f$, state-action pair distributions $\{q_h\}_{h=1}^{H}$, specially designed reward functions $\{r_{h,t}\}_{h\in[H], t\in[T]}$ and policies $\{\pi_t\}_{t\in[T]}$, estimated transition kernel $\hat P$ from the output of Algorithm 1, and model class $\Phi$.

Comparison among provably efficient RL algorithms under low-rank MDPs.

where (i) follows from Lemma B.1 in Du et al. (2021b) together with $N_f$ being sufficiently large, and (ii) follows from $\|AB\|_F \ge \sigma_{\min}(B)\,\|A\|_F$.

E MORE DISCUSSION ON RELATED WORK

In this section, we provide further discussion of related work.

E.1 LOW-RANK MDPS IN EXTENDED RL SETTINGS

Many studies have been developed on various extended low-rank models. Wang et al. (2022); Uehara et al. (2022a) studied partially observable Markov decision processes (POMDPs) with latent low-rank structure. Zhan et al. (2022) studied the predictive state representation model, and applied their results to POMDPs with latent low-rank structure. Cheng et al. (2022); Agarwal et al. (2022) studied the benefits of multitask representation learning under low-rank MDPs. Huang et al. (2022) proposed a general safe RL framework and instantiated it to low-rank MDPs. Ren et al. (2022) studied reward-known RL under low-rank MDPs and proposed a spectral method to replace the computation oracle. We note that even given those further developments, our results on the lower bound and representation learning are still completely new, and our algorithm design and sample complexity result remain the best known for standard reward-free low-rank MDPs.

E.2 DISCUSSION ON OPTIMISTIC MLE-BASED APPROACH FOR DIFFERENT SETTINGS

$$\begin{aligned}
V^{\pi}_{h,P_1,r_1}(s_h) - V^{\pi}_{h,P_2,r_2}(s_h)
&= \sum_{h'=h}^{H} \mathbb{E}_{(s_{h'},a_{h'}) \sim (P_2,\pi)} \Big[ r_1(s_{h'},a_{h'}) - r_2(s_{h'},a_{h'}) + (P_{1,h'} - P_{2,h'}) V^{\pi}_{h'+1,P_1,r_1}(s_{h'},a_{h'}) \,\Big|\, s_h \Big] \\
&= \sum_{h'=h}^{H} \mathbb{E}_{(s_{h'},a_{h'}) \sim (P_1,\pi)} \Big[ r_1(s_{h'},a_{h'}) - r_2(s_{h'},a_{h'}) + (P_{1,h'} - P_{2,h'}) V^{\pi}_{h'+1,P_2,r_2}(s_{h'},a_{h'}) \,\Big|\, s_h \Big].
\end{aligned}$$
The following lemma is a standard inequality in the regret analysis for linear models in reinforcement learning (see Lemma G.2 in Agarwal et al.
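The simulation lemma can be checked on a small tabular MDP. The sketch below (random toy models; all sizes illustrative) computes both value functions by backward induction and verifies the second decomposition, with the expectation taken along trajectories of $(P_1, \pi)$:

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, H = 3, 2, 4                 # toy sizes (illustrative)

def random_kernel():
    P = rng.random((H, S, A, S))
    return P / P.sum(axis=-1, keepdims=True)

P1, P2 = random_kernel(), random_kernel()
r1, r2 = rng.random((H, S, A)), rng.random((H, S, A))
pi = rng.random((H, S, A))
pi /= pi.sum(axis=-1, keepdims=True)

def values(P, r):
    """V^pi_{h,P,r} by backward induction; V[H] = 0."""
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        Q = r[h] + P[h] @ V[h + 1]            # shape (S, A)
        V[h] = (pi[h] * Q).sum(axis=-1)
    return V

V1, V2 = values(P1, r1), values(P2, r2)

s0 = 0
lhs = V1[0, s0] - V2[0, s0]

# RHS of the simulation lemma: roll the state distribution forward under
# (P1, pi) and accumulate reward gaps plus the model-difference term on V2.
d = np.zeros(S); d[s0] = 1.0
rhs = 0.0
for h in range(H):
    gap = r1[h] - r2[h] + (P1[h] - P2[h]) @ V2[h + 1]   # shape (S, A)
    rhs += (d[:, None] * pi[h] * gap).sum()
    d = np.einsum('s,sa,sat->t', d, pi[h], P1[h])       # next-state dist.

assert np.isclose(lhs, rhs)
```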


For the waiting state $s_w$: $\phi_h(s_w, a_w) = (1, 0, 0_{S-5}, 0, 0, 0)$ and $\phi_h(s_w, a) = (0, 1, 0_{S-5}, 0, 0, 0)$ for $a \ne a_w$; $\mu_h(s_w) = (\mathbb{1}_{h \le H}, 0, 0_{S-5}, 0, 0, 0)$ and $\mu_h(s_{1,1}) = (\mathbb{1}_{h > H}, 1, 0_{S-5}, 0, 0, 0)$.

For branch nodes $s_{i,j}$ with $i < D$: $\phi_h(s_{i,j}, a_\omega) = (0, 0, e_{i+1, 2j+\omega-2}, 0, 0, 0)$ for $\omega = 1, 2$, and $\phi_h(s_{i,j}, a) = (0, 0, 0_{S-5}, 1, 0, 0)$ for $a \ne a_1, a_2$; $\mu_h(s_{k,\ell}) = (0, 0, e_{k,\ell}, 0, 0, 0)$ for $1 < k \le D$, and $\mu_h(s_o) = (0, 0, 0_{S-5}, 1, 0, 0)$,

where $0_{S-5} \in \mathbb{R}^{S-5}$ denotes the $(S-5)$-dimensional all-zero vector and $e_{i,j} \in \mathbb{R}^{S-5}$ denotes the one-hot vector that is zero everywhere except the coordinate corresponding to $s_{i,j}$. Specifically, the only difference of $M_{(h^\star,\ell^\star,a^\star)}$ from $M_0$ is that the transition probability from the leaf state $s_{D,\ell^\star}$ under action $a^\star$ to the good state $s_g$ increases by $\epsilon_0$, from $\frac{1}{2}$ to $\frac{1}{2} + \epsilon_0$, while the probability to the bad state $s_b$ decreases to $\frac{1}{2} - \epsilon_0$, where $\epsilon_0$ will be specified later. We note that the features of $M_{(h^\star,\ell^\star,a^\star)}$ are the same as those of $M_0$ except that $\phi_{h^\star}(s_{D,\ell^\star}, a^\star) = (0, 0, 0_{S-5}, 0, \frac{1}{2} + \epsilon_0, \frac{1}{2} - \epsilon_0)$. We remark here that the cardinality $K$ of $\mathcal{A}$ can be arbitrarily large, so that the resulting lower bound holds in both the $d \le K$ and $d \ge K$ regimes. In addition, although in our hard instances $d = S$, it is straightforward to generalize to the regime with $S > d$ by replacing the outlier state $s_o$ with a set of outlier states $\mathcal{S}_o$.

Definition of reward: the reward can only be attained at two special states at the last stage $H$, the good state $s_g$ and the outlier state $s_o$, i.e., $r_h(s, a) = \mathbb{1}_{\{s = s_g, h = H\}} + \frac{1}{2}\,\mathbb{1}_{\{s = s_o, h = H\}}$, and $r_h(s, a)$ still belongs to $[0, 1]$.
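As a quick illustration of the low-rank factorization $P_h(s' \,|\, s, a) = \langle \phi_h(s, a), \mu_h(s') \rangle$ at the perturbed leaf: with the padding coordinates collapsed and $\mu(s_g)$, $\mu(s_b)$ taken as one-hot on the last two coordinates (assumed coordinates, for illustration only), the perturbed feature yields a valid next-state distribution that shifts $\epsilon_0$ mass to the good state:

```python
# Perturbed leaf feature from the construction above: the last two
# coordinates carry the good-/bad-state probabilities. Padding coordinates
# are collapsed, and mu(s_g), mu(s_b) are assumed one-hot (illustrative).
eps0 = 0.05
phi_leaf = [0.0, 0.0, 0.0, 0.5 + eps0, 0.5 - eps0]
mu_good = [0.0, 0.0, 0.0, 1.0, 0.0]
mu_bad = [0.0, 0.0, 0.0, 0.0, 1.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

p_good, p_bad = dot(phi_leaf, mu_good), dot(phi_leaf, mu_bad)

assert abs(p_good - (0.5 + eps0)) < 1e-12   # epsilon_0 more mass on s_g
assert abs(p_bad - (0.5 - eps0)) < 1e-12
assert abs(p_good + p_bad - 1.0) < 1e-12    # a valid next-state distribution
```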

C.2 STEP 2: ANALYSIS OF HARD MDP INSTANCES

Proof of Theorem 1. Let $\epsilon_0 = 2\epsilon$. For any MDP $M_{(h^\star,\ell^\star,a^\star)}$, the optimal policy is to take action $a_w$ to stay at state $s_w$ until stage $h^\star - D$, and then take the corresponding actions to reach the unique state $s_{D,\ell^\star}$ at stage $h^\star$. At state $s_{D,\ell^\star}$, the agent takes the unique optimal action $a^\star$. The optimal value function is $V^\star_{M_{(h^\star,\ell^\star,a^\star)}} = \frac{1}{2} + \epsilon_0$, and the value function of the output policy of Alg is given by Equation (21), where $P^{\hat\pi_\tau}_{M_{(h^\star,\ell^\star,a^\star)}}$ denotes the probability distribution over state-action pairs $(s_h, a_h)$ induced by the Markov policy $\hat\pi_\tau$ in the MDP $M_{(h^\star,\ell^\star,a^\star)}$. We remark that the reward of the outlier state is specially designed to be $1/2$ at stage $H$ so that Equation (21) also holds for policies $\hat\pi_\tau$ that fall into the outlier state. Hence, … and … The transitions of all MDPs coincide once the leaf states are reached. We define the event
$$\varepsilon^{\tau}_{(h^\star,\ell^\star,a^\star)} = \Big\{ P^{\hat\pi_\tau}_{M_{(h^\star,\ell^\star,a^\star)}} \big[ s_{h^\star} = s_{D,\ell^\star},\ a_{h^\star} = a^\star \big] \ge \tfrac{1}{2} \Big\}.$$
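The role of $\epsilon_0$ can be seen through a standard two-point testing intuition (a sketch of the idea, not the paper's exact argument): the perturbed and base instances differ only in a Bernoulli$(1/2)$ versus Bernoulli$(1/2 + \epsilon_0)$ transition at one leaf, and the KL divergence between these two distributions is $O(\epsilon_0^2)$, which is what forces the $1/\epsilon^2$ dependence in the sample complexity:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

eps0 = 0.05                       # illustrative perturbation size
kl = kl_bernoulli(0.5 + eps0, 0.5)

# Chi-square domination gives KL(p || q) <= chi^2(p || q) = 4 * eps0^2 here,
# so distinguishing the two instances requires Omega(1 / eps0^2) observations
# of the perturbed transition.
assert 0 < kl <= 4 * eps0 ** 2
```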

