PARTIALLY OBSERVABLE RL WITH B-STABILITY: UNIFIED STRUCTURAL CONDITION AND SHARP SAMPLE-EFFICIENT ALGORITHMS

Abstract

Partial observability, where agents can only observe partial information about the true underlying state of the system, is ubiquitous in real-world applications of Reinforcement Learning (RL). Theoretically, learning a near-optimal policy under partial observability is known to be hard in the worst case due to an exponential sample complexity lower bound. Recent work has identified several tractable subclasses that are learnable with polynomial samples, such as Partially Observable Markov Decision Processes (POMDPs) with certain revealing or decodability conditions. However, this line of research is still in its infancy, where (1) unified structural conditions enabling sample-efficient learning are lacking; (2) existing sample complexities for known tractable subclasses are far from sharp; and (3) fewer sample-efficient algorithms are available than in fully observable RL. This paper advances all three aspects above for partially observable RL in the general setting of Predictive State Representations (PSRs). First, we propose a natural and unified structural condition for PSRs called B-stability. B-stable PSRs encompass the vast majority of known tractable subclasses, such as weakly revealing POMDPs, low-rank future-sufficient POMDPs, decodable POMDPs, and regular PSRs. Next, we show that any B-stable PSR can be learned with polynomial samples in relevant problem parameters. When instantiated in the aforementioned subclasses, our sample complexities improve substantially over the current best ones. Finally, our results are achieved by three algorithms simultaneously: Optimistic Maximum Likelihood Estimation, Estimation-to-Decisions, and Model-Based Optimistic Posterior Sampling. The latter two algorithms are new for sample-efficient learning of POMDPs/PSRs.
We additionally design a variant of the Estimation-to-Decisions algorithm to perform sample-efficient all-policy model estimation for B-stable PSRs, which also yields guarantees for reward-free learning as an implication.

1. INTRODUCTION

Partially Observable Reinforcement Learning (RL), where agents can only observe partial information about the true underlying state of the system, is ubiquitous in real-world applications of RL such as robotics (Akkaya et al., 2019), strategic games (Brown & Sandholm, 2018; Vinyals et al., 2019; Berner et al., 2019), economic simulation (Zheng et al., 2020), and so on. Partially observable RL defies the standard efficient approaches for learning and planning in the fully observable case (e.g. those based on dynamic programming) due to the non-Markovian nature of the observations (Jaakkola et al., 1994), and has been a hard challenge for RL research. Theoretically, it is well established that learning in partially observable RL is statistically hard in the worst case: in the standard setting of Partially Observable Markov Decision Processes (POMDPs), learning a near-optimal policy has a sample complexity lower bound that is exponential in the horizon length (Mossel & Roch, 2005; Krishnamurthy et al., 2016), in stark contrast to fully observable MDPs, where polynomial sample complexity is possible (Kearns & Singh, 2002; Jaksch et al., 2010; Azar et al., 2017).

Table 1: Comparisons of sample complexities for learning an $\varepsilon$ near-optimal policy in POMDPs and PSRs. Definitions of the problem parameters can be found in Section 3.2. The last three rows refer to the m-step versions of the problem classes (e.g. the third row considers m-step $\alpha_{\rm rev}$-revealing POMDPs). The current best results within the last four rows are due to Zhan et al. (2022); Liu et al. (2022a); Wang et al. (2022); Efroni et al. (2022).
A later line of work identifies various additional structural conditions or alternative learning goals that enable sample-efficient learning, such as reactiveness (Jiang et al., 2017), revealing conditions (Jin et al., 2020a; Liu et al., 2022c; Cai et al., 2022; Wang et al., 2022), decodability (Du et al., 2019; Efroni et al., 2022), and learning memoryless or short-memory policies (Azizzadenesheli et al., 2018; Uehara et al., 2022b). Despite this progress, research on sample-efficient partially observable RL is still at an early stage, with several important questions remaining open. First, to a large extent, existing tractable structural conditions are mostly identified and analyzed in a case-by-case manner and lack a more unified understanding. This question has only started to be tackled in the very recent work of Zhan et al. (2022), who show that sample-efficient learning is possible in the more general setting of Predictive State Representations (PSRs) (Littman & Sutton, 2001), which include POMDPs as a special case, under a certain regularity condition. However, their regularity condition is defined in terms of additional quantities (such as "core matrices") not directly encoded in the definition of PSRs, which makes it unnatural in many known examples and unable to subsume important tractable problems such as decodable POMDPs. Second, even in known sample-efficient problems such as revealing POMDPs (Jin et al., 2020c; Liu et al., 2022a), existing sample complexities involve large polynomial factors of relevant problem parameters that are likely far from sharp. Third, relatively few principles are known for designing sample-efficient algorithms in POMDPs/PSRs, such as spectral or tensor-based approaches (Hsu et al., 2012; Azizzadenesheli et al., 2016; Jin et al., 2020c), maximum likelihood or density estimation (Liu et al., 2022a; Wang et al., 2022; Zhan et al., 2022), and learning short-memory policies (Efroni et al., 2022; Uehara et al., 2022b).
This contrasts with fully observable RL, where the space of sample-efficient algorithms is much more diverse (Agarwal et al., 2019). It is an important question whether we can expand the space of algorithms for partially observable RL.

This paper advances all three aspects above for partially observable RL. We define B-stability, a natural and general structural condition for PSRs, and design sharp algorithms for learning any B-stable PSR sample-efficiently. Our contributions can be summarized as follows.

• We identify a new structural condition for PSRs termed B-stability, which simply requires the B-representation (or observable operators) to be bounded in a suitable operator norm (Section 3.1). B-stable PSRs subsume most known tractable subclasses, such as revealing POMDPs, decodable POMDPs, low-rank future-sufficient POMDPs, and regular PSRs (Section 3.2).

• We show that B-stable PSRs can be learned sample-efficiently by three algorithms simultaneously, with sharp sample complexities (Section 4): Optimistic Maximum Likelihood Estimation (OMLE), Explorative Estimation-to-Decisions (EXPLORATIVE E2D), and Model-Based Optimistic Posterior Sampling (MOPS). To our best knowledge, the latter two algorithms are here first shown to be sample-efficient in partially observable RL.

• Our sample complexities improve substantially over the current best when instantiated in both regular PSRs (Section 4.1) and known tractable subclasses of POMDPs (Section 5). For example, for m-step $\alpha_{\rm rev}$-revealing POMDPs with $S$ latent states, our algorithms find an $\varepsilon$ near-optimal policy within $\tilde O\big(S^2 A^m \log \mathcal{N} / (\alpha_{\rm rev}^2 \varepsilon^2)\big)$ episodes of play (with $S^2/\alpha_{\rm rev}^2$ replaced by $S\Lambda_{\mathbf B}^2$ if measured in B-stability), which improves significantly over the current best result of $\tilde O\big(S^4 A^{6m-4} \log \mathcal{N} / (\alpha_{\rm rev}^4 \varepsilon^2)\big)$. A summary of such comparisons is presented in Table 1.
• As a variant of the E2D algorithm, we design the ALL-POLICY MODEL-ESTIMATION E2D algorithm, which achieves sample-efficient all-policy model estimation, and as an application, reward-free learning, for B-stable PSRs (Section 4.2 & Appendix H.2).

• Technically, our three algorithms rely on a unified sharp analysis of B-stable PSRs that involves a careful error decomposition in terms of the B-representation, along with a new generalized $\ell_2$-type Eluder argument, which may be of independent interest (Appendix B).

Related work: Our work is closely related to the long lines of work on sample-efficient learning in fully/partially observable RL (with/without function approximation), especially the lines of work on POMDPs and PSRs. We review these related works in Appendix A due to space limits.

2. PRELIMINARIES

Sequential decision processes with observations: An episodic sequential decision process is specified by a tuple $\{H, \mathcal{O}, \mathcal{A}, \mathbb{P}, \{r_h\}_{h=1}^H\}$, where $H \in \mathbb{Z}_{\ge 1}$ is the horizon length; $\mathcal{O}$ is the observation space with $|\mathcal{O}| = O$; $\mathcal{A}$ is the action space with $|\mathcal{A}| = A$; $\mathbb{P}$ specifies the transition dynamics, such that the initial observation follows $o_1 \sim \mathbb{P}_0(\cdot) \in \Delta(\mathcal{O})$, and given the history $\tau_h := (o_1, a_1, \dots, o_h, a_h)$ up to step $h$, the next observation follows $o_{h+1} \sim \mathbb{P}(\cdot\,|\,\tau_h)$; and $r_h : \mathcal{O}\times\mathcal{A} \to [0,1]$ is the reward function at the $h$-th step, which we assume to be a known deterministic function of $(o_h, a_h)$. A policy $\pi = \{\pi_h : (\mathcal{O}\times\mathcal{A})^{h-1}\times\mathcal{O} \to \Delta(\mathcal{A})\}_{h=1}^H$ is a collection of $H$ functions. At step $h \in [H]$, an agent running policy $\pi$ observes $o_h$ and takes action $a_h \sim \pi_h(\cdot\,|\,\tau_{h-1}, o_h)$ based on the history $(\tau_{h-1}, o_h) = (o_1, a_1, \dots, o_{h-1}, a_{h-1}, o_h)$. The agent then receives the reward $r_h(o_h, a_h)$, and the environment generates the next observation $o_{h+1} \sim \mathbb{P}(\cdot\,|\,\tau_h)$. The episode terminates immediately after the dummy observation $o_{H+1} = o_{\rm dum}$ is generated. We use $\Pi$ to denote the set of all deterministic policies, and identify $\Delta(\Pi)$ as both the set of all policies and the set of all distributions over deterministic policies interchangeably. For any $(h, \tau_h)$, let $\mathbb{P}(\tau_h) := \prod_{h' \le h} \mathbb{P}(o_{h'}|\tau_{h'-1})$ and $\pi(\tau_h) := \prod_{h' \le h} \pi_{h'}(a_{h'}|\tau_{h'-1}, o_{h'})$, and let $\mathbb{P}^\pi(\tau_h) := \mathbb{P}(\tau_h) \times \pi(\tau_h)$ denote the probability of observing $\tau_h$ (for the first $h$ steps) when executing $\pi$. The value of a policy $\pi$ is defined as the expected cumulative reward $V(\pi) := \mathbb{E}^\pi[\sum_{h=1}^H r_h(o_h, a_h)]$. We assume $\sum_{h=1}^H r_h(o_h, a_h) \le 1$ almost surely under any policy $\pi$.

POMDPs: A Partially Observable Markov Decision Process (POMDP) is a special sequential decision process whose transition dynamics are governed by latent states.
An episodic POMDP is specified by a tuple $\{H, \mathcal{S}, \mathcal{O}, \mathcal{A}, \{\mathbb{T}_h\}_{h=1}^H, \{\mathbb{O}_h\}_{h=1}^H, \{r_h\}_{h=1}^H, \mu_1\}$, where $\mathcal{S}$ is the latent state space with $|\mathcal{S}| = S$; $\mathbb{O}_h(\cdot\,|\,\cdot) : \mathcal{S} \to \Delta(\mathcal{O})$ is the emission dynamics at step $h$ (which we identify with an emission matrix $\mathbb{O}_h \in \mathbb{R}^{O\times S}$); $\mathbb{T}_h(\cdot\,|\,\cdot,\cdot) : \mathcal{S}\times\mathcal{A} \to \Delta(\mathcal{S})$ is the transition dynamics over the latent states (which we identify with transition matrices $\mathbb{T}_h(\cdot\,|\,\cdot, a) \in \mathbb{R}^{S\times S}$ for each $a \in \mathcal{A}$); and $\mu_1 \in \Delta(\mathcal{S})$ specifies the distribution of the initial state. At each step $h$, given the latent state $s_h$ (which the agent cannot observe), the system emits observation $o_h \sim \mathbb{O}_h(\cdot\,|\,s_h)$, receives action $a_h \in \mathcal{A}$ from the agent, emits the reward $r_h(o_h, a_h)$, and then transitions to the next latent state $s_{h+1} \sim \mathbb{T}_h(\cdot\,|\,s_h, a_h)$ in a Markov fashion. Note that a POMDP is fully described by the parameter $\theta := (\mathbb{T}, \mathbb{O}, \mu_1)$.
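The generative process just described can be sketched in a few lines of code. The following is a minimal illustration with randomly drawn (hypothetical) dynamics and dimensions, not taken from the paper; for simplicity the example policy is memoryless, whereas policies in the paper may depend on the full history.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: S latent states, O observations, A actions, horizon H.
S, O, A, H = 3, 4, 2, 5

mu1 = np.full(S, 1.0 / S)                       # initial state distribution mu_1
T = rng.dirichlet(np.ones(S), size=(H, S, A))   # T[h, s, a] is a distribution over next states
Obs = rng.dirichlet(np.ones(O), size=(H, S))    # Obs[h, s] is a distribution over observations

def rollout(policy):
    """Sample one episode; `policy` maps (h, o_h) to an action (memoryless for simplicity)."""
    traj = []
    s = rng.choice(S, p=mu1)
    for h in range(H):
        o = rng.choice(O, p=Obs[h, s])          # emission o_h ~ O_h(. | s_h)
        a = policy(h, o)
        traj.append((o, a))
        s = rng.choice(S, p=T[h, s, a])         # transition s_{h+1} ~ T_h(. | s_h, a_h)
    return traj

traj = rollout(lambda h, o: 0)                  # roll out the policy that always plays action 0
```

The agent only ever sees the `(o, a)` pairs in `traj`; the latent state `s` stays internal to the simulator, which is exactly what makes the problem partially observable.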

2.1. PREDICTIVE STATE REPRESENTATIONS

We consider Predictive State Representations (PSRs) (Littman & Sutton, 2001), a broader class of sequential decision processes that generalizes POMDPs by removing the explicit assumption of latent states, while still requiring the system dynamics to be described succinctly by a core test set.

PSRs, core test sets, and predictive states: A test $t$ is a sequence of future observations and actions (i.e. $t \in \mathcal{T} := \bigcup_{W\in\mathbb{Z}_{\ge 1}} \mathcal{O}^W \times \mathcal{A}^{W-1}$). For a test $t_h = (o_{h:h+W-1}, a_{h:h+W-2})$ with length $W \ge 1$, we define the probability of test $t_h$ being successful conditioned on a (reachable) history $\tau_{h-1}$ as $\mathbb{P}(t_h|\tau_{h-1}) := \mathbb{P}(o_{h:h+W-1}|\tau_{h-1}; \mathrm{do}(a_{h:h+W-2}))$, i.e., the probability of observing $o_{h:h+W-1}$ if the agent deterministically executes actions $a_{h:h+W-2}$, conditioned on the history $\tau_{h-1}$. We follow the convention that if $\mathbb{P}^\pi(\tau_{h-1}) = 0$ for every $\pi$, then $\mathbb{P}(t|\tau_{h-1}) = 0$.

Definition 1 (PSR, core test sets, and predictive states). For any $h \in [H]$, we say a set $\mathcal{U}_h \subset \mathcal{T}$ is a core test set at step $h$ if the following holds: for any $W \in \mathbb{Z}_{\ge 1}$ and any possible future (i.e., test) $t_h = (o_{h:h+W-1}, a_{h:h+W-2}) \in \mathcal{O}^W\times\mathcal{A}^{W-1}$, there exists a vector $b_{t_h,h} \in \mathbb{R}^{\mathcal{U}_h}$ such that

$\mathbb{P}(t_h|\tau_{h-1}) = \langle b_{t_h,h}, [\mathbb{P}(t|\tau_{h-1})]_{t\in\mathcal{U}_h} \rangle, \quad \forall \tau_{h-1} \in \mathcal{T}_{h-1} := (\mathcal{O}\times\mathcal{A})^{h-1}.$ (1)

We refer to the vector $q(\tau_{h-1}) := [\mathbb{P}(t|\tau_{h-1})]_{t\in\mathcal{U}_h}$ as the predictive state at step $h$ (with the convention $q(\tau_{h-1}) = 0$ if $\tau_{h-1}$ is not reachable), and $q_0 := [\mathbb{P}(t)]_{t\in\mathcal{U}_1}$ as the initial predictive state. A (linear) PSR is a sequential decision process equipped with a core test set $\{\mathcal{U}_h\}_{h\in[H]}$.

The predictive state $q(\tau_{h-1}) \in \mathbb{R}^{\mathcal{U}_h}$ in a PSR acts like a "latent state" that governs the transition $\mathbb{P}(\cdot\,|\,\tau_{h-1})$ through the linear structure (1). We define $\mathcal{U}_{A,h} := \{\mathbf{a} : (\mathbf{o},\mathbf{a}) \in \mathcal{U}_h \text{ for some } \mathbf{o} \in \bigcup_{W\in\mathbb{N}^+}\mathcal{O}^W\}$ as the set of action sequences (possibly including the empty sequence) in $\mathcal{U}_h$, with $U_A := \max_{h\in[H]} |\mathcal{U}_{A,h}|$. Further define $\mathcal{U}_{H+1} := \{o_{\rm dum}\}$ for notational simplicity.

Throughout the paper, we assume the core test sets $(\mathcal{U}_h)_{h\in[H]}$ are known and shared across the PSR model class.
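The linear relation (1) can be verified numerically in a toy case. Below is an illustrative sketch (with hypothetical, hand-picked dynamics) for a time-homogeneous POMDP with $S = O = 2$ and an invertible emission matrix, where the single next observation forms a core test set: any longer test's probability is then a fixed linear function of the predictive state, as Definition 1 requires.

```python
import numpy as np

rng = np.random.default_rng(1)
S = 2  # with O = S = 2 the emission matrix can be square and invertible

# Hypothetical dynamics (illustrative, not from the paper).
Oh = np.array([[0.8, 0.3],
               [0.2, 0.7]])               # Oh[o, s] = P(o | s), invertible
Ta = rng.dirichlet(np.ones(S), size=S).T  # Ta[s', s] = P(s' | s, a) for one fixed action a

# With core tests U_h = {single next observation}, the predictive state is
# q(tau) = Oh @ belief(tau).  Take the two-step test t = (o, a, o2):
#   P(t | tau) = sum_{s, s'} Oh[o2, s'] Ta[s', s] Oh[o, s] belief(tau)[s] = c @ belief(tau),
# so the weight vector of Definition 1 is b_t = Oh^{-T} c.
o, o2 = 0, 1
c = (Oh[o2, :] @ Ta) * Oh[o, :]           # c[s] = P(observe o then o2 under action a | s)
b_t = np.linalg.inv(Oh).T @ c

for _ in range(5):
    belief = rng.dirichlet(np.ones(S))    # belief over s_h induced by some history tau
    q = Oh @ belief                       # predictive state [P(t' | tau)]_{t' in U_h}
    assert np.isclose(b_t @ q, c @ belief)  # <b_t, q(tau)> recovers P(t | tau)
```

The same weight vector `b_t` works for every history, which is the defining feature of a core test set: the predictive state alone is a sufficient statistic for all future test probabilities.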

B-representation

We define the B-representation of a PSR, a standard notion for PSRs (also known as the observable operators (Jaeger, 2000)).

Definition 2 (B-representation). A B-representation of a PSR with core test sets $(\mathcal{U}_h)_{h\in[H]}$ is a set of matrices $\{(\mathbf{B}_h(o_h, a_h) \in \mathbb{R}^{\mathcal{U}_{h+1}\times\mathcal{U}_h})_{h, o_h, a_h},\ q_0 \in \mathbb{R}^{\mathcal{U}_1}\}$ such that for any $0 \le h \le H$, policy $\pi$, history $\tau_h = (o_{1:h}, a_{1:h}) \in \mathcal{T}_h$, and core test $t_{h+1} = (o_{h+1:h+W}, a_{h+1:h+W-1}) \in \mathcal{U}_{h+1}$, the quantity $\mathbb{P}(\tau_h, t_{h+1})$, i.e. the probability of observing $o_{1:h+W}$ upon taking actions $a_{1:h+W-1}$, admits the decomposition

$\mathbb{P}(\tau_h, t_{h+1}) = \mathbb{P}(o_{1:h+W}|\mathrm{do}(a_{1:h+W-1})) = e_{t_{h+1}}^\top \cdot \mathbf{B}_{h:1}(\tau_h) \cdot q_0,$

where $e_{t_{h+1}} \in \mathbb{R}^{\mathcal{U}_{h+1}}$ is the indicator vector of $t_{h+1} \in \mathcal{U}_{h+1}$, and $\mathbf{B}_{h:1}(\tau_h) := \mathbf{B}_h(o_h, a_h)\mathbf{B}_{h-1}(o_{h-1}, a_{h-1})\cdots\mathbf{B}_1(o_1, a_1)$.

It is a standard result (see e.g. Thon & Jaeger (2015)) that any PSR admits a B-representation, and the converse also holds: any sequential decision process admitting a B-representation on test sets $(\mathcal{U}_h)_{h\in[H]}$ is a PSR with core test sets $(\mathcal{U}_h)_{h\in[H]}$ (Proposition D.1). However, the B-representation of a given PSR may not be unique. We also remark that the B-representation is used only in the structural conditions and theoretical analyses, and will not be explicitly used in our algorithms.

Rank: An important complexity measure of a PSR is its PSR rank (henceforth also "rank").

Definition 3 (PSR rank). Given a PSR, its PSR rank is defined as $d_{\rm PSR} := \max_{h\in[H]} \mathrm{rank}(\mathbb{D}_h)$, where $\mathbb{D}_h := [q(\tau_h)]_{\tau_h\in\mathcal{T}_h} \in \mathbb{R}^{\mathcal{U}_{h+1}\times\mathcal{T}_h}$ is the matrix formed by the predictive states at step $h \in [H]$.

The PSR rank measures the inherent dimension of the space of predictive state vectors; it always admits the upper bound $d_{\rm PSR} \le \max_{h\in[H]} |\mathcal{U}_h|$, but may be much smaller.

POMDPs as low-rank PSRs: As a primary example, all POMDPs are PSRs with rank at most $S$ (Zhan et al., 2022, Lemma 2). First, Definition 1 can be satisfied trivially by choosing $\mathcal{U}_h = \bigcup_{1\le W\le H-h+1}\{(o_h, a_h, \dots, o_{h+W-1})\}$ as the set of all possible tests, with $b_{t_h,h} = e_{t_h} \in \mathbb{R}^{\mathcal{U}_h}$ as indicator vectors; for concrete subclasses of POMDPs, we will consider alternative choices of $(\mathcal{U}_h)_{h\in[H]}$ with much smaller cardinalities than this default choice. Second, to compute the rank (Definition 3), note that by the latent state structure of POMDPs, we have $\mathbb{P}(t_{h+1}|\tau_h) = \sum_{s_{h+1}} \mathbb{P}(t_{h+1}|s_{h+1}) \mathbb{P}(s_{h+1}|\tau_h)$ for any $(h, \tau_h, t_{h+1})$. Therefore, the associated matrix $\mathbb{D}_h = [\mathbb{P}(t_{h+1}|\tau_h)]_{(t_{h+1},\tau_h)\in\mathcal{U}_{h+1}\times\mathcal{T}_h}$ always admits the decomposition

$\mathbb{D}_h = [\mathbb{P}(t_{h+1}|s_{h+1})]_{(t_{h+1},s_{h+1})\in\mathcal{U}_{h+1}\times\mathcal{S}} \times [\mathbb{P}(s_{h+1}|\tau_h)]_{(s_{h+1},\tau_h)\in\mathcal{S}\times\mathcal{T}_h},$

which implies that $d_{\rm PSR} = \max_{h\in[H]} \mathrm{rank}(\mathbb{D}_h) \le S$.

Learning goal: We consider the standard PAC learning setting, where we are given a model class of PSRs $\Theta$ and interact with a ground-truth model $\theta^\star \in \Theta$. Note that, as we do not place further restrictions on the parametrization, this setting allows general function approximation for the model class. For any model class $\Theta$, we define its (optimistic) covering number $\mathcal{N}_\Theta(\rho)$ for $\rho > 0$ in Definition C.4. Let $V_\theta(\pi)$ denote the value function of policy $\pi$ under model $\theta$, and $\pi_\theta := \arg\max_{\pi\in\Pi} V_\theta(\pi)$ denote the optimal policy of model $\theta$. The goal is to learn a policy $\hat\pi$ that achieves small suboptimality $V^\star - V_{\theta^\star}(\hat\pi)$ within as few episodes of play as possible, where $V^\star := V_{\theta^\star}(\pi_{\theta^\star})$. We refer to an algorithm as sample-efficient if it finds an $\varepsilon$ near-optimal policy within $\mathrm{poly}(\text{relevant problem parameters}, 1/\varepsilon)$ episodes of play.
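The rank bound $d_{\rm PSR} \le S$ follows directly from the two-factor decomposition of $\mathbb{D}_h$ above, and is easy to check numerically. The sketch below uses hypothetical sizes (a finite stand-in for the set of histories $\mathcal{T}_h$) purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
S, num_tests, num_hists = 3, 8, 20   # hypothetical sizes; num_hists stands in for |T_h|

# D_h = [P(t | s')] @ [P(s' | tau)]: the right factor's columns are latent-state
# distributions induced by histories; the left factor's rows are test success
# probabilities conditioned on the latent state.
P_t_given_s = rng.random((num_tests, S))                     # [P(t_{h+1} | s_{h+1})]
P_s_given_tau = rng.dirichlet(np.ones(S), size=num_hists).T  # [P(s_{h+1} | tau_h)]

D_h = P_t_given_s @ P_s_given_tau
rank = np.linalg.matrix_rank(D_h)    # at most S, regardless of num_tests / num_hists
```

However many tests and histories we add, the rank of `D_h` can never exceed the inner dimension $S$ of the factorization, which is exactly the statement $d_{\rm PSR} \le S$ for POMDPs.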

3. PSRS WITH B-STABILITY

We begin by proposing a natural and general structural condition for PSRs called B-stability (or simply stability). We show that B-stable PSRs encompass and generalize a variety of existing tractable POMDP and PSR classes, and can be learned sample-efficiently, as we show in the sequel.

3.1. THE B-STABILITY CONDITION

For any PSR with an associated B-representation, we define its B-operators $\{\mathbf{B}_{H:h}\}_{h\in[H]}$ as

$\mathbf{B}_{H:h} : \mathbb{R}^{\mathcal{U}_h} \to \mathbb{R}^{(\mathcal{O}\times\mathcal{A})^{H-h+1}}, \quad q \mapsto [\mathbf{B}_{H:h}(\tau_{h:H}) \cdot q]_{\tau_{h:H}\in(\mathcal{O}\times\mathcal{A})^{H-h+1}}.$

The operator $\mathbf{B}_{H:h}$ maps any predictive state $q = q(\tau_{h-1})$ at step $h$ to the vector $\mathbf{B}_{H:h} q = (\mathbb{P}(\tau_{h:H}|\tau_{h-1}))_{\tau_{h:H}}$, which governs the probabilities of transitioning to all possible futures, by properties of the B-representation (cf. (17) & Corollary D.2). For each $h \in [H]$, we equip the image space of $\mathbf{B}_{H:h}$ with the $\Pi$-norm: for a vector $b$ indexed by $\tau_{h:H} \in (\mathcal{O}\times\mathcal{A})^{H-h+1}$, we define

$\|b\|_\Pi := \max_\pi \sum_{\tau_{h:H}\in(\mathcal{O}\times\mathcal{A})^{H-h+1}} \pi(\tau_{h:H})\,|b(\tau_{h:H})|,$

where the maximization is over all policies $\pi$ starting from step $h$ (ignoring the history $\tau_{h-1}$), and $\pi(\tau_{h:H}) = \prod_{h\le h'\le H} \pi_{h'}(a_{h'}|o_{h'}, \tau_{h:h'-1})$. We further equip the domain $\mathbb{R}^{\mathcal{U}_h}$ with a fused norm $\|\cdot\|_\star$, defined as the maximum of the $(1,2)$-norm and the $\Pi'$-norm:

$\|q\|_\star := \max\{\|q\|_{1,2}, \|q\|_{\Pi'}\},$

$\|q\|_{1,2} := \Big(\textstyle\sum_{\mathbf{a}\in\mathcal{U}_{A,h}} \big(\sum_{\mathbf{o}:(\mathbf{o},\mathbf{a})\in\mathcal{U}_h} |q(\mathbf{o},\mathbf{a})|\big)^2\Big)^{1/2}, \quad \|q\|_{\Pi'} := \max_\pi \textstyle\sum_{t\in\bar{\mathcal{U}}_h} \pi(t)\,|q(t)|,$

where $\bar{\mathcal{U}}_h := \{t \in \mathcal{U}_h : \nexists\, t' \in \mathcal{U}_h \text{ such that } t \text{ is a prefix of } t'\}$. We now define the B-stability condition, which simply requires the B-operators $\{\mathbf{B}_{H:h}\}_{h\in[H]}$ to have bounded operator norms from the fused norm to the $\Pi$-norm.

Definition 4 (B-stability). A PSR is $\Lambda_{\mathbf B}$-stable if it admits a B-representation such that for all $h \in [H]$ and all $q \in \mathbb{R}^{\mathcal{U}_h}$ with $\|q\|_\star \le 1$, we have $\|\mathbf{B}_{H:h}\, q\|_\Pi \le \Lambda_{\mathbf B}$.

When using the B-stability condition, we will often take $q = q^1(\tau_{h-1}) - q^2(\tau_{h-1})$ to be the difference between two predictive states at step $h$. Intuitively, Definition 4 requires the propagated $\Pi$-norm error $\|\mathbf{B}_{H:h}(q^1 - q^2)\|_\Pi$ to be controlled by the original fused-norm error $\|q^1 - q^2\|_\star$. Despite its seemingly involved form, the fused norm $\|\cdot\|_\star$ is equivalent to the vector 1-norm up to a $|\mathcal{U}_{A,h}|^{1/2}$ factor: we have $\|q\|_\star \le \|q\|_1 \le |\mathcal{U}_{A,h}|^{1/2} \|q\|_\star$ (Lemma D.6), and thus even the relaxed condition $\max_{\|q\|_1 = 1} \|\mathbf{B}_{H:h}\, q\|_\Pi \le \Lambda$ enables sample-efficient learning of PSRs. However, we adopt the fused norm in order to obtain the sharpest possible sample complexity guarantees.
Finally, all of our theoretical results continue to hold under a more relaxed (though less intuitive) weak B-stability condition (Definition D.4), with the same sample complexity guarantees (see also the additional discussion in Appendix D.2).
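The $\Pi$-norm has a simple closed form in the one-step special case, which may help build intuition. At the last step $h = H$, a future $\tau_{H:H}$ is a single $(o, a)$ pair, so the inner maximization over policies decouples across observations: the best policy simply picks, for each observation, the action with the largest $|b(o,a)|$. The sketch below (with hypothetical sizes) checks this against a brute-force maximization over all deterministic one-step policies.

```python
from itertools import product

import numpy as np

rng = np.random.default_rng(3)
O, A = 4, 3  # hypothetical observation / action counts

# At step h = H, ||b||_Pi = max_pi sum_{o,a} pi(a|o) |b(o,a)|; since pi(.|o)
# can be chosen independently per observation, the maximum is attained by a
# deterministic policy picking argmax_a |b(o, a)| for each o.
b = rng.standard_normal((O, A))
pi_norm = np.abs(b).max(axis=1).sum()

# Brute force over all A^O deterministic policies pi: O -> A to confirm.
brute = max(
    sum(abs(b[o, pi[o]]) for o in range(O))
    for pi in product(range(A), repeat=O)
)
```

For multi-step futures the maximization no longer decouples so cleanly (the policy at step $h'$ can depend on $\tau_{h:h'-1}$), but the same greedy intuition carries over step by step.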

3.2. RELATION WITH KNOWN SAMPLE-EFFICIENT SUBCLASSES

We show that the B-stability condition encompasses many known structural conditions on PSRs and POMDPs that enable sample-efficient learning. Throughout, for a matrix $A \in \mathbb{R}^{m\times n}$, we define its operator norm $\|A\|_{p\to q} := \max_{\|x\|_p \le 1} \|Ax\|_q$, and write $\|A\|_p := \|A\|_{p\to p}$ for shorthand.

Weakly revealing POMDPs (Jin et al., 2020a; Liu et al., 2022a) are a subclass of POMDPs in which the current latent state can be probabilistically inferred from the next $m$ emissions.

Example 5 (Multi-step weakly revealing POMDPs). A POMDP is called m-step $\alpha_{\rm rev}$-weakly revealing (henceforth also "$\alpha_{\rm rev}$-revealing") with $\alpha_{\rm rev} \le 1$ if $\max_{h\in[H-m+1]} \|\mathbb{M}_h^\dagger\|_{2\to 2} \le \alpha_{\rm rev}^{-1}$, where for $h \in [H-m+1]$, $\mathbb{M}_h \in \mathbb{R}^{O^m A^{m-1}\times S}$ is the m-step emission-action matrix at step $h$, defined as

$[\mathbb{M}_h]_{(\mathbf{o},\mathbf{a}),s} := \mathbb{P}(o_{h:h+m-1} = \mathbf{o}\,|\,s_h = s, a_{h:h+m-2} = \mathbf{a}), \quad \forall (\mathbf{o},\mathbf{a}) \in \mathcal{O}^m\times\mathcal{A}^{m-1},\ s \in \mathcal{S}.$ (7)

We show that any m-step $\alpha_{\rm rev}$-weakly revealing POMDP is a $\Lambda_{\mathbf B}$-stable PSR with core test sets $\mathcal{U}_h = (\mathcal{O}\times\mathcal{A})^{\min\{m-1, H-h\}}\times\mathcal{O}$ and $\Lambda_{\mathbf B} \le \sqrt{S}\,\alpha_{\rm rev}^{-1}$ (Proposition D.7). A similar result holds for the $\ell_1$ variant of the revealing condition (see Appendix D.3.1). ♢

When the transition matrices $\mathbb{T}_h$ of the POMDP have low-rank structure, Wang et al. (2022) show that a subspace-aware generalization of the $\ell_1$-revealing condition, the future-sufficiency condition, enables sample-efficient learning of POMDPs with possibly enormous state/observation spaces (see also Cai et al. (2022)). We consider the following generalized definition of future-sufficiency.

Example 6 (Low-rank future-sufficient POMDPs). We say a POMDP has transition rank $d_{\rm trans}$ if for each $h \in [H-1]$ the transition kernel has rank at most $d_{\rm trans}$ (i.e. $\max_h \mathrm{rank}(\mathbb{T}_h) \le d_{\rm trans}$). It is clear that POMDPs with transition rank $d_{\rm trans}$ have PSR rank $d_{\rm PSR} \le d_{\rm trans}$. A transition rank-$d_{\rm trans}$ (henceforth rank-$d_{\rm trans}$) POMDP is called m-step $\nu$-future-sufficient with $\nu \ge 1$ if for each $h \in [H-1]$ there exists $\mathbb{M}_h^{+} \in \mathbb{R}^{S\times\mathcal{U}_h}$ such that $\mathbb{M}_h^{+}\mathbb{M}_h\mathbb{T}_{h-1} = \mathbb{T}_{h-1}$ and $\|\mathbb{M}_h^{+}\|_{1\to 1} \le \nu$, where $\mathbb{M}_h$ is the m-step emission-action matrix defined in (7). We show that any m-step $\nu$-future-sufficient rank-$d_{\rm trans}$ POMDP is a B-stable PSR with core test sets $\mathcal{U}_h = (\mathcal{O}\times\mathcal{A})^{\min\{m-1, H-h\}}\times\mathcal{O}$, $d_{\rm PSR} \le d_{\rm trans}$, and $\Lambda_{\mathbf B} \le \sqrt{A^{m-1}}\,\nu$ (Proposition D.12). ♢

Decodable POMDPs (Efroni et al., 2022), a multi-step generalization of Block MDPs (Du et al., 2019), assume the current latent state can be perfectly decoded from the most recent $m$ observations.

Example 7 (Multi-step decodable POMDPs). A POMDP is called m-step decodable if there exist (unknown) decoders $\phi^\star = \{\phi_h^\star\}_{h\in[H]}$ such that for every reachable trajectory $(s_1, o_1, a_1, \dots, s_h, o_h)$ we have $s_h = \phi_h^\star(z_h)$, where $z_h = (o_{m(h)}, a_{m(h)}, \dots, o_h)$ and $m(h) = \max\{h - m + 1, 1\}$. We show that any m-step decodable POMDP is a B-stable PSR with core test sets $\mathcal{U}_h = (\mathcal{O}\times\mathcal{A})^{\min\{m-1, H-h\}}\times\mathcal{O}$ and $\Lambda_{\mathbf B} = 1$ (Proposition D.17). ♢

Finally, Zhan et al. (2022) define the following regularity condition for general PSRs.

Example 8 (Regular PSRs). A PSR is called $\alpha_{\rm psr}$-regular if for all $h \in [H]$ there exists a core matrix $\mathbb{K}_h \in \mathbb{R}^{\mathcal{U}_{h+1}\times\mathrm{rank}(\mathbb{D}_h)}$, a column-wise sub-matrix of $\mathbb{D}_h$ with $\mathrm{rank}(\mathbb{K}_h) = \mathrm{rank}(\mathbb{D}_h)$, such that $\max_{h\in[H]} \|\mathbb{K}_h^\dagger\|_{1\to 1} \le \alpha_{\rm psr}^{-1}$. We show that any $\alpha_{\rm psr}$-regular PSR is $\Lambda_{\mathbf B}$-stable with $\Lambda_{\mathbf B} \le \sqrt{U_A}\,\alpha_{\rm psr}^{-1}$ (Proposition D.18). ♢

We emphasize that B-stability not only encompasses $\alpha_{\rm psr}$-regularity but is strictly more expressive. For example, decodable POMDPs are not $\alpha_{\rm psr}$-regular without additional assumptions on $\mathbb{K}_h^\dagger$ (Zhan et al., 2022, Section 6.5), whereas they are B-stable with $\Lambda_{\mathbf B} = 1$ (Example 7). Also, any $\alpha_{\rm rev}$-revealing POMDP is $\alpha_{\rm psr}$-regular for some $\alpha_{\rm psr}^{-1} < \infty$, but $\alpha_{\rm psr}^{-1}$ is potentially not polynomially bounded by $\alpha_{\rm rev}^{-1}$ (and other problem parameters) due to the restriction that $\mathbb{K}_h$ be a column-wise sub-matrix of $\mathbb{D}_h$; by contrast, it is B-stable with $\Lambda_{\mathbf B} \le \sqrt{S}\,\alpha_{\rm rev}^{-1}$ (Example 5).
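The revealing condition of Example 5 is easy to check numerically in the single-step case $m = 1$, where $\mathbb{M}_h$ is just the emission matrix $\mathbb{O}_h$ and $\|\mathbb{M}_h^\dagger\|_{2\to 2} = 1/\sigma_{\min}(\mathbb{M}_h)$ whenever $\mathbb{M}_h$ has full column rank. The sketch below uses a randomly drawn (hypothetical) emission matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
S, O = 3, 5  # hypothetical sizes, with O > S so full column rank is generic

# For m = 1 the emission-action matrix M_h is the emission matrix O_h in R^{O x S}.
# The POMDP is alpha_rev-revealing at step h iff ||M_h^dagger||_{2->2} <= 1/alpha_rev,
# i.e. alpha_rev <= sigma_min(M_h) (assuming full column rank).
M_h = rng.dirichlet(np.ones(O), size=S).T     # columns are emission distributions
sigma_min = np.linalg.svd(M_h, compute_uv=False)[-1]
alpha_rev = sigma_min                         # largest valid revealing constant

pinv_norm = np.linalg.norm(np.linalg.pinv(M_h), ord=2)  # equals 1 / sigma_min here
```

A tiny `sigma_min` means two latent states produce nearly identical emission distributions, so many samples are needed to tell them apart, which is exactly why the sample complexities in Table 1 scale with $\alpha_{\rm rev}^{-2}$.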

4. LEARNING B-STABLE PSRS

In this section, we show that B-stable PSRs can be learned sample-efficiently, achieved by three model-based algorithms simultaneously. We instantiate our results to POMDPs in Section 5.

Algorithm 1: OPTIMISTIC MAXIMUM LIKELIHOOD ESTIMATION (OMLE)
1: Input: model class $\Theta$, confidence parameter $\beta > 0$.
2: Initialize: $\Theta^1 = \Theta$, $\mathcal{D} = \emptyset$.
3: for iteration $k = 1, \dots, K$ do
4:   Set $(\theta^k, \pi^k) = \arg\max_{\theta\in\Theta^k, \pi} V_\theta(\pi)$.
5:   for $h = 0, \dots, H-1$ do
6:     Set exploration policy $\pi^k_{h,\mathrm{exp}} := \pi^k \circ_h \mathrm{Unif}(\mathcal{A}) \circ_{h+1} \mathrm{Unif}(\mathcal{U}_{A,h+1})$.
7:     Execute $\pi^k_{h,\mathrm{exp}}$ to collect a trajectory $\tau^{k,h}$, and add $(\pi^k_{h,\mathrm{exp}}, \tau^{k,h})$ into $\mathcal{D}$.
8:   Update confidence set $\Theta^{k+1} = \big\{\widehat\theta \in \Theta : \sum_{(\pi,\tau)\in\mathcal{D}} \log \mathbb{P}^\pi_{\widehat\theta}(\tau) \ge \max_{\theta\in\Theta} \sum_{(\pi,\tau)\in\mathcal{D}} \log \mathbb{P}^\pi_\theta(\tau) - \beta\big\}$.
Output: $\widehat\pi_{\rm out} := \mathrm{Unif}(\{\pi^k\}_{k\in[K]})$.

4.1. OPTIMISTIC MAXIMUM LIKELIHOOD ESTIMATION (OMLE)

The OMLE algorithm was proposed by Liu et al. (2022a) for learning revealing POMDPs and adapted by Zhan et al. (2022) for learning regular PSRs, achieving polynomial sample complexity (in relevant problem parameters) in both cases. We show that OMLE works under the broader condition of B-stability, with significantly improved sample complexities.

Algorithm and theoretical guarantee: The OMLE algorithm (described in Algorithm 1) takes in a class of PSRs $\Theta$ and performs two main steps in each iteration $k \in [K]$:

1. (Optimism) Construct a confidence set $\Theta^k \subseteq \Theta$, which is a superlevel set of the log-likelihood of all trajectories within dataset $\mathcal{D}$ (Line 8). The policy $\pi^k$ is then chosen as the greedy policy with respect to the most optimistic model within $\Theta^k$ (Line 4).

2. (Data collection) Execute exploration policies $(\pi^k_{h,\mathrm{exp}})_{0\le h\le H-1}$, where each $\pi^k_{h,\mathrm{exp}}$ is defined via the $\circ_h$ notation as follows: follow $\pi^k$ for the first $h-1$ steps, take a uniform action $\mathrm{Unif}(\mathcal{A})$ at step $h$, take an action sequence sampled from $\mathrm{Unif}(\mathcal{U}_{A,h+1})$ starting at step $h+1$, and behave arbitrarily afterwards (Line 6). All collected trajectories are then added into $\mathcal{D}$ (Line 7).

Intuitively, the concatenation of the current policy $\pi^k$ with $\mathrm{Unif}(\mathcal{A})$ and $\mathrm{Unif}(\mathcal{U}_{A,h+1})$ in Step 2 above is designed according to the structure of PSRs to foster exploration.

Theorem 9 (Guarantee of OMLE). Suppose every $\theta \in \Theta$ is $\Lambda_{\mathbf B}$-stable (Definition 4) and the true model $\theta^\star \in \Theta$ has PSR rank $d_{\rm PSR} \le d$. Then, choosing $\beta = C\log(\mathcal{N}_\Theta(1/KH)/\delta)$ for some absolute constant $C > 0$, with probability at least $1 - \delta$, Algorithm 1 outputs a policy $\widehat\pi_{\rm out} \in \Delta(\Pi)$ such that $V^\star - V_{\theta^\star}(\widehat\pi_{\rm out}) \le \varepsilon$, as long as the number of episodes

$T = KH \ge O\big(d A U_A H^2 \log(\mathcal{N}_\Theta(1/T)/\delta)\,\iota \cdot \Lambda_{\mathbf B}^2/\varepsilon^2\big),$

where $\iota := \log(1 + K d U_A \Lambda_{\mathbf B} R_{\mathbf B})$, with $R_{\mathbf B} := \max_h \max\{1, \max_{\|v\|_1 = 1} \sum_{o,a} \|\mathbf{B}_h(o,a) v\|_1\}$.
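The optimism plus log-likelihood confidence set structure of Algorithm 1 can be illustrated on a much simpler problem. The following is a minimal sketch on a hypothetical one-step problem (H = 1, a finite model class of observation distributions per action, reward equal to the observation index rescaled to [0, 1]); it mirrors Lines 4 and 6-8 of Algorithm 1, but omits the full PSR machinery and the multi-step exploration policies.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical finite model class: theta[a, o] = P(o | a); the true model is models[0].
A_num, O_num = 2, 3
models = [rng.dirichlet(np.ones(O_num), size=A_num) for _ in range(20)]
theta_star = models[0]
r = np.arange(O_num) / (O_num - 1)           # rewards in [0, 1]

def value(theta, a):                         # V_theta(pi_a) for the policy playing action a
    return theta[a] @ r

beta, K = 3.0, 200
D = []                                       # dataset of (action, observation) pairs
conf = list(range(len(models)))              # indices of models in the confidence set

for k in range(K):
    # Line 4: optimistic planning over the confidence set.
    a_k = max((value(models[i], a), a) for i in conf for a in range(A_num))[1]
    # Lines 6-7: explore (a uniform action, since H = 1) and record the outcome.
    a_exp = rng.integers(A_num)
    o = rng.choice(O_num, p=theta_star[a_exp])
    D.append((a_exp, o))
    # Line 8: log-likelihood superlevel set.
    ll = np.array([sum(np.log(m[a, o]) for a, o in D) for m in models])
    conf = [i for i in range(len(models)) if ll[i] >= ll.max() - beta]
```

As data accumulates, models that assign low likelihood to the observed outcomes drop out of `conf`, and the optimistic greedy action `a_k` converges toward the truly optimal one; this is the same mechanism that Theorem 9 quantifies for B-stable PSRs.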
Theorem 9 shows that OMLE is sample-efficient for any B-stable PSR, a broader class than in existing results for the same algorithm (Liu et al., 2022a; Zhan et al., 2022), with much sharper sample complexities than existing work when instantiated in their settings. Importantly, we achieve the first polynomial sample complexity that scales with a $\Lambda_{\mathbf B}^2$ dependence on the B-stability parameter (or regularity parameters alike). Instantiating to $\alpha_{\rm psr}$-regular PSRs, using $\Lambda_{\mathbf B} \le \sqrt{U_A}\,\alpha_{\rm psr}^{-1}$ (Example 8), [...]

Here $D^2_H(\mathbb{P}^\pi_\theta, \mathbb{P}^\pi_{\bar\theta}) := \sum_{\tau_H} \big(\mathbb{P}^\pi_\theta(\tau_H)^{1/2} - \mathbb{P}^\pi_{\bar\theta}(\tau_H)^{1/2}\big)^2$ denotes the squared Hellinger distance between $\mathbb{P}^\pi_\theta$ and $\mathbb{P}^\pi_{\bar\theta}$. Intuitively, the EDEC measures the optimal trade-off on model class $\Theta$ between gaining information via an "exploration policy" $\pi \sim p_{\rm exp}$ and achieving near-optimality via an "output policy" $\pi \sim p_{\rm out}$. Chen et al. (2022) further design the EXPLORATIVE E2D algorithm, a general model-based RL algorithm with sample complexity scaling with the EDEC. We sketch EXPLORATIVE E2D for a PSR class $\Theta$ as follows (full description in Algorithm 2): in each episode $t \in [T]$, we maintain a distribution $\mu^t \in \Delta(\Theta_0)$ over an optimistic cover $(\tilde{\mathbb{P}}, \Theta_0)$ of $\Theta$ with radius $1/T$ (cf. Definition C.4), which we use to compute two policy distributions $(p^t_{\rm exp}, p^t_{\rm out})$ by minimizing the following risk:

$(p^t_{\rm out}, p^t_{\rm exp}) = \arg\min_{(p_{\rm out}, p_{\rm exp}) \in \Delta(\Pi)^2} \sup_{\theta\in\Theta} \Big\{ \mathbb{E}_{\pi\sim p_{\rm out}}[V_\theta(\pi_\theta) - V_\theta(\pi)] - \gamma\, \mathbb{E}_{\pi\sim p_{\rm exp}} \mathbb{E}_{\theta^t\sim\mu^t}\big[D^2_H(\mathbb{P}^\pi_\theta, \mathbb{P}^\pi_{\theta^t})\big] \Big\}.$

Then, we sample policy $\pi^t \sim p^t_{\rm exp}$, execute $\pi^t$ to collect a trajectory $\tau^t$, and update the model distribution $\mu^t$ using a Tempered Aggregation scheme, which performs a Hedge update with initialization $\mu^1 = \mathrm{Unif}(\Theta_0)$, the log-likelihood loss with $\tilde{\mathbb{P}}^{\pi^t}_\theta(\cdot)$ denoting the optimistic likelihood associated with model $\theta \in \Theta_0$ and policy $\pi^t$ (cf. Definition C.4), and learning rate $\eta \le 1/2$:

$\mu^{t+1}(\theta) \propto_\theta \mu^t(\theta) \cdot \exp\big(\eta \log \tilde{\mathbb{P}}^{\pi^t}_\theta(\tau^t)\big).$
After $T$ episodes, we output the average policy $\widehat\pi_{\rm out} := \frac{1}{T}\sum_{t=1}^T p^t_{\rm out}$.

Theoretical guarantee: We provide a sharp bound on the EDEC for B-stable PSRs, which implies that EXPLORATIVE E2D can also learn them sample-efficiently.

Theorem 10 (Bound on EDEC & guarantee of EXPLORATIVE E2D). Suppose $\Theta$ is a PSR class with the same core test sets $\{\mathcal{U}_h\}_{h\in[H]}$, and each $\theta \in \Theta$ admits a B-representation that is $\Lambda_{\mathbf B}$-stable and has PSR rank at most $d$. Then we have $\mathrm{edec}_\gamma(\Theta) \le O(d A U_A \Lambda_{\mathbf B}^2 H^2 / \gamma)$. As a corollary, with probability at least $1 - \delta$, Algorithm 2 outputs a policy $\widehat\pi_{\rm out} \in \Delta(\Pi)$ such that $V^\star - V_{\theta^\star}(\widehat\pi_{\rm out}) \le \varepsilon$, as long as the number of episodes $T \ge O\big(d A U_A \Lambda_{\mathbf B}^2 H^2 \log(\mathcal{N}_\Theta(1/T)/\delta)/\varepsilon^2\big)$. (9)

The sample complexity (9) matches that of OMLE (Theorem 9) and has a slight advantage in avoiding the log factor $\iota$ therein. In return, the $d$ in Theorem 10 needs to upper bound the PSR rank of all models in $\Theta$, whereas the $d$ in Theorem 9 only needs to upper bound the rank of the true model $\theta^\star$. We also remark that EXPLORATIVE E2D explicitly requires an optimistic cover of $\Theta$ as an input to the algorithm, which may be another disadvantage compared to OMLE (which uses optimistic covering only implicitly, in the analysis). The proof of Theorem 10 (in Appendix I.2) relies on largely the same key steps as the analysis of the OMLE algorithm (overview in Appendix B).

[...] the only difference is that their covering number $\mathcal{N}_{\mathcal{G}}$ is for the value class while $\mathcal{N}_\Theta$ is for the model class. However, this difference is nontrivial if the model class admits a much smaller covering number than the value class required for a concrete problem. For example, for tabular decodable POMDPs, using $d_{\rm trans} \le S$ and $\log\mathcal{N}_\Theta \le \tilde O(H(S^2 A + SO))$, we achieve the first $\tilde O(A^m\,\mathrm{poly}(H, S, O, A)/\varepsilon^2)$ sample complexity, which resolves an open question of Efroni et al. (2022). Besides the above, our results can be further instantiated to latent MDPs (Kwon et al., 2021; a special case of revealing POMDPs) and linear POMDPs (Cai et al., 2022), improving over existing results; we present these instantiations in Appendix D.3.2 & D.3.4.
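The two computational primitives of EXPLORATIVE E2D, the squared Hellinger distance and the Tempered Aggregation update, are simple to implement. The sketch below runs them on a hypothetical finite model class where each model is a distribution over a handful of trajectories; the full algorithm would additionally solve the min-max risk above to choose exploration policies.

```python
import numpy as np

def hellinger_sq(p, q):
    """Squared Hellinger distance D_H^2(p, q) = sum_tau (sqrt p(tau) - sqrt q(tau))^2."""
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def tempered_update(mu, likelihoods, eta=0.5):
    """One Tempered Aggregation step: mu(theta) <- mu(theta) * exp(eta * log P_theta(tau))."""
    mu_new = mu * likelihoods ** eta
    return mu_new / mu_new.sum()

rng = np.random.default_rng(6)
# Hypothetical finite model class: each theta is a distribution over 4 trajectories.
Theta0 = rng.dirichlet(np.ones(4), size=5)
theta_star = Theta0[2]

mu = np.full(len(Theta0), 1.0 / len(Theta0))   # mu^1 = Unif(Theta_0)
for _ in range(300):
    tau = rng.choice(4, p=theta_star)          # trajectory sampled from the true model
    mu = tempered_update(mu, Theta0[:, tau])   # Hedge update with log-likelihood loss

dists = [hellinger_sq(theta_star, th) for th in Theta0]
```

The tempered exponent $\eta \le 1/2$ is what distinguishes this update from exact Bayesian posterior updating ($\eta = 1$); the posterior mass should concentrate on models close to $\theta^\star$ in Hellinger distance as episodes accumulate.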

6. CONCLUSION

This paper proposes B-stability, a new structural condition for PSRs that encompasses most of the known tractable partially observable RL problems, and designs algorithms for learning B-stable PSRs with sharp sample complexities. We believe our work opens up many interesting questions, such as the computational efficiency of our algorithms, alternative (e.g. model-free) approaches for learning B-stable PSRs, and extensions to multi-agent settings.

REFERENCES

Stephan Zheng, Alexander Trott, Sunil Srinivasa, Nikhil Naik, Melvin Gruesbeck, David C. Parkes, and Richard Socher. The AI Economist: Improving equality and productivity with AI-driven tax policies. arXiv preprint arXiv:2004.13332, 2020.

A RELATED WORK

Learning POMDPs: Due to the non-Markovian nature of observations, policies in POMDPs in general depend on the full history of observations, and are thus much harder to learn than in fully observable MDPs. It is well established that learning a near-optimal policy in POMDPs is indeed statistically hard in the worst case, due to a sample complexity lower bound that is exponential in the horizon (Mossel & Roch, 2005; Krishnamurthy et al., 2016); algorithms with matching (exponential) sample complexity upper bounds are developed in (Kearns et al., 1999; Even-Dar et al., 2005). Another line of work learns POMDPs assuming exploratory data or reachability assumptions, and thus does not address the challenge of exploration. For learning POMDPs in the online (exploration) setting, sample-efficient algorithms have been proposed under various structural conditions, including reactiveness (Jiang et al., 2017), revealing conditions (Jin et al., 2020a; Liu et al., 2022a;c), revealing (future/past-sufficiency) and low rank (Cai et al., 2022; Wang et al., 2022), decodability (Efroni et al., 2022), latent MDPs (Kwon et al., 2021), learning short-memory policies (Uehara et al., 2022b), and deterministic transitions (Uehara et al., 2022a). Our B-stability condition encompasses most of these structural conditions, through which we provide a unified analysis with significantly sharper sample complexities (cf. Sections 3 & 5). We further remark that for tabular revealing POMDPs, our sample complexities are minimax optimal in the accuracy $\varepsilon$ and the revealing constant, and have at most a small polynomial gap in the $S, O, A$ factors from the minimax optimal rate, due to the lower bounds established in the work of Chen et al. (2023) (see e.g. their Table 1) after the initial appearance of this work. On the computational side, planning in POMDPs is known to be PSPACE-complete (Papadimitriou & Tsitsiklis, 1987; Littman, 1994; Burago et al., 1996; Lusena et al., 2001). The recent work of Golowich et al.
(2022b;a) establishes the belief contraction property in revealing POMDPs, which leads to algorithms with quasi-polynomial statistical and computational efficiency. Uehara et al. (2022a) design computationally efficient algorithms under a deterministic latent transition assumption. We remark that computational efficiency is beyond the scope of this paper, but is an important direction for future work.

Extensive-Form Games. Extensive-Form Games with Imperfect Information (EFGs; Kuhn (1953)) are an alternative formulation of partial observability in sequential decision-making. EFGs can be formulated as Partially Observable Markov Games (the multi-agent version of POMDPs (Liu et al., 2022c)) with a tree structure. Learning EFGs from bandit feedback has also been studied in recent work.

Learning PSRs. Predictive State Representations were proposed as a general formulation of partially observable systems, following the idea of Observable Operator Models (Jaeger, 2000); POMDPs can be seen as a special case of PSRs (Littman & Sutton, 2001). Algorithms for learning PSRs have been designed assuming reachability or exploratory data, including spectral algorithms (Boots et al., 2011; Zhang et al., 2021; Jiang et al., 2018), supervised learning (Hefny et al., 2015), and others (Hamilton et al., 2014; Thon & Jaeger, 2015; Grinberg et al., 2018). Closely related to us, the very recent work of Zhan et al. (2022) develops the first sample-efficient algorithm for learning PSRs in the online setting under a regularity condition. Our work provides three algorithms with sharper sample complexities for learning PSRs, under the more general condition of B-stability. A concurrent work by Liu et al. (2022b) (released on the same day as this work) also identifies a general class of "well-conditioned" PSRs that can be learned sample-efficiently by the OMLE algorithm (Liu et al., 2022a).
Our B-stability condition encompasses and is slightly more relaxed than their condition (which consists of two parts): part one is similar to the operator norm requirement in B-stability with a different choice of input norm, and an additional second part is also required. Next, our sample complexity is much tighter than that of Liu et al. (2022b), both on general well-conditioned/B-stable PSRs and on the specific examples encompassed (such as revealing POMDPs). For example, for the general class of "$\gamma$-well-conditioned PSRs" considered in their work, our results imply an $\tilde{O}\big( d A U_A^2 H^2 \log N_\Theta / (\gamma^2 \varepsilon^2) \big)$ sample complexity, whereas their result scales as $\tilde{O}\big( d^2 A^5 U_A^3 H^4 \log N_\Theta / (\gamma^4 \varepsilon^2) \big)$ (extracted from their proofs, cf. Appendix D.4). This originates from several differences between our techniques. First, Liu et al. (2022b)'s analysis of the OMLE algorithm is based on an $\ell_1$-type operator error bound for PSRs combined with an $\ell_1$-Eluder argument, whereas our analysis is based on a new, stronger $\ell_2$-type operator error bound for PSRs (Proposition F.2) combined with a new generalized $\ell_2$-Eluder argument (Proposition E.1), which together yield a sharper rate. Besides, our $\ell_2$-Eluder argument also admits an in-expectation decoupling variant (Proposition E.6) that is necessary for bounding the EDEC (and hence the sample complexity of the EXPLORATIVE E2D algorithm) for B-stable PSRs; it is unclear whether their $\ell_1$-Eluder argument can give the same results. Another difference is that our performance decomposition and Eluder argument are carried out on a slightly different choice of vectors from Liu et al. (2022b), which is the main reason for our better $1/\gamma$ dependency (or $\Lambda_B$ dependency for B-stable PSRs); see Appendix B for a detailed overview of our techniques. Further, in terms of algorithms, Liu et al.
(2022b) only study the OMLE algorithm, whereas we additionally study two alternative algorithms, EXPLORATIVE E2D and MOPS, which enjoy similar guarantees (with minor differences) to OMLE. In summary, Liu et al. (2022b) do not overlap with our contributions (2) and (3) highlighted in our abstract. Finally, complementary to our work, Liu et al. (2022b) identify new concrete problems such as observable POMDPs with continuous observations, and develop new techniques to show that they fall into both of our general PSR frameworks, and are thus amenable to sample-efficient learning. In particular, their result implies that this class is contained in (an extension of) the low-rank future-sufficient POMDPs defined in Definition D.11, if we suitably extend the formulation in Definition D.11 to the continuous observation setting by replacing vectors with $L_1$-integrable functions and matrices with linear operators.

RL with function approximation. (Fully observable) RL with general function approximation has been extensively studied in a recent line of work (Jiang et al., 2017; Sun et al., 2019; Du et al., 2021; Jin et al., 2021; Foster et al., 2021; Agarwal & Zhang, 2022; Chen et al., 2022), where sample-efficient algorithms are constructed for problems admitting bounds on certain general complexity measures. While POMDPs/PSRs can be cast into their settings by treating the history $(\tau_{h-1}, o_h)$ as the state, prior to our work it was highly unclear whether any sample-efficient learning results could be deduced from their results, due to challenges in bounding the complexity measures (Liu et al., 2022a). Our work answers this positively by showing that the Decision-Estimation Coefficient (DEC; Foster et al. (2021)) for B-stable PSRs is bounded, using an explorative variant of the DEC defined by Chen et al.
(2022), thereby showing that their EXPLORATIVE E2D algorithm and the closely related MOPS algorithm (Agarwal & Zhang, 2022) are both sample-efficient for B-stable PSRs. Our work further corroborates, in the setting of partially observable RL, the connections between E2D, MOPS, and OMLE identified in (Chen et al., 2022).

B OVERVIEW OF TECHNIQUES

The proof of Theorem 9 consists of three main steps: a careful performance decomposition into certain B-errors, bounding the squared B-errors by squared Hellinger distances, and a generalized $\ell_2$-Eluder argument. The proof of (the EDEC bound in) Theorem 10 follows similar steps, except that the final Eluder argument is replaced by a decoupling argument (Proposition E.6).

Step 1: Performance decomposition. By the standard excess risk guarantee for MLE, our choice of $\beta = O(\log(N_\Theta(1/T)/\delta))$ guarantees with probability at least $1 - \delta$ that $\theta^\star \in \Theta^k$ for all $k \in [K]$ (Proposition G.2(a)). Thus, the greedy step (Line 4 in Algorithm 1) implies valid optimism: $V^\star \le V_{\theta^k}(\pi^k)$. We then perform an error decomposition (Proposition F.1):
$$V^\star - V_{\theta^\star}(\pi^k) \le V_{\theta^k}(\pi^k) - V_{\theta^\star}(\pi^k) \le D_{\mathrm{TV}}\big( \mathbb{P}^{\pi^k}_{\theta^k}, \mathbb{P}^{\pi^k}_{\theta^\star} \big) \le \sum_{h=0}^{H} \mathbb{E}_{\tau_{h-1} \sim \pi^k}\big[ \mathcal{E}^\star_{k,h}(\tau_{h-1}) \big], \qquad (10)$$
where
$$\mathcal{E}^\star_{k,0} := \frac{1}{2} \big\| \mathbf{B}^k_{H:1} \big( q^k_0 - q^\star_0 \big) \big\|_\Pi, \qquad \mathcal{E}^\star_{k,h}(\tau_{h-1}) := \max_\pi \frac{1}{2} \sum_{o_h, a_h} \pi(a_h | o_h) \big\| \mathbf{B}^k_{H:h+1} \big( \mathbf{B}^k_h(o_h, a_h) - \mathbf{B}^\star_h(o_h, a_h) \big) q^\star(\tau_{h-1}) \big\|_\Pi, \qquad (11)$$
and for the ground truth PSR $\theta^\star$ and the OMLE estimates $\theta^k$ from Algorithm 1, we have defined $\{\mathbf{B}^\star_h, q^\star_0\}$ and $\{\mathbf{B}^k_h, q^k_0\}$ as their B-representations, and $\{\mathbf{B}^\star_{H:h}\}$ and $\{\mathbf{B}^k_{H:h}\}$ as the corresponding B-operators. Eq. (10) follows by expanding $\mathbb{P}^{\pi^k}_{\theta^k}(\tau)$ and $\mathbb{P}^{\pi^k}_{\theta^\star}(\tau)$ (within the TV distance) using the B-representation and telescoping (Proposition F.1). This decomposition is similar to the ones in Liu et al. (2022a); Zhan et al. (2022), but more refined: it keeps the $\mathbf{B}^k_{H:h+1}$ term in (11) (instead of bounding it right away), and uses the $\Pi$-norm (3) instead of the $\ell_1$-norm as the error metric.

Step 2: Bounding the squared B-errors. By the standard fast-rate guarantee of MLE in squared Hellinger distance (Proposition G.2(b)), we have $\sum_{t=1}^{k-1} \sum_{h=0}^{H} D^2_{\mathrm{H}}\big( \mathbb{P}^{\pi^t_{h,\exp}}_{\theta^k}, \mathbb{P}^{\pi^t_{h,\exp}}_{\theta^\star} \big) \le 2\beta$ for all $k \in [K]$.
Next, using the B-stability of the PSR, we have for any $1 \le t < k \le K$ that (Proposition F.2)
$$\sum_{h=0}^{H} \mathbb{E}_{\pi^t}\big[ \mathcal{E}^\star_{k,h}(\tau_{h-1})^2 \big] \le 32 \Lambda_B^2 A U_A \sum_{h=0}^{H} D^2_{\mathrm{H}}\big( \mathbb{P}^{\pi^t_{h,\exp}}_{\theta^k}, \mathbb{P}^{\pi^t_{h,\exp}}_{\theta^\star} \big). \qquad (12)$$
Plugging the MLE guarantee into (12) and summing over $t \in [k-1]$ yields that, for all $k \in [K]$,
$$\sum_{t=1}^{k-1} \sum_{h=0}^{H} \mathbb{E}_{\pi^t}\big[ \mathcal{E}^\star_{k,h}(\tau_{h-1})^2 \big] \le O\big( \Lambda_B^2 A U_A \beta \big). \qquad (13)$$
Eq. (13) is more refined than e.g. Liu et al. (2022a, Lemma 11), as (13) controls the second moment of the errors, whereas their result only controls the first moment of a similar error.

Step 3: Generalized $\ell_2$-Eluder argument. We now have (13) as a precondition and bounding (10) as our target. The only remaining difference is that (13) controls the error $\mathcal{E}^\star_k$ with respect to $\{\pi^t\}_{t \le k-1}$, whereas (10) requires controlling the error $\mathcal{E}^\star_k$ with respect to $\pi^k$. To this end, we perform a generalized $\ell_2$-Eluder argument adapted to the structure of the functions $\mathcal{E}^\star_k$ (Proposition E.1), which implies that when $d_{\mathrm{PSR}} \le d$,
$$\bigg( \sum_{t=1}^{k} \mathbb{E}_{\pi^t}\big[ \mathcal{E}^\star_{t,h}(\tau_{h-1}) \big] \bigg)^2 \lesssim d\iota \cdot \bigg( k + \sum_{t=1}^{k} \sum_{s=1}^{t-1} \mathbb{E}_{\pi^s}\big[ \mathcal{E}^\star_{t,h}(\tau_{h-1})^2 \big] \bigg), \qquad \forall (k,h) \in [K] \times [H]. \qquad (14)$$
Note that such an $\ell_2$-type Eluder argument is possible precisely because our precondition (13) is an $\ell_2$-type (second-moment) bound, whereas our target (10) only involves first moments.

C TECHNICAL TOOLS

C.1 TECHNICAL LEMMAS

Lemma C.1 (Hellinger conditioning lemma (Chen et al., 2022, Lemma A.4)). For any pair of random variables $(X, Y)$, it holds that
$$\mathbb{E}_{X \sim \mathbb{P}_X}\big[ D^2_{\mathrm{H}}\big( \mathbb{P}_{Y|X}, \mathbb{Q}_{Y|X} \big) \big] \le 2 D^2_{\mathrm{H}}\big( \mathbb{P}_{X,Y}, \mathbb{Q}_{X,Y} \big).$$
We will also use the following strong duality result for (generalized) min-max problems:
$$\sup_{X \in \Delta_0(\mathcal{X})} \inf_{Y \in \Delta(\mathcal{Y})} \mathbb{E}_{x \sim X} \mathbb{E}_{y \sim Y}\big[ f(x, y) \big] = \inf_{Y \in \Delta(\mathcal{Y})} \sup_{x \in \mathcal{X}} \mathbb{E}_{y \sim Y}\big[ f(x, y) \big],$$
where $\Delta_0(\mathcal{X})$ stands for the space of finitely supported distributions on $\mathcal{X}$. We will also use the following standard concentration inequality (see e.g. Foster et al. (2021, Lemma A.4)) when analyzing the algorithm OMLE. Lemma C.3.
For a sequence of real-valued random variables $(X_t)_{t \le T}$ adapted to a filtration $(\mathcal{F}_t)_{t \le T}$, the following holds with probability at least $1 - \delta$:
$$\sum_{s=1}^{t} -\log \mathbb{E}\big[ \exp(-X_s) \,\big|\, \mathcal{F}_{s-1} \big] \le \sum_{s=1}^{t} X_s + \log(1/\delta), \qquad \forall t \in [T].$$
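As a quick Monte Carlo sanity check of Lemma C.3 (our illustration, not part of the proof): for i.i.d. $X_s \sim N(0,1)$ we have $-\log \mathbb{E}[\exp(-X_s) | \mathcal{F}_{s-1}] = -1/2$, so the lemma asserts that $-t/2 \le \sum_{s \le t} X_s + \log(1/\delta)$ for all $t$, with probability at least $1 - \delta$. The sketch below (parameters ours) checks this empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, T, delta = 2000, 50, 0.01

# For X_s ~ N(0,1) i.i.d., -log E[exp(-X_s)] = -1/2, so Lemma C.3 asserts that
# with probability >= 1 - delta:  -t/2 <= sum_{s<=t} X_s + log(1/delta) for all t.
X = rng.normal(size=(n_trials, T))
S = np.cumsum(X, axis=1)                       # partial sums over s <= t
t = np.arange(1, T + 1)
ok = np.all(-t / 2 <= S + np.log(1 / delta), axis=1)

# The failure rate should be at most delta (we allow slack for Monte Carlo error).
assert ok.mean() >= 1 - 2 * delta
```

For Gaussians this bound is essentially tight: the failure probability of the uniform-in-$t$ event is close to $\delta$ itself, which is why the check uses a small slack factor.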

C.2 COVERING NUMBER

In this section, we present the definition of the optimistic covering number $N_\Theta$. Suppose that we have a model class $\Theta$ such that each $\theta \in \Theta$ parameterizes a sequential decision process. The $\rho$-optimistic covering number of $\Theta$ is defined as follows.

Definition C.4 (Optimistic cover). Suppose that there is a context space $\mathcal{X}$. An optimistic $\rho$-cover of $\Theta$ is a tuple $(\widetilde{\mathbb{P}}, \Theta_0)$, where $\Theta_0 \subset \Theta$ is a finite set, and $\widetilde{\mathbb{P}} = \big\{ \widetilde{\mathbb{P}}^\pi_{\theta_0}(\cdot) \in \mathbb{R}^{\mathcal{T}^H}_{\ge 0} \big\}_{\theta_0 \in \Theta_0, \pi \in \Pi}$ specifies an optimistic likelihood function for each $\theta_0 \in \Theta_0$, such that: (1) for every $\theta \in \Theta$, there exists $\theta_0 \in \Theta_0$ satisfying $\widetilde{\mathbb{P}}^\pi_{\theta_0}(\tau) \ge \mathbb{P}^\pi_\theta(\tau)$ for all $\tau \in \mathcal{T}^H$ and all $\pi$; (2) for every $\theta_0 \in \Theta_0$, $\max_\pi \big\| \mathbb{P}^\pi_{\theta_0}(\tau_H = \cdot) - \widetilde{\mathbb{P}}^\pi_{\theta_0}(\tau_H = \cdot) \big\|_1 \le \rho^2$. The optimistic covering number $N_\Theta(\rho)$ is defined as the minimal cardinality of $\Theta_0$ such that there exists $\widetilde{\mathbb{P}}$ for which $(\widetilde{\mathbb{P}}, \Theta_0)$ is an optimistic $\rho$-cover of $\Theta$. The above definition is adapted from prior work (see e.g. Chen et al. (2022)).

D PROOFS FOR SECTION 3

D.1 B-REPRESENTATION

Proposition D.1. A sequential decision process is a PSR with core test sets $(\mathcal{U}_h)_{h \in [H]}$ (in the sense of Definition 1) if and only if it admits a B-representation with respect to $(\mathcal{U}_h)_{h \in [H]}$ (in the sense of Definition 2).

Proof of Proposition D.1. We first show that a PSR admits a B-representation. Suppose we have a PSR with core test sets $(\mathcal{U}_h)_{h \in [H]}$ satisfying Definition 1, with associated vectors $\{ b_{t_h, h} \in \mathbb{R}^{\mathcal{U}_h} \}_{h \in [H], t_h \in \mathcal{T}_h}$ given by (1). Then, define
$$\mathbf{B}_h(o, a) := \big[ b^\top_{(o,a,t),h} \big]_{t \in \mathcal{U}_{h+1}} \in \mathbb{R}^{\mathcal{U}_{h+1} \times \mathcal{U}_h}, \qquad q_0 := \big[ \mathbb{P}(t) \big]_{t \in \mathcal{U}_1} \in \mathbb{R}^{\mathcal{U}_1},$$
where the matrix $\mathbf{B}_h(o,a)$ has rows $b^\top_{(o,a,t),h}$. We show that this gives a B-representation of the PSR. By (1), we have for all $(h, \tau_{h-1}, o, a)$ that
$$\mathbf{B}_h(o, a) q(\tau_{h-1}) = \big[ \mathbb{P}(o, a, t_{h+1} | \tau_{h-1}) \big]_{t_{h+1} \in \mathcal{U}_{h+1}} = \mathbb{P}(o_h = o | \tau_{h-1}) \times q(\tau_{h-1}, o, a).$$
Applying this formula recursively, we obtain $\mathbf{B}_{h:1}(\tau_h) q_0 = \mathbb{P}(\tau_h) \times q(\tau_h) = [\mathbb{P}(\tau_h, t_{h+1})]_{t_{h+1} \in \mathcal{U}_{h+1}}$, which completes the verification of (2) in Definition 2. We next show that a process admitting a B-representation is a PSR.
Suppose we have a sequential decision process that admits a B-representation with respect to $(\mathcal{U}_h)_{h \in [H]}$ as in Definition 2. Fix $h \in [H]$. We first claim that, to construct vectors $(b_{t_h, h})_{t_h} \subset \mathbb{R}^{\mathcal{U}_h}$ such that $\mathbb{P}(t_h | \tau_{h-1}) = \langle b_{t_h, h}, q(\tau_{h-1}) \rangle$ for all tests $t_h$ and histories $\tau_{h-1}$ (Definition 1), we only need to construct such vectors for full-length tests $t_h = (o_{h:H+1}, a_{h:H})$. Indeed, suppose we have assigned $b_{t_h, h}$ for all full-length $t_h$. Then for any other $t_h = (o_{h:h+W-1}, a_{h:h+W-2})$ with $h + W - 1 < H + 1$ (non-full-length), take
$$b_{t_h, h} = \sum_{o_{h+W:H+1}} b_{(t_h, o_{h+W:H+1}, a'_{h+W-1:H}), h},$$
where $a'_{h+W-1:H} \in \mathcal{A}^{H-h-W+2}$ is an arbitrary and fixed action sequence. For this choice we have
$$\langle b_{t_h, h}, q(\tau_{h-1}) \rangle = \sum_{o_{h+W:H+1}} \big\langle b_{(t_h, o_{h+W:H+1}, a'_{h+W-1:H}), h}, q(\tau_{h-1}) \big\rangle = \sum_{o_{h+W:H+1}} \mathbb{P}\big( t_h, o_{h+W:H+1}, a'_{h+W-1:H} \,\big|\, \tau_{h-1} \big) = \mathbb{P}(t_h | \tau_{h-1}),$$
as desired. It remains to construct $b_{t_h, h}$ for all full-length tests. For any full-length test $t_h = (o_{h:H+1}, a_{h:H})$, take $b_{t_h, h} \in \mathbb{R}^{\mathcal{U}_h}$ with
$$b^\top_{t_h, h} = \mathbf{B}_H(o_H, a_H) \cdots \mathbf{B}_h(o_h, a_h) \in \mathbb{R}^{1 \times \mathcal{U}_h}.$$
By definition of the B-representation, for any history $\tau_h = (o_1, a_1, \cdots, o_h, a_h)$ and any test $t_{h+1} \in \mathcal{U}_{h+1}$, we have $\mathbb{P}(\tau_h) \times \mathbb{P}(t_{h+1} | \tau_h) = e^\top_{t_{h+1}} \mathbf{B}_{h:1}(\tau_h) q_0$, or in vector form,
$$\mathbb{P}(\tau_h) \times q(\tau_h) = \mathbf{B}_{h:1}(\tau_h) q_0, \qquad (15)$$
where we recall $\mathbb{P}(\tau_h) = \mathbb{P}(o_1, \cdots, o_h | \mathrm{do}(a_1, \cdots, a_h))$. Therefore, for the particular full history $\tau_H = (\tau_{h-1}, t_h)$, we have by applying (15) twice (for steps $H$ and $h-1$) that
$$\mathbb{P}(\tau_H) = \mathbf{B}_{H:1}(\tau_H) q_0 = \mathbf{B}_{H:h}(o_{h:H}, a_{h:H}) \mathbf{B}_{h-1:1}(\tau_{h-1}) q_0 = b^\top_{t_h, h} \big( \mathbb{P}(\tau_{h-1}) \times q(\tau_{h-1}) \big).$$
Dividing both sides by $\mathbb{P}(\tau_{h-1})$ (when it is nonzero), we get
$$\mathbb{P}(t_h | \tau_{h-1}) = \mathbb{P}(\tau_H | \tau_{h-1}) = \mathbb{P}(\tau_H) / \mathbb{P}(\tau_{h-1}) = b^\top_{t_h, h} q(\tau_{h-1}). \qquad (16)$$
This verifies (1) for all $\tau_{h-1}$ that are reachable. For $\tau_{h-1}$ that are not reachable, (16) also holds as both sides equal zero by our convention. This completes the verification of (1) in Definition 1.
From the proof above, we can extract the following basic property of B-representations.

Corollary D.2. Consider a PSR model with B-representation $\{ \{\mathbf{B}_h(o_h, a_h)\}_{h, o_h, a_h}, q_0 \}$. For $0 \le h \le H - 1$, it holds that
$$\mathbb{P}(o_h | \tau_{h-1}) \times q(\tau_{h-1}, o_h, a_h) = \mathbf{B}_h(o_h, a_h) q(\tau_{h-1}).$$
Furthermore, it holds that
$$\mathbf{B}_{H:h}(\tau_{h:H}) q(\tau_{h-1}) = \mathbb{P}(\tau_{h:H} | \tau_{h-1}). \qquad (17)$$

D.2 WEAK B-STABILITY CONDITION

In this section, we define a weaker structural condition on PSRs, named the weak B-stability condition. In the remaining appendices, the proofs of our main sample complexity guarantees (Theorems 9, 10, H.4, and H.6) will assume only the less stringent weak B-stability condition; therefore, these main results hold under both $\Lambda_B$-stability (Definition 4) and weak $\Lambda_B$-stability (Definition D.4) simultaneously. To define weak B-stability, we first extend our definition of the $\Pi$-norm to $\mathbb{R}^{\mathcal{T}}$ for any set $\mathcal{T}$ of tests. Recall that in (3) we have defined the $\Pi$-norm on $\mathbb{R}^{\mathcal{T}}$ with $\mathcal{T} = (\mathcal{O} \times \mathcal{A})^{H-h}$ (and in (5), the $\Pi'$-norm for $\mathcal{T} = \mathcal{U}_h$).

Definition D.3 ($\Pi$-norm for general test sets). For any set $\mathcal{T}$ of tests, we equip $\mathbb{R}^{\mathcal{T}}$ with $\|\cdot\|_\Pi$ defined by
$$\|v\|_\Pi := \max_{\mathcal{T}' \subset \mathcal{T}} \max_\pi \sum_{t \in \mathcal{T}'} \pi(t) |v(t)|, \qquad v \in \mathbb{R}^{\mathcal{T}},$$
where $\max_{\mathcal{T}' \subset \mathcal{T}}$ is taken over all subsets $\mathcal{T}'$ of $\mathcal{T}$ satisfying the prefix condition: there are no two $t \ne t' \in \mathcal{T}'$ such that $t$ is a prefix of $t'$. It is straightforward to see that, for any $v \in \mathbb{R}^{\mathcal{U}_h}$, we have $\|v\|_1 \ge \|v\|_\Pi \ge \|v\|_{\Pi'}$.

Definition D.4 (Weak B-stability). A PSR is weakly B-stable with parameter $\Lambda_B \ge 1$ (henceforth weakly $\Lambda_B$-stable) if it admits a B-representation and associated B-operators $\{\mathbf{B}_{H:h}\}_{h \in [H]}$ such that, for any $h \in [H]$ and $p, q \in \mathbb{R}^{\mathcal{U}_h}_{\ge 0}$, we have
$$\big\| \mathbf{B}_{H:h}(p - q) \big\|_\Pi \le \Lambda_B \sqrt{2 \big( \|p\|_\Pi + \|q\|_\Pi \big)} \, \big\| \sqrt{p} - \sqrt{q} \big\|_2.$$

Despite the seemingly different form, we can show that the weak B-stability condition is indeed weaker than the B-stability condition. Furthermore, a converse also holds: B-stability is implied by weak B-stability if we are willing to pay a $\sqrt{2 U_A}$ factor. This is given by the proposition below.

Proposition D.5. If a PSR is B-stable with parameter $\Lambda_B$, then it is weakly B-stable with the same parameter $\Lambda_B$. Conversely, if a PSR is weakly B-stable with parameter $\Lambda_B$ (cf. Definition D.4), then it is B-stable with parameter $\sqrt{2 U_A} \Lambda_B$.

Proof of Proposition D.5. We first show that B-stability implies weak B-stability.
Fix $h \in [H]$. We only need to show that, for $p, q \in \mathbb{R}^{\mathcal{U}_h}_{\ge 0}$, we have
$$\|p - q\|_* \le \sqrt{2 \big( \|p\|_\Pi + \|q\|_\Pi \big)} \, \big\| \sqrt{p} - \sqrt{q} \big\|_2. \qquad (19)$$
We show this inequality by bounding the $(1,2)$-norm and the $\Pi'$-norm separately. Combining the two resulting inequalities completes the proof of (19), which gives the first claim of Proposition D.5.

Next, we show that weak B-stability implies B-stability up to a $\sqrt{2 U_A}$ factor. Fix $h \in [H]$. For $x \in \mathbb{R}^{\mathcal{U}_h}$, we take $p = [x]_+$, $q = [x]_-$; then it suffices to show that
$$\sqrt{2 \big( \|p\|_\Pi + \|q\|_\Pi \big)} \, \big\| \sqrt{p} - \sqrt{q} \big\|_2 \le \sqrt{2 U_A} \, \|x\|_*. \qquad (20)$$
Indeed, we have
$$\big\| \sqrt{p} - \sqrt{q} \big\|_2 = \big\| \sqrt{|x|} \big\|_2 = \sqrt{\|x\|_1}, \qquad \|p\|_\Pi + \|q\|_\Pi \le \big\| [x]_+ \big\|_1 + \big\| [x]_- \big\|_1 = \|x\|_1.$$
This implies that $\sqrt{2(\|p\|_\Pi + \|q\|_\Pi)} \, \|\sqrt{p} - \sqrt{q}\|_2 \le \sqrt{2} \, \|x\|_1$. Applying Lemma D.6 completes the proof of (20), and hence proves the second claim of Proposition D.5.

Lemma D.6. Consider the fused norm $\|\cdot\|_*$ as defined in (4). For any $q \in \mathbb{R}^{\mathcal{U}_h}$, we have $\|q\|_* \le \|q\|_1 \le |\mathcal{U}_{A,h}|^{1/2} \|q\|_*$.

Proof of Lemma D.6. By definition, we clearly have $\|q\|_{1,2} \le \|q\|_1$ and $\|q\|_{\Pi'} \le \|q\|_1$. On the other hand, by the Cauchy-Schwarz inequality,
$$\|q\|_1^2 = \bigg( \sum_{(o,a) \in \mathcal{U}_h} |q(o,a)| \bigg)^2 \le |\mathcal{U}_{A,h}| \sum_{a \in \mathcal{U}_{A,h}} \bigg( \sum_{o : (o,a) \in \mathcal{U}_h} |q(o,a)| \bigg)^2 \le |\mathcal{U}_{A,h}| \, \|q\|_*^2.$$
Combining the inequalities above completes the proof.

D.3 PROOFS FOR SECTION 3.2

D.3.1 REVEALING POMDPS

$\ell_1$-revealing condition. We first remark that, besides the revealing condition using $\ell_2$ norms defined in Example 5, we also consider an $\ell_1$ version of the revealing condition, which measures the $\ell_1$-operator norm of a left inverse $\mathbb{M}^+_h$ of $\mathbb{M}_h$, instead of the $\ell_2$-operator norm of the pseudo-inverse $\mathbb{M}^\dagger_h$. Concretely, we say a POMDP satisfies the $m$-step $\alpha_{\mathrm{rev},\ell_1}$ $\ell_1$-revealing condition if there exists a matrix $\mathbb{M}^+_h$ such that $\mathbb{M}^+_h \mathbb{M}_h = \mathbf{I}$ and $\|\mathbb{M}^+_h\|_{1 \to 1} \le \alpha^{-1}_{\mathrm{rev},\ell_1}$. In Proposition D.7, we also show that any $m$-step $\alpha_{\mathrm{rev},\ell_1}$ $\ell_1$-revealing POMDP is a $\Lambda_B$-stable PSR with core test sets $\mathcal{U}_h = (\mathcal{O} \times \mathcal{A})^{\min\{m-1, H-h\}} \times \mathcal{O}$ and $\Lambda_B \le \sqrt{A^{m-1}} \alpha^{-1}_{\mathrm{rev},\ell_1}$.
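The elementary estimates used in the proof of Proposition D.5 can be checked numerically. The following sketch (ours, for illustration only) verifies the Cauchy-Schwarz inequality $\|p - q\|_1 \le \sqrt{2(\|p\|_1 + \|q\|_1)} \, \|\sqrt{p} - \sqrt{q}\|_2$ (an $\ell_1$ analogue of (19), using $\|\cdot\|_\Pi \le \|\cdot\|_1$) and the positive/negative-part identities used in the converse direction:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(100):
    p, q = rng.random(20), rng.random(20)       # nonnegative vectors
    lhs = np.abs(p - q).sum()                   # ||p - q||_1
    rhs = np.sqrt(2 * (p.sum() + q.sum())) * np.linalg.norm(np.sqrt(p) - np.sqrt(q))
    assert lhs <= rhs + 1e-9                    # |p-q| = |sqrt(p)-sqrt(q)|(sqrt(p)+sqrt(q)) + Cauchy-Schwarz

    # Positive/negative parts of x = p - q, as used in the proof of (20):
    x = p - q
    xp, xm = np.maximum(x, 0), np.maximum(-x, 0)
    assert np.isclose(np.abs(x).sum(), xp.sum() + xm.sum())
    # Disjoint supports give ||sqrt([x]_+) - sqrt([x]_-)||_2^2 = ||x||_1:
    assert np.isclose(np.linalg.norm(np.sqrt(xp) - np.sqrt(xm)) ** 2, np.abs(x).sum())
```

The first assertion is exactly the factorization $|p - q| = |\sqrt{p} - \sqrt{q}|(\sqrt{p} + \sqrt{q})$ followed by Cauchy-Schwarz and $\|\sqrt{p} + \sqrt{q}\|_2^2 \le 2(\|p\|_1 + \|q\|_1)$.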
We now consider Example 5 and show that any $m$-step revealing POMDP admits a B-representation that is B-stable. By definition, the initial predictive state is given by $q_0 = \mathbb{M}_1 \mu_1$. For $h \le H - m$, we take
$$\mathbf{B}_h(o_h, a_h) = \mathbb{M}_{h+1} \mathbb{T}_{h, a_h} \, \mathrm{diag}\big( \mathbb{O}_h(o_h | \cdot) \big) \, \mathbb{M}^+_h \in \mathbb{R}^{\mathcal{U}_{h+1} \times \mathcal{U}_h}, \qquad (22)$$
where $\mathbb{T}_{h,a} := \mathbb{T}_h(\cdot | \cdot, a) \in \mathbb{R}^{\mathcal{S} \times \mathcal{S}}$ is the transition matrix of action $a \in \mathcal{A}$, and $\mathbb{M}^+_h$ is any left inverse of $\mathbb{M}_h$. When $h > H - m$, we simply take
$$\mathbf{B}_h(o_h, a_h) = \big[ \mathbf{1}\{ t_h = (o_h, a_h, t_{h+1}) \} \big]_{(t_{h+1}, t_h) \in \mathcal{U}_{h+1} \times \mathcal{U}_h} \in \mathbb{R}^{\mathcal{U}_{h+1} \times \mathcal{U}_h}, \qquad (23)$$
where $\mathbf{1}\{ t_h = (o_h, a_h, t_{h+1}) \}$ is $1$ if $t_h$ equals $(o_h, a_h, t_{h+1})$, and $0$ otherwise.

Proposition D.7 (Weakly revealing POMDPs are B-stable). For an $m$-step revealing POMDP, (22) and (23) indeed give a B-representation, which is B-stable with $\Lambda_B \le \max_h \|\mathbb{M}^+_h\|_{* \to 1}$, where $\|\mathbb{M}^+_h\|_{* \to 1} := \max_{x \in \mathbb{R}^{\mathcal{U}_h} : \|x\|_* \le 1} \|\mathbb{M}^+_h x\|_1$.

Therefore, any $m$-step $\alpha_{\mathrm{rev}}$-weakly revealing POMDP is B-stable with $\Lambda_B \le \sqrt{S} \alpha^{-1}_{\mathrm{rev}}$ (by taking $\mathbb{M}^+_h = \mathbb{M}^\dagger_h$, and using $\|\cdot\|_2 \le \|\cdot\|_*$ and $\|\cdot\|_1 \le \sqrt{S} \|\cdot\|_2$). Similarly, any $m$-step $\alpha_{\mathrm{rev},\ell_1}$ $\ell_1$-revealing POMDP is B-stable with $\Lambda_B \le \sqrt{A^{m-1}} \alpha^{-1}_{\mathrm{rev},\ell_1}$ (using $\|\cdot\|_1 \le \sqrt{U_A} \|\cdot\|_*$ with $U_A = A^{m-1}$). For succinctness, we only provide the proof of a more general result (Proposition D.12). Besides, by a similar argument, we can also show that the parameter $R_B$ appearing in Theorem 9 can be bounded as $R_B \le \alpha^{-1}_{\mathrm{rev}} A^m$ (for $\alpha_{\mathrm{rev}}$-weakly revealing POMDPs) or $R_B \le \alpha^{-1}_{\mathrm{rev},\ell_1} A^m$ (for $\alpha_{\mathrm{rev},\ell_1}$ $\ell_1$-revealing POMDPs; see e.g. Lemma D.13).
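The construction above can be verified numerically on a small random instance. The sketch below (our illustration; all sizes and the random instance are ours) builds a tabular $1$-step revealing POMDP (emission matrices with more observations than states, so the $1$-step core tests are single observations and $\mathbb{M}_h$ is the emission matrix), forms the B-representation (22)/(23) with the left inverse taken to be the exact pseudo-inverse, and checks that $\mathbf{B}_{H:1}(\tau_H) q_0$ reproduces the trajectory probabilities computed by the standard forward algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
S, O, A, H = 3, 5, 2, 4

def rand_stoch(shape):
    """Random column-stochastic matrix (each column is a distribution)."""
    M = rng.random(shape)
    return M / M.sum(axis=0, keepdims=True)

mu1 = rand_stoch((S, 1))[:, 0]
Obs = [rand_stoch((O, S)) for _ in range(H)]                        # O_h(o|s); rank S since O > S
Trans = [[rand_stoch((S, S)) for _ in range(A)] for _ in range(H)]  # T_{h,a}(s'|s)

# With m = 1, the core tests are single observations, so M_h = Obs[h]
# and we use the pseudo-inverse as the left inverse in (22).
M = Obs
B = [[[None] * A for _ in range(O)] for _ in range(H)]
for h in range(H - 1):                                # (22): h <= H - m
    for o in range(O):
        for a in range(A):
            B[h][o][a] = M[h + 1] @ Trans[h][a] @ np.diag(Obs[h][o]) @ np.linalg.pinv(M[h])
for o in range(O):                                    # (23): last step, trivial U_{H+1}
    for a in range(A):
        B[H - 1][o][a] = np.eye(O)[o][None, :]

q0 = M[0] @ mu1
actions = [0, 1, 0, 1]
total = 0.0
for obs_seq in np.ndindex(*([O] * H)):
    v = q0                                            # B-representation product
    for h in range(H):
        v = B[h][obs_seq[h]][actions[h]] @ v
    p_B = v.item()
    b = mu1                                           # forward algorithm for comparison
    for h in range(H):
        b = Obs[h][obs_seq[h]] * b                    # condition on o_h
        if h < H - 1:
            b = Trans[h][actions[h]] @ b
    assert abs(p_B - b.sum()) < 1e-10
    total += p_B
assert abs(total - 1.0) < 1e-8                        # probabilities sum to one
```

The cancellation $\mathbb{M}^\dagger_h \mathbb{M}_h = \mathbf{I}$ (valid because the emissions have full column rank) is exactly what makes the telescoping product collapse to the forward recursion.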

D.3.2 LATENT MDPS

In this section, we follow Kwon et al. (2021) to show that latent MDPs form a sub-class of POMDPs, and then derive the sample complexity of our algorithms for learning latent MDPs.

Example D.8 (Latent MDP). A latent MDP $M$ is specified by a tuple $\{ \mathcal{S}, \mathcal{A}, (M_m)_{m=1}^N, H, \nu \}$, where $M_1, \cdots, M_N$ are $N$ MDPs with joint state space $\mathcal{S}$, joint action space $\mathcal{A}$, horizon $H$, and $\nu \in \Delta([N])$ is the mixing distribution over $M_1, \cdots, M_N$. For $m \in [N]$, the transition dynamics of $M_m$ are specified by $(\mathbb{T}_{h,m} : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S}))_{h=1}^H$ along with the initial state distribution $\mu_m$, and at step $h$ a binary random reward $r_h$ is generated with probability $R_{h,m} : \mathcal{S} \times \mathcal{A} \to [0,1]$.

Clearly, $M$ can be cast into a POMDP $M'$ with state space $\bar{\mathcal{S}} = [N] \times \mathcal{S} \times \{0,1\}$ and observation space $\mathcal{O} = \mathcal{S} \times \{0,1\}$, by taking the latent state to be $\bar{s}_h = (m, s_h, r_{h-1}) \in \bar{\mathcal{S}}$ and the observation to be $o_h = (s_h, r_{h-1}) \in \mathcal{O}$. More specifically, at the start of each episode, the environment generates $m \sim \nu$ and a state $s \sim \mu_m$; then the initial latent state is $\bar{s}_1 = (m, s, 0)$ and $o_1 = (s, 0)$. At each step $h$, the agent takes $a_h$ after receiving $o_h$; then the environment generates $r_h \in \{0,1\}$, $\bar{s}_{h+1}$, and $o_{h+1}$ according to $(s_h, a_h)$: $r_h = 1$ with probability $R_{h,m}(s_h, a_h)$, $s_{h+1} \sim \mathbb{T}_{h,m}(\cdot | s_h, a_h)$, $\bar{s}_{h+1} = (m, s_{h+1}, r_h)$, and $o_{h+1} = (s_{h+1}, r_h)$.

In a latent MDP, we denote by $\mathcal{T}_h$ the set of all possible sequences of the form $(a_h, r_h, s_{h+1}, \cdots, a_{h+l-1}, r_{h+l-1}, s_{h+l})$ (called a test in (Kwon et al., 2021)). For $h \le H - l + 1$, $t = (a_h, r_h, s_{h+1}, \cdots, a_{h+l-1}, r_{h+l-1}, s_{h+l}) \in \mathcal{T}_h$, and $s \in \mathcal{S}$, we can define
$$\mathbb{P}_{h,m}(t | s) = \mathbb{P}_m\big( r_h, s_{h+1}, \cdots, r_{h+l-1}, s_{h+l} \,\big|\, s_h = s, \mathrm{do}(a_h, \cdots, a_{h+l-1}) \big),$$
where $\mathbb{P}_m$ stands for the probability distribution under the MDP $M_m$.

Definition D.9 (Sufficient tests for latent MDPs).
A latent MDP $M$ is said to be $l$-step test-sufficient if for all $h \in [H - l + 1]$ and $s \in \mathcal{S}$, the matrix $\mathbb{L}_h(s)$ given by
$$\mathbb{L}_h(s) := \big[ \mathbb{P}_{h,m}(t | s) \big]_{(t, m) \in \mathcal{T}_h \times [N]} \in \mathbb{R}^{\mathcal{T}_h \times N}$$
has rank $N$. $M$ is $l$-step $\sigma$-test-sufficient if $\sigma_N(\mathbb{L}_h(s)) \ge \sigma$ for all $h \in [H - l + 1]$ and $s \in \mathcal{S}$.

Under test sufficiency, the latent MDP is an $(l+1)$-step $\sigma$-weakly revealing POMDP, as shown in (Zhan et al., 2022, Lemma 12). Hence, as a corollary of Proposition D.7, using the fact that $|\bar{\mathcal{S}}| = 2SN$, we have the following result.

Proposition D.10 (Latent MDPs are B-stable). For an $l$-step $\sigma$-test-sufficient latent MDP $M$, its equivalent POMDP $M'$ is $(l+1)$-step $\sigma$-weakly revealing, and thus B-stable with $\Lambda_B \le \sqrt{2SN} \sigma^{-1}$.

Therefore, by a reasoning similar to that for $m$-step revealing POMDPs in Section 5 (and Appendix D.3.1), our algorithms OMLE/EXPLORATIVE E2D/MOPS achieve a sample complexity of $\tilde{O}\big( S^2 N^2 A^{l+1} H^2 \log N_\Theta / (\sigma^2 \varepsilon^2) \big)$ for learning an $\varepsilon$-optimal policy, where $\Theta$ is the class of all such latent MDPs. Further, the optimistic covering number of $\Theta$ can be bounded (similarly to (Liu et al., 2022a, Appendix B) and Appendix D.3.4) as $\log N_\Theta(\rho) \le \tilde{O}(N S^2 A H)$. Thus, we achieve an $\tilde{O}\big( S^4 N^3 A^{l+2} H^3 \sigma^{-2} \varepsilon^{-2} \big)$ sample complexity. This improves over the result of Kwon et al. (2021), which requires extra assumptions including reachability, a gap between the $N$ MDP transitions, and a full-rank condition on histories (Kwon et al., 2021, Condition 2.2). Besides, our result does not require extra assumptions on core histories, which are needed for deriving sample complexities from the $\alpha_{\mathrm{psr}}$-regularity of (Zhan et al., 2022) and could be rather unnatural for latent MDPs. We remark that the argument above generalizes straightforwardly to low-rank latent MDPs, achieving a sample complexity of $\tilde{O}\big( d^2_{\mathrm{trans}} N^2 A^{l+2} H^2 \log N_\Theta / (\sigma^2 \varepsilon^2) \big)$; for more details, see Appendix D.3.3.

Proof of Proposition D.10. As is pointed out by Zhan et al.
(2022), the $(l+1)$-step emission matrix of $M'$ has a relatively simple form: for $h \in [H - l + 1]$, $\bar{s} = (m, s, r) \in \bar{\mathcal{S}}$, and $t_h = (o_h, a_h, \cdots, o_{h+l}) \in \mathcal{U}_h$ (with $o_{h+1} = (s_{h+1}, r_h), \cdots, o_{h+l} = (s_{h+l}, r_{h+l-1})$), we have
$$\mathbb{M}_h(t, \bar{s}) = \mathbf{1}\{ o_h = (s_h, r_{h-1}) \} \, \mathbb{P}_m\big( r_h, s_{h+1}, \cdots, r_{h+l-1}, s_{h+l} \,\big|\, s_h = s, \mathrm{do}(a_h, \cdots, a_{h+l-1}) \big),$$
where $\mathbf{1}\{ o_h = (s_h, r_{h-1}) \}$ is $1$ when $o_h = (s_h, r_{h-1})$ and $0$ otherwise. Therefore, up to some permutation of rows and columns, $\mathbb{M}_h$ is block-diagonal with blocks $\mathbb{L}_h(s^{(1)}), \mathbb{L}_h(s^{(1)}), \mathbb{L}_h(s^{(2)}), \mathbb{L}_h(s^{(2)}), \cdots$ (each $\mathbb{L}_h(s)$ appearing twice, once for each value of the reward bit), so that $\sigma_{2SN}(\mathbb{M}_h) \ge \min_s \sigma_N(\mathbb{L}_h(s)) \ge \sigma$, i.e. $M'$ is $(l+1)$-step $\sigma$-weakly revealing.

D.3.3 LOW-RANK POMDPS

In this section, we provide a detailed discussion of low-rank POMDPs and the $m$-step future-sufficiency condition mentioned in Example 6. We present a slightly generalized version of the $m$-step future-sufficiency condition defined in (Wang et al., 2022); see also (Cai et al., 2022). Recall the $m$-step emission-action matrices $\mathbb{M}_h \in \mathbb{R}^{\mathcal{U}_h \times \mathcal{S}}$ defined in (7).

Definition D.11 ($m$-step $\nu$-future-sufficient POMDP). We say a low-rank POMDP is $m$-step $\nu$-future-sufficient if for all $h \in [H]$, $\min_{\mathbb{M}^\sharp_h} \|\mathbb{M}^\sharp_h\|_{1 \to 1} \le \nu$, where $\min_{\mathbb{M}^\sharp_h}$ is taken over all possible $\mathbb{M}^\sharp_h$ such that $\mathbb{M}^\sharp_h \mathbb{M}_h \mathbb{T}_{h-1} = \mathbb{T}_{h-1}$.

Wang et al. (2022) consider a factorization of the latent transition $\mathbb{T}_h = \Psi_h \Phi_h$ with $\Psi_h \in \mathbb{R}^{\mathcal{S} \times d_{\mathrm{trans}}}$, $\Phi_h \in \mathbb{R}^{d_{\mathrm{trans}} \times (\mathcal{S} \times \mathcal{A})}$ for $h \in [H]$, and assume that $\|\mathbb{M}^\sharp_h\|_{1 \to 1} \le \nu$ with the specific choice $\mathbb{M}^\sharp_h = \Psi_{h-1} (\mathbb{M}_h \Psi_{h-1})^\dagger$ (note that this is an exact pseudo-inverse rather than an arbitrary generalized left inverse). It is straightforward to check that this choice indeed satisfies $\mathbb{M}^\sharp_h \mathbb{M}_h \mathbb{T}_{h-1} = \mathbb{T}_{h-1}$, using which Definition D.11 recovers the definition of Wang et al. (2022). It also encompasses the setting of Cai et al. (2022) ($m = 1$). We show that the following (along with (23)) gives a B-representation for the POMDP:
$$\mathbf{B}_h(o, a) = \mathbb{M}_{h+1} \mathbb{T}_{h,a} \, \mathrm{diag}\big( \mathbb{O}_h(o | \cdot) \big) \, \mathbb{M}^\sharp_h, \qquad h \in [H - m]. \qquad (24)$$
This generalizes the choice of B-representation in (22) for (tabular) revealing POMDPs, as the matrix $\mathbb{M}^\sharp_h$ can be thought of as a "generalized pseudo-inverse" of $\mathbb{M}_h$ that is aware of the subspace spanned by $\mathbb{T}_{h-1}$. This choice is more suitable when $S$ or $O$ is extremely large, in which case the vanilla pseudo-inverse $\mathbb{M}^\dagger_h$ may not be bounded in the $\|\cdot\|_{1 \to 1}$ norm. In the tabular case, setting $\sharp = \dagger$ in (24) recovers (22).

Proposition D.12 (Future-sufficient low-rank POMDPs are B-stable). The operators $(\mathbf{B}_h(o,a))_{h,o,a}$ given by (24) (with the case $h > H - m$ given by (23)) indeed form a B-representation, and it is B-stable with $\Lambda_B \le \sqrt{A^{m-1}} \max_h \|\mathbb{M}^\sharp_h\|_{1 \to 1}$. As a corollary, any $m$-step $\nu$-future-sufficient low-rank POMDP admits a B-representation with $\Lambda_B \le \sqrt{A^{m-1}} \nu$ (and also $R_B \le A^m \nu$).

Combining Proposition D.12 with Theorem 10 gives the sample complexity guarantee of Algorithm 2 for future-sufficient POMDPs. For the algorithm OMLE, combining $R_B \le A^m \nu$ with Theorem 9 establishes the sample complexity of OMLE, as claimed in Section 5.

Proof of Proposition D.12. First, we verify (2) for $0 \le h \le H - m$. In this case, for $t_{h+1} \in \mathcal{U}_{h+1}$, we have
$$e^\top_{t_{h+1}} \mathbf{B}_{h:1}(\tau_h) q_0 = e^\top_{t_{h+1}} \prod_{l=1}^{h} \big[ \mathbb{M}_{l+1} \mathbb{T}_{l,a_l} \, \mathrm{diag}(\mathbb{O}_l(o_l|\cdot)) \, \mathbb{M}^\sharp_l \big] \, \mathbb{M}_1 \mu_1 \overset{(i)}{=} e^\top_{t_{h+1}} \mathbb{M}_{h+1} \mathbb{T}_{h,a_h} \, \mathrm{diag}(\mathbb{O}_h(o_h|\cdot)) \cdots \mathbb{T}_{1,a_1} \, \mathrm{diag}(\mathbb{O}_1(o_1|\cdot)) \, \mu_1$$
$$\overset{(ii)}{=} \sum_{s_1, s_2, \cdots, s_{h+1}} \mathbb{P}(t_{h+1}|s_{h+1}) \, \mathbb{T}_{h,a_h}(s_{h+1}|s_h) \, \mathbb{O}_h(o_h|s_h) \cdots \mathbb{T}_1(s_2|s_1) \, \mathbb{O}_1(o_1|s_1) \, \mu_1(s_1) = \mathbb{P}(\tau_h, t_{h+1}), \qquad (25)$$
where (i) is due to $\mathbb{M}^\sharp_l \mathbb{M}_l \mathbb{T}_{l-1} = \mathbb{T}_{l-1}$ for $1 \le l \le h$, and in (ii) we use the definition (7) to deduce that the $(t_{h+1}, s_{h+1})$-entry of $\mathbb{M}_{h+1}$ is $\mathbb{P}(t_{h+1}|s_{h+1})$. Finally, we verify (2) for $H - m < h < H$. In this case $\mathcal{U}_{h+1} = \mathcal{O}^{H-h} \times \mathcal{A}^{H-h-1}$, and hence for $\tau_h = (o_1, a_1, \cdots, o_h, a_h)$ and $t_{h+1} = (o_{h+1}, a_{h+1}, \cdots, o_H) \in \mathcal{U}_{h+1}$, we consider $t_{H-m+1} = (o_{H-m+1}, a_{H-m+1}, \cdots, o_H)$:
$$e^\top_{t_{h+1}} \mathbf{B}_{h:1}(\tau_h) q_0 = e^\top_{t_{H-m+1}} \mathbf{B}_{H-m:1}(\tau_{H-m}) q_0 = \mathbb{P}(t_{H-m+1}, \tau_{H-m}) = \mathbb{P}(t_{h+1}, \tau_h).$$
It remains to verify that the B-representation is $\Lambda_B$-stable with $\Lambda_B \le \sqrt{A^{m-1}} \nu$ and $R_B \le A^m \nu$. To this end, we invoke the following lemma.

Lemma D.13. For $1 \le h \le H$ and $x \in \mathbb{R}^{\mathcal{U}_h}$, it holds that
$$\|\mathbf{B}_{H:h} x\|_\Pi = \max_\pi \sum_{\tau_{h:H}} \big| \mathbf{B}_H(o_H, a_H) \cdots \mathbf{B}_h(o_h, a_h) x \big| \times \pi(\tau_{h:H}) \le \max\big\{ \|\mathbb{M}^\sharp_h x\|_1, \|x\|_\Pi \big\}.$$
Similarly, we have $\sum_{o,a} \|\mathbf{B}_h(o,a) v\|_1 \le \max\{ A^m \|\mathbb{M}^\sharp_h v\|_1, A \|v\|_1 \}$.

By Lemma D.13, it holds that
$$\|\mathbf{B}_{H:h} x\|_\Pi \le \max\{ \nu \|x\|_1, \|x\|_\Pi \} \le \nu \|x\|_1 \le \nu \sqrt{U_A} \|x\|_* = \nu \sqrt{A^{m-1}} \|x\|_*,$$
where the second inequality is because $\|x\|_\Pi \le \|x\|_1$ and $\nu \ge 1$, and the third inequality is due to Lemma D.6 and $\|x\|_* \ge \|x\|_{\Pi'}$ by definition. Similarly, we have $R_B \le A^m \nu$. This concludes the proof of Proposition D.12.

Proof of Lemma D.13. We first consider the case $h > H - m$. Then for each $h \le l \le H$, $\mathbf{B}_l$ is given by (23), and hence for any trajectory $\tau_{h:H} = (o_h, a_h, \cdots, o_H, a_H)$ and $x \in \mathbb{R}^{\mathcal{U}_h}$, it holds that $\mathbf{B}_{H:h}(\tau_{h:H}) x = x(o_h, a_h, \cdots, o_H)$. This directly implies $\|\mathbf{B}_{H:h} x\|_\Pi = \|x\|_\Pi$ and $\sum_{o,a} \|\mathbf{B}_h(o,a) x\|_1 = A \|x\|_1$.

We next consider the case $h \le H - m$. Note that for $\tau_{h:H} = (o_h, a_h, \cdots, o_{H-m}, a_{H-m}, \cdots, o_H)$, we can denote $t_{H-m+1} = (o_{H-m+1}, a_{H-m+1}, \cdots, o_H)$; then, similarly to (25), we have
$$\mathbf{B}_{H:h}(\tau_{h:H}) = e^\top_{t_{H-m+1}} \mathbb{M}_{H-m+1} \bigg[ \prod_{l=h}^{H-m} \mathbb{T}_{l,a_l} \, \mathrm{diag}(\mathbb{O}_l(o_l|\cdot)) \bigg] \mathbb{M}^\sharp_h = \sum_{s_h, \cdots, s_{H-m+1}} \mathbb{P}(t_{H-m+1}|s_{H-m+1}) \bigg[ \prod_{l=h}^{H-m} \mathbb{T}_{l,a_l}(s_{l+1}|s_l) \, \mathbb{O}_l(o_l|s_l) \bigg] e^\top_{s_h} \mathbb{M}^\sharp_h = \sum_{s \in \mathcal{S}} \mathbb{P}(\tau_{h:H} | s_h = s) \, e^\top_s \mathbb{M}^\sharp_h.$$
Therefore, for any policy $\pi$ and trajectory $\tau_{h:H}$, it holds that
$$\pi(\tau_{h:H}) \times \mathbf{B}_{H:h}(\tau_{h:H}) x = \sum_{s \in \mathcal{S}} \mathbb{P}^\pi(\tau_{h:H} | s_h = s) \times e^\top_s \mathbb{M}^\sharp_h x,$$
and this directly gives $\|\mathbf{B}_{H:h} x\|_\Pi \le \|\mathbb{M}^\sharp_h x\|_1$. Besides, we similarly have
$$\sum_{o,a} \|\mathbf{B}_h(o,a) v\|_1 = \sum_{o,a} \big\| \mathbb{M}_{h+1} \mathbb{T}_{h,a} \, \mathrm{diag}(\mathbb{O}_h(o|\cdot)) \, \mathbb{M}^\sharp_h v \big\|_1 \le A |\mathcal{U}_{A,h+1}| \, \|\mathbb{M}^\sharp_h v\|_1.$$
The proof is completed by combining the two cases above.

D.3.4 LINEAR POMDPS

We say a POMDP is linear with respect to a given feature tuple $\Psi = (\phi_h : \mathcal{S} \to \mathbb{R}^{d_{s,1}},\ \psi_h : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{d_{s,2}},\ \varphi_h : \mathcal{O} \times \mathcal{S} \to \mathbb{R}^{d_o})_h$ if there exist $\mathbb{A}_h \in \mathbb{R}^{d_{s,1} \times d_{s,2}}$, $u_h \in \mathbb{R}^{d_o}$, $v \in \mathbb{R}^{d_{s,1}}$ such that
$$\mathbb{T}_h(s'|s,a) = \phi_h(s')^\top \mathbb{A}_h \psi_h(s,a), \qquad \mu_1(s) = \langle v, \phi_0(s) \rangle, \qquad \mathbb{O}_h(o|s) = \langle u_h, \varphi_h(o|s) \rangle.$$
We further assume a standard normalization condition: for $R := \max\{ d_{s,1}, d_{s,2}, d_o \}$,
$$\sum_{s'} \|\phi_h(s')\|_1 \le R, \quad \|\psi_h(s,a)\|_1 \le R, \quad \sum_o \|\varphi_h(o|s)\|_1 \le R, \quad \|\mathbb{A}_h\|_{\infty,\infty} \le R, \quad \|v\|_\infty \le R, \quad \|u_h\|_\infty \le R.$$

Proposition D.15. Suppose that $\Theta$ is the set of models that are linear with respect to a given $\Psi$ and have parameters bounded by $R$. Then the optimistic covering number of $\Theta$ can be bounded as $\log N_\Theta(\rho) \le \tilde{O}\big( \min\{d_{s,1}, d_{s,2}\} (d_{s,1} d_{s,2} + d_o) H \big)$.

As a corollary, our algorithms achieve a sample complexity of $\tilde{O}\big( \min\{d_{s,1}, d_{s,2}\} (d_{s,1} d_{s,2} + d_o) \Lambda_B^2 A U_A H^3 / \varepsilon^2 \big)$ for learning an $\varepsilon$-optimal policy in $\Lambda_B$-stable linear POMDPs (which include e.g. revealing and decodable linear POMDPs). This result significantly improves over the result extracted from (Zhan et al., 2022, Corollary 6.5): assuming their $\alpha_{\mathrm{psr}}$-regularity, we have $\Lambda_B \le \sqrt{U_A} \alpha^{-1}_{\mathrm{psr}}$ (Example 8) and thus obtain a sample complexity of $\tilde{O}\big( \min\{d_{s,1}, d_{s,2}\} (d_{s,1} d_{s,2} + d_o) A U_A^2 H^3 / (\alpha^2_{\mathrm{psr}} \varepsilon^2) \big)$. This only scales with $d^3 A U_A^2$ (where $d \ge \max\{d_{s,1}, d_{s,2}, d_o\}$), whereas their results involve much larger polynomial factors of all three parameters. Further, apart from the dimension dependence, their covering number scales with an additional $\log O$ (and thus their result does not handle extremely large observation spaces).

Proof of Proposition D.15. In the following, we construct an optimistic covering of $\Theta$ from optimistic coverings of its per-step factors (the transition and emission components, denoted $\Theta_{h;t}$ and $\Theta_{h;o}$ below). We demonstrate how to construct a $\rho$-optimistic covering for $\Theta_{h;o}$; the construction for $\Theta_{h;t}$ is essentially the same. In the following, we follow the idea of (Chen et al., 2022, Proposition H.15). Fix $h \in [H]$ and set $N = \lceil R/\rho \rceil$.
Let $R' = N\rho$. For $u \in [-R', R']^{d_o}$, we define the $\rho$-neighborhood of $u$ as $B_\infty(u, \rho) := \rho \lfloor u/\rho \rfloor + [0, \rho]^{d_o}$, and let
$$\widetilde{\mathbb{O}}_{h;u}(o|s) := \max_{u' \in B_\infty(u, \rho)} \big\langle u', \varphi_h(o|s) \big\rangle.$$
Then the functions $\widetilde{\mathbb{O}}_{h;u}$, indexed by the grid points $u$ (with cardinality at most $(2N+1)^{d_o}$), give an optimistic covering of $\Theta_{h;o}$, and the per-step coverings can be combined into a covering of $\Theta$ by the following lemma.

Lemma D.16. Fix $\rho' \in (0,1]$ and let $\rho = \rho'/3H$. Suppose that for each $h$, $(\widetilde{\mathbb{T}}_h, \Theta'_{h;t})$ is a $\rho$-optimistic covering of $\Theta_{h;t}$ and $(\widetilde{\mathbb{O}}_h, \Theta'_{h;o})$ is a $\rho$-optimistic covering of $\Theta_{h;o}$. Then they induce a $\rho'$-optimistic covering of $\Theta$.

Proof of Lemma D.16. Fix $\rho' \in (0,1]$ and let $\rho = \rho'/3H$. Note that given a tuple of parameters $(\widetilde{\mu}_1, \widetilde{\mathbb{T}}, \widetilde{\mathbb{O}})$ (which does not necessarily induce a POMDP model), we can define $\widetilde{\mathbb{P}}$ as
$$\widetilde{\mathbb{P}}(\tau_H) = \sum_{s_1, \cdots, s_H} \widetilde{\mu}_1(s_1) \, \widetilde{\mathbb{O}}_1(o_1|s_1) \, \widetilde{\mathbb{T}}_1(s_2|s_1,a_1) \cdots \widetilde{\mathbb{T}}_{H-1}(s_H|s_{H-1},a_{H-1}) \, \widetilde{\mathbb{O}}_H(o_H|s_H),$$
and $\widetilde{\mathbb{P}}^\pi(\tau_H) = \pi(\tau_H) \times \widetilde{\mathbb{P}}(\tau_H)$. Then, for a tuple of parameters $(\mu_1, \mathbb{T}, \mathbb{O})$ that does induce a POMDP and satisfies
$$\|\widetilde{\mu}_1 - \mu_1\|_1 \le \rho^2, \qquad \max_{s,a,h} \big\| (\widetilde{\mathbb{T}}_h - \mathbb{T}_h)(\cdot|s,a) \big\|_1 \le \rho^2, \qquad \max_{s,h} \big\| (\widetilde{\mathbb{O}}_h - \mathbb{O}_h)(\cdot|s) \big\|_1 \le \rho^2,$$
it holds that
$$\big\| \widetilde{\mathbb{P}}^\pi(\cdot) - \mathbb{P}^\pi(\cdot) \big\|_1 = \sum_{\tau_H} \big| \widetilde{\mathbb{P}}^\pi(\tau_H) - \mathbb{P}^\pi(\tau_H) \big| \overset{(*)}{\le} 2H\rho^2 (1 + \rho^2)^{2H} \le 4H\rho^2 \le \rho'^2,$$
where we telescope the difference over the $2H$ factors (replacing $\widetilde{\mu}_1, \widetilde{\mathbb{O}}_1, \widetilde{\mathbb{T}}_1, \cdots, \widetilde{\mathbb{O}}_H$ by their exact counterparts one at a time), and $(*)$ is because $\sum_{s_{h+1}} \widetilde{\mathbb{T}}_h(s_{h+1}|s_h,a_h) \le 1 + \rho^2$ and $\sum_{o_h} \widetilde{\mathbb{O}}_h(o_h|s_h) \le 1 + \rho^2$ for all $h, s_h, a_h$. Therefore, suppose that for each $h$, $(\widetilde{\mathbb{T}}_h, \Theta'_{h;t})$ is a $\rho$-optimistic covering of $\Theta_{h;t}$ and $(\widetilde{\mathbb{O}}_h, \Theta'_{h;o})$ is a $\rho$-optimistic covering of $\Theta_{h;o}$; then we obtain a $\rho'$-optimistic covering $(\widetilde{\mathbb{P}}, \Theta')$ of $\Theta$, where $\Theta' = \Theta'_{0;t} \times \Theta'_{1;o} \times \Theta'_{1;t} \times \cdots \times \Theta'_{H-1;t} \times \Theta'_{H;o}$. This completes the proof.
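The grid-based optimistic covering above can be illustrated by a minimal one-dimensional sketch (ours, not from the paper): for a toy class of Bernoulli models parameterized by $p \in [0,1]$, the optimistic likelihood of a grid cell is the coordinate-wise maximum over the cell, which dominates every model in the cell while incurring $\ell_1$ error at most $\rho^2$, matching the two conditions of Definition C.4:

```python
import numpy as np

def optimistic_likelihood(p0, step):
    """Optimistic likelihoods for the cell [p0, p0 + step] of Bernoulli parameters:
    coordinate-wise max over the cell (cf. the construction of the O~_{h;u})."""
    return np.array([1 - p0, min(p0 + step, 1.0)])  # [P(o=0), P(o=1)]; need not sum to 1

rho = 0.1
step = rho ** 2                  # grid resolution, chosen so the l1 error is <= rho^2
for p in np.linspace(0, 1, 101):
    p0 = step * np.floor(p / step)              # cell containing p
    tilde = optimistic_likelihood(p0, step)
    P_p = np.array([1 - p, p])
    assert np.all(tilde >= P_p - 1e-9)          # condition (1): pointwise dominance
    P_p0 = np.array([1 - p0, p0])
    assert np.abs(tilde - P_p0).sum() <= rho ** 2 + 1e-9   # condition (2): l1 closeness
```

The number of cells here is $O(1/\rho^2)$, so $\log N_\Theta(\rho) = O(\log(1/\rho))$ for this toy class; in the linear POMDP construction the same counting happens per feature dimension.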

D.3.5 DECODABLE POMDPS

To construct a B-representation for decodable POMDPs, we introduce the following notation. For $h\le H-m$, we consider $t_h = (o_h,a_h,\dots,o_{h+m-1})\in\mathcal{U}_h$ and $t_{h+1} = (o'_{h+1},a'_{h+1},\dots,o'_{h+m})\in\mathcal{U}_{h+1}$, and define
$$P_h(t_{h+1}|t_h) = \begin{cases}\mathbb{P}\big(o_{h+m}=o'_{h+m}\,\big|\,s_{h+m-1}=\phi_{h+m-1}(t_h),\,a'_{h+m-1}\big), & \text{if } o_{h+1:h+m-1}=o'_{h+1:h+m-1}\text{ and }a_{h+1:h+m-2}=a'_{h+1:h+m-2},\\ 0, & \text{otherwise},\end{cases}$$
where $\phi_{h+m-1}$ is the decoder function that maps $t_h$ to the latent state $s_{h+m-1}$. Similarly, for $h>H-m$, $t_h\in\mathcal{U}_h$, $t_{h+1}\in\mathcal{U}_{h+1}$, we let $P_h(t_{h+1}|t_h)$ be $1$ if $t_h$ ends with $t_{h+1}$, and $0$ otherwise. Under this definition, for all $h\in[H]$, $t_h\in\mathcal{U}_h$, $t_{h+1}\in\mathcal{U}_{h+1}$, it is clear that
$$P_h(t_{h+1}|t_h) = \mathbb{P}(t_{h+1}|t_h,\tau_{h-1}) \qquad (27)$$
for any reachable $(\tau_{h-1},t_h)$, because of decodability. Hence, we can interpret $P_h(t_{h+1}|t_h)$ as the probability of observing $t_{h+1}$ conditional on observing $t_h$ at step $h$.[footnote 18] Then, for $h\in[H]$, we can take
$$\mathbf{B}_h(o,a) = \big[\mathbf{1}\{(o,a)\to t_h\}\,P_h(t_{h+1}|t_h)\big]_{(t_{h+1},t_h)\in\mathcal{U}_{h+1}\times\mathcal{U}_h}, \qquad (28)$$
where $\mathbf{1}\{(o,a)\to t_h\}$ is $1$ if $t_h$ starts with $(o,a)$ and $0$ otherwise.[footnote 19] We verify that (28) indeed gives a B-representation for decodable POMDPs:

Proposition D.17 (Decodable POMDPs are B-stable). (28) gives a B-stable B-representation of the $m$-step decodable POMDP, with $\Lambda_{\mathsf{B}} = 1$.

The results above already guarantee the sample complexity of EXPLORATIVE E2D for decodable POMDPs. For OMLE, we can similarly obtain $\sum_{o,a}\|\mathbf{B}_h(o,a)x\|_1 = A\|x\|_1$, and thus we can take $R_{\mathsf{B}} = A$. Combining this fact with Theorem 9 establishes the sample complexity of OMLE as claimed in Section 5.

Proof of Proposition D.17. We first verify that (28) gives a B-representation for decodable POMDPs. Note that for $h\in[H-1]$, $(o_h,a_h)\in\mathcal{O}\times\mathcal{A}$, and $t_{h+1}\in\mathcal{U}_{h+1}$, there is a unique element $t_h\in\mathcal{U}_h$ such that $t_h$ is a prefix of the trajectory $(o_h,a_h,t_{h+1})$, and it holds that
$$e_{t_{h+1}}^\top\mathbf{B}_h(o_h,a_h)x = P_h(t_{h+1}|t_h)\times x(t_h).$$
Applying this equality recursively, we obtain the following fact: for a trajectory $\tau_{h':h}$ and $t_{h+1}\in\mathcal{U}_{h+1}$, $(\tau_{h':h},t_{h+1})$ has a prefix $t_{h'}\in\mathcal{U}_{h'}$, and
$$e_{t_{h+1}}^\top\mathbf{B}_{h:h'}(\tau_{h':h})x = \mathbb{P}(\tau_{h':h},t_{h+1}|t_{h'})\times x(t_{h'}), \qquad (29)$$
where $\mathbb{P}(\tau_{h':h},t_{h+1}|t_{h'})$ stands for the probability of observing $(\tau_{h':h},t_{h+1})$ conditional on observing $t_{h'}$ at step $h'$, which is well-defined due to decodability (similarly to (27)). Taking $h'=1$ and $x=\mathbf{q}_0$ in (29), we have for any history $\tau_h$ and $t_{h+1}\in\mathcal{U}_{h+1}$ that $\mathbb{P}(\tau_h,t_{h+1}) = e_{t_{h+1}}^\top\mathbf{B}_{h:1}(\tau_h)\mathbf{q}_0$. Therefore, (28) indeed gives a B-representation of the decodable POMDP.

Furthermore, we can take $h=H$ in (29) to obtain: any trajectory $\tau_{h:H} = (o_h,a_h,\dots,o_H,a_H)$ has a prefix $t_h\in\mathcal{U}_h$, and $\mathbf{B}_{H:h}(\tau_{h:H})x = \mathbb{P}(\tau_{h:H}|t_h)\times x(t_h)$. Hence, for any policy $\pi$, it holds that
$$\sum_{\tau_{h:H}}\pi(\tau_{h:H})\times|\mathbf{B}_{H:h}(\tau_{h:H})x| = \sum_{\tau_{h:H}}\mathbb{P}^\pi(\tau_{h:H}|t_h)\times|x(t_h)| = \sum_{t_h\in\mathcal{U}_h}\pi(t_h)\times|x(t_h)|.$$
Therefore, $\|\mathbf{B}_{H:h}x\|_\Pi\le\|x\|_\Pi$ always. This completes the proof of Proposition D.17.

D.3.6 REGULAR PSRS

By (Zhan et al., 2022, Lemma 6), the PSR admits a B-representation such that $\mathrm{rowspan}(\mathbf{B}_h(o,a))\subset\mathrm{colspan}(\mathbf{D}_{h-1})$. In the following, we show that such a B-representation is indeed what we want. Fix a core matrix $\mathbf{K}_{h-1}$ of $\mathbf{D}_{h-1}$, and suppose that $\mathbf{K}_{h-1} = \big[\mathbf{q}(\tau^1_{h-1}),\dots,\mathbf{q}(\tau^d_{h-1})\big]$ with $d = \mathrm{rank}(\mathbf{D}_{h-1})$. Then it holds that
$$\|\mathbf{B}_{H:h}x\|_\Pi = \max_\pi\sum_{\tau_{h:H}}\pi(\tau_{h:H})\times\big|\mathbf{B}_{H:h}(\tau_{h:H})x\big| = \max_\pi\sum_{\tau_{h:H}}\pi(\tau_{h:H})\times\big|\mathbf{B}_{H:h}(\tau_{h:H})\mathbf{K}_{h-1}\mathbf{K}_{h-1}^\dagger x\big|$$
$$\le \max_\pi\sum_{\tau_{h:H}}\pi(\tau_{h:H})\times\sum_{j=1}^d\big|\mathbf{B}_{H:h}(\tau_{h:H})\mathbf{K}_{h-1}e_j\big|\cdot\big|e_j^\top\mathbf{K}_{h-1}^\dagger x\big| = \max_\pi\sum_{j=1}^d\big|e_j^\top\mathbf{K}_{h-1}^\dagger x\big|\times\sum_{\tau_{h:H}}\pi(\tau_{h:H})\times\big|\mathbf{B}_{H:h}(\tau_{h:H})\mathbf{q}(\tau^j_{h-1})\big|.$$
Notice that $\mathbf{B}_{H:h}(\tau_{h:H})\mathbf{q}(\tau^j_{h-1}) = \mathbb{P}(\tau_{h:H}|\tau^j_{h-1})$ by Corollary D.2, and hence for any policy $\pi$ we have
$$\sum_{\tau_{h:H}}\pi(\tau_{h:H})\times\big|\mathbf{B}_{H:h}(\tau_{h:H})\mathbf{q}(\tau^j_{h-1})\big| = \sum_{\tau_{h:H}}\mathbb{P}^\pi(\tau_{h:H}|\tau^j_{h-1}) = 1.$$
Therefore, it holds that $\|\mathbf{B}_{H:h}x\|_\Pi\le\|\mathbf{K}_{h-1}^\dagger x\|_1$ for $h\in[H]$ and any core matrix $\mathbf{K}_{h-1}$ of $\mathbf{D}_{h-1}$.
Similarly, we can pick a core matrix $\mathbf{K}_{h-1}$ such that $\|\mathbf{K}_{h-1}^\dagger\|_1\le\alpha_{\rm psr}^{-1}$, and then
$$\sum_{o,a}\|\mathbf{B}_h(o,a)x\|_1 = \sum_{o,a}\big\|\mathbf{B}_h(o,a)\mathbf{K}_{h-1}\mathbf{K}_{h-1}^\dagger x\big\|_1 \le A|\mathcal{U}_{A,h+1}|\,\big\|\mathbf{K}_{h-1}^\dagger x\big\|_1 \le \alpha_{\rm psr}^{-1}AU_A\|x\|_1.$$
This completes the proof.
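As a concrete sanity check of the decodable-POMDP construction (28), the sketch below instantiates it for a toy $1$-step decodable POMDP (a block MDP whose emission is the identity, so each test is a single observation and the decoder is the identity map), and verifies that $\mathbf{B}_{H:1}(\tau_H)\mathbf{q}_0$ reproduces the trajectory probabilities under every action sequence; all sizes are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
S = O = 2; A = 2; H = 3           # block MDP: states emit themselves, m = 1

mu1 = rng.random(S); mu1 /= mu1.sum()
T = []
for _ in range(A):                # T[a][s', s]: columns are distributions
    m = rng.random((S, S)); T.append(m / m.sum(axis=0, keepdims=True))

def B(h, o, a):
    # B_h(o,a)[t_{h+1}, t_h] as in (28); tests t are single observations
    if h < H:
        M = np.zeros((O, O))
        M[:, o] = T[a][:, o]      # P_h(t'|t=o) = P(o_{h+1}=t' | s_h = decode(o) = o, a)
        return M
    return np.eye(O)[o][None, :]  # h = H: picks the coordinate t_H = o

q0 = mu1                          # q0[t_1] = P(o_1 = t_1)
ok = True
for acts in itertools.product(range(A), repeat=H):
    total = 0.0
    for obs in itertools.product(range(O), repeat=H):
        x = q0.copy()
        for h in range(1, H + 1):
            x = B(h, obs[h-1], acts[h-1]) @ x
        direct = mu1[obs[0]]      # direct forward computation of P(tau_H | do(a))
        for h in range(H - 1):
            direct *= T[acts[h]][obs[h+1], obs[h]]
        ok &= np.isclose(x.item(), direct)
        total += direct
    ok &= np.isclose(total, 1.0)  # probabilities sum to 1 per action sequence
assert ok
```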

D.4 COMPARISON WITH WELL-CONDITIONED PSRS

Concurrent work by Liu et al. (2022b) defines the following class of well-conditioned PSRs.

Definition D.19. A PSR is $\gamma$-well-conditioned if it admits a B-representation such that for all $h\in[H]$, all policies $\pi$ (starting at step $h$), and all vectors $x\in\mathbb{R}^{\mathcal{U}_h}$, the following holds:
$$\sum_{\tau_{h:H}}\pi(\tau_{h:H})\times\big|\mathbf{B}_H(o_H,a_H)\cdots\mathbf{B}_h(o_h,a_h)x\big| \le \frac{1}{\gamma}\|x\|_1, \qquad (30)$$
$$\sum_{o_h,a_h}\pi(a_h|o_h)\times\|\mathbf{B}_h(o_h,a_h)x\|_1 \le \frac{1}{\gamma}\|x\|_1. \qquad (31)$$

By (30) and the inequality $\|x\|_1\le\sqrt{U_A}\|x\|_*$ (Lemma D.6), any $\gamma$-well-conditioned PSR is a B-stable PSR with $\Lambda_{\mathsf{B}}\le\sqrt{U_A}\,\gamma^{-1}$. Plugging this into our main results shows that, for well-conditioned PSRs, OMLE, EXPLORATIVE E2D and MOPS all achieve sample complexity
$$\tilde{\mathcal{O}}\Big(\frac{dAU_A^2H^2\log N_\Theta}{\gamma^2\varepsilon^2}\Big),$$
which is better than the sample complexity[footnote 20] $\tilde{\mathcal{O}}\big(d^2A^5U_A^3H^4\log N_\Theta/(\gamma^4\varepsilon^2)\big)$ achieved by the analysis of OMLE in Liu et al. (2022b). Also, being well-conditioned imposes the extra restriction (31) on the structure of the PSR, whereas our B-stability condition does not.

E DECORRELATION ARGUMENTS

In this section, we present two decorrelation propositions: the generalized $\ell_2$-Eluder argument (Proposition E.1) and the decoupling argument (Proposition E.6). These propositions are important steps in the proofs of the main theorems (Theorems 9, 10, H.4, and H.6). They are parallel to each other: Proposition E.1 is the triangular-to-diagonal version of decorrelation, used in the proof of Theorem 9 (see Appendix G for its proof), whereas Proposition E.6 is the expectation-to-expectation version, used in the proof of Theorem 10 (see Appendix I for its proof).

E.1 GENERALIZED ℓ 2 -ELUDER ARGUMENT

We first present the triangular-to-diagonal version of the decorrelation argument, the generalized $\ell_2$-Eluder argument.

Proposition E.1 (Generalized $\ell_2$-Eluder argument). Suppose we have sequences of vectors $\{x_{k,i}\}_{(k,i)\in[K]\times\mathcal{I}}\subset\mathbb{R}^d$ and $\{y_{k,j,r}\}_{(k,j,r)\in[K]\times[J]\times\mathcal{R}}\subset\mathbb{R}^d$, where $\mathcal{I},\mathcal{R}$ are arbitrary (abstract) index sets. Consider the functions $\{f_k:\mathbb{R}^d\to\mathbb{R}\}_{k\in[K]}$ given by
$$f_k(x) := \max_{r\in\mathcal{R}}\sum_{j=1}^J|\langle x,y_{k,j,r}\rangle|.$$
Assume that the following condition holds:
$$\sum_{t=1}^{k-1}\mathbb{E}_{i\sim q_t}\big[f_k(x_{t,i})^2\big]\le\beta_k, \qquad \forall k\in[K],$$
where $(q_k\in\Delta(\mathcal{I}))_{k\in[K]}$ is a family of distributions over $\mathcal{I}$. Then for any $M>0$, it holds that
$$\sum_{t=1}^kM\wedge\mathbb{E}_{i\sim q_t}[f_t(x_{t,i})] \le \sqrt{2d\Big(M^2k+\sum_{t=1}^k\beta_t\Big)\log\Big(1+\frac{k}{d}\cdot\frac{R_x^2R_y^2}{M^2}\Big)}, \qquad \forall k\in[K],$$
where $R_x^2 = \max_k\mathbb{E}_{i\sim q_k}[\|x_{k,i}\|_2^2]$ and $R_y = \max_{k,r}\sum_j\|y_{k,j,r}\|_2$.

We call this proposition the "generalized $\ell_2$-Eluder argument" because, when $\mathcal{I}$ is a single-element set and $\beta_k\equiv\beta$, the result reduces to:
$$\text{if }\sum_{t<k}f_k(x_t)^2\le\beta\text{ for all }k\in[K],\text{ then }\sum_{t=1}^k|f_t(x_t)|\le\tilde{\mathcal{O}}\big(\sqrt{d\beta k}\big), \qquad (32)$$
as long as $\max_t|f_t(x_t)|\le1$, which implies that the function class $\{f_t\}_t$ has Eluder dimension $\tilde{\mathcal{O}}(d)$. In particular, when $\{f_k\}_{k\in[K]}$ is given by $f_k(x) = |\langle y_k,x\rangle|$, (32) is equivalent to the standard $\ell_2$-Eluder argument for linear functions, which can be proved using the elliptical potential lemma (Lattimore & Szepesvári, 2020, Lemma 19.4). In the following, we present a corollary of Proposition E.1 that is more suitable for our applications.

Published as a conference paper at ICLR 2023

Corollary E.2. Suppose we have a sequence of functions $\{f_k:\mathbb{R}^n\to\mathbb{R}\}_{k\in[K]}$,
$$f_k(x) := \max_{r\in\mathcal{R}}\sum_{j=1}^J|\langle x,y_{k,j,r}\rangle|,$$
given by the family of vectors $\{y_{k,j,r}\}_{(k,j,r)\in[K]\times[J]\times\mathcal{R}}\subset\mathbb{R}^n$. Further assume that there exists $L>0$ such that $f_k(x)\le L\|x\|_1$. Consider further a sequence of vectors $(x_i)_{i\in\mathcal{I}}$ satisfying the condition
$$\sum_{t=1}^{k-1}\mathbb{E}_{i\sim q_t}\big[f_k^2(x_i)\big]\le\beta_k, \qquad \forall k\in[K],$$
such that the subspace spanned by $(x_i)_{i\in\mathcal{I}}$ has dimension at most $d$.
Then it holds that
$$\sum_{t=1}^k1\wedge\mathbb{E}_{i\sim q_t}[f_t(x_i)] \le \sqrt{4d\Big(k+\sum_{t=1}^k\beta_t\Big)\log\Big(1+kdL\max_i\|x_i\|_1\Big)}, \qquad \forall k\in[K].$$
We prove Proposition E.1 and Corollary E.2 in the following subsections.

Remark E.3. In the initial version of this paper, the statement of Corollary E.2 was slightly different from the above: under the same precondition,
$$\sum_{t=1}^k1\wedge\mathbb{E}_{i\sim q_t}[f_t(x_i)] \le \sqrt{4d\Big(k+\sum_{t=1}^k\beta_t\Big)\log\big(1+kdL\,\kappa_d(X)\big)}, \qquad \forall k\in[K],$$
where $X := [x_i]_{i\in\mathcal{I}}\in\mathbb{R}^{n\times\mathcal{I}}$ and $\kappa_d(X) = \min\{\|F_1\|_1\|F_2\|_1 : X = F_1F_2,\ F_1\in\mathbb{R}^{n\times d},\ F_2\in\mathbb{R}^{d\times\mathcal{I}}\}$. After our initial version, we noted that the concurrent work Liu et al. (2022b, Lemma G.3) essentially shows $\kappa_d(X)\le d\max_i\|x_i\|_1$ by an elegant argument using the Barycentric spanner. For the sake of simplicity, we have applied their result (cf. Lemma E.5) to make Corollary E.2 slightly more convenient to use. We also note that, in the initial version of this paper, the sample complexity in the statement of Theorem 9 involved a log factor $\iota := \log(1+Kd\Lambda_{\mathsf{B}}R_{\mathsf{B}}\kappa_d)$, where $\kappa_d := \max_h\kappa_d(\mathbf{D}_h)$, which we then tightly bounded for all concrete problem classes in terms of the corresponding problem parameters. The above change makes the statement slightly cleaner (though the result slightly looser) by always using the bound $\kappa_d\le dU_A$. The effect on the final result is however minor, as the sample complexity of OMLE depends on $\kappa_d$ only logarithmically through $\iota$, and the sample complexities of MOPS and EXPLORATIVE E2D do not involve this factor.

E.1.1 PROOF OF PROPOSITION E.1

To prove this proposition, we first show that it can be reduced to the case of linear functions (cf. (32)), extending the idea of the proof of (Liu et al., 2022a, Proposition 22). After that, we invoke a variant of the elliptical potential lemma to derive the desired inequality.

We first transform and reduce the problem. For every pair $(k,i)\in[K]\times\mathcal{I}$, we take $r^\star(k,i) := \arg\max_r\sum_j|\langle x_{k,i},y_{k,j,r}\rangle|$, and consider
$$\tilde{y}_{k,i,j} := y_{k,j,r^\star(k,i)} \qquad \forall(k,i,j)\in[K]\times\mathcal{I}\times[J].$$
We then define
$$\tilde{y}_{k,i} := \sum_j\tilde{y}_{k,i,j}\,\mathrm{sign}\,\langle\tilde{y}_{k,i,j},x_{k,i}\rangle \qquad \forall(k,i)\in[K]\times\mathcal{I}.$$
Under such a transformation, it holds for all $t,k,i,i'$ that
$$|\langle x_{t,i},\tilde{y}_{t,i}\rangle| = \sum_j\big|\big\langle x_{t,i},y_{t,j,r^\star(t,i)}\big\rangle\big| = \max_r\sum_j|\langle x_{t,i},y_{t,j,r}\rangle| = f_t(x_{t,i}),$$
$$|\langle x_{t,i},\tilde{y}_{k,i'}\rangle| \le \max_r\sum_j|\langle x_{t,i},y_{k,j,r}\rangle| = f_k(x_{t,i}), \qquad \|\tilde{y}_{k,i}\|_2\le R_y.$$
Therefore, it remains to bound $\sum_{t=1}^kM\wedge\mathbb{E}_{i\sim q_t}|\langle x_{t,i},\tilde{y}_{t,i}\rangle|$, under the condition that for all $k\in[K]$, $\sum_{t<k}\mathbb{E}_{i\sim q_t}[\max_{i'}|\langle x_{t,i},\tilde{y}_{k,i'}\rangle|^2]\le\beta_k$. To show this, we define $\Phi_t := \mathbb{E}_{i\sim q_t}[x_{t,i}x_{t,i}^\top]$, and take $\lambda_0 = M^2/R_y^2$ and $V_k := \lambda_0I+\sum_{t<k}\Phi_t$. Then
$$\sum_{t=1}^kM\wedge\mathbb{E}_{i\sim q_t}|\langle x_{t,i},\tilde{y}_{t,i}\rangle| \le \sum_{t=1}^k\min\Big\{M,\ \mathbb{E}_{i\sim q_t}\big[\|x_{t,i}\|_{V_t^{-1}}\|\tilde{y}_{t,i}\|_{V_t}\big]\Big\} \le \sum_{t=1}^k\min\Big\{M,\ \sqrt{(M^2+\beta_t)\,\mathbb{E}_{i\sim q_t}\big[\|x_{t,i}\|_{V_t^{-1}}^2\big]}\Big\}$$
$$\le \sum_{t=1}^k\sqrt{(M^2+\beta_t)\min\Big\{1,\ \mathbb{E}_{i\sim q_t}\big[\|x_{t,i}\|_{V_t^{-1}}^2\big]\Big\}} \le \Big(kM^2+\sum_{t=1}^k\beta_t\Big)^{1/2}\Big(\sum_{t=1}^k\min\Big\{1,\ \mathbb{E}_{i\sim q_t}\big[\|x_{t,i}\|_{V_t^{-1}}^2\big]\Big\}\Big)^{1/2},$$
where the second inequality is due to the fact that, for all $(t,i)$,
$$\|\tilde{y}_{t,i}\|_{V_t}^2 = \lambda_0\|\tilde{y}_{t,i}\|_2^2 + \sum_{s<t}\mathbb{E}_{i'\sim q_s}|\langle x_{s,i'},\tilde{y}_{t,i}\rangle|^2 \le M^2+\beta_t.$$
Note that
$$\mathbb{E}_{i\sim q_t}\big[\|x_{t,i}\|_{V_t^{-1}}^2\big] = \mathbb{E}_{i\sim q_t}\big[\mathrm{tr}\big(V_t^{-1/2}x_{t,i}x_{t,i}^\top V_t^{-1/2}\big)\big] = \mathrm{tr}\big(V_t^{-1/2}\Phi_tV_t^{-1/2}\big).$$
In order to bound the term $\sum_{t=1}^k\min\{1,\mathrm{tr}(V_t^{-1/2}\Phi_tV_t^{-1/2})\}$, we invoke the following standard lemma, which generalizes Lattimore & Szepesvári (2020, Lemma 19.4).

Lemma E.4 (Generalized elliptical potential lemma).
Let $\{\Phi_k\in\mathbb{R}^{d\times d}\}_{k\in[K]}$ be a sequence of symmetric positive semi-definite matrices, and $V_k := \lambda_0I+\sum_{t<k}\Phi_t$, where $\lambda_0>0$ is a fixed real. Then it holds that
$$\sum_{k=1}^K\min\Big\{1,\ \mathrm{tr}\big(V_k^{-1/2}\Phi_kV_k^{-1/2}\big)\Big\} \le 2d\log\Big(1+\frac{\sum_{k=1}^K\mathrm{tr}(\Phi_k)}{d\lambda_0}\Big).$$
Applying Lemma E.4 and noticing that $\mathrm{tr}(\Phi_t) = \mathbb{E}_{i\sim q_t}[\|x_{t,i}\|_2^2]\le R_x^2$, the proof of Proposition E.1 is completed.

Proof of Lemma E.4. By definition and basic linear algebra, we have $V_{k+1} = V_k^{1/2}\big(I+V_k^{-1/2}\Phi_kV_k^{-1/2}\big)V_k^{1/2}$, and hence $\det(V_{k+1}) = \det(V_k)\det\big(I+V_k^{-1/2}\Phi_kV_k^{-1/2}\big)$. Therefore,
$$\sum_{k=1}^K\min\Big\{1,\ \mathrm{tr}\big(V_k^{-1/2}\Phi_kV_k^{-1/2}\big)\Big\} \le \sum_{k=1}^K2\log\Big(1+\mathrm{tr}\big(V_k^{-1/2}\Phi_kV_k^{-1/2}\big)\Big) \le 2\sum_{k=1}^K\log\det\big(I+V_k^{-1/2}\Phi_kV_k^{-1/2}\big) = 2\sum_{k=1}^K\big[\log\det(V_{k+1})-\log\det(V_k)\big] = 2\log\frac{\det(V_{K+1})}{\det(V_1)},$$
where the first inequality is due to $\min\{1,u\}\le2\log(1+u)$ for all $u\ge0$, and the second inequality is because $\det(I+X)\ge1+\mathrm{tr}(X)$ for any positive semi-definite matrix $X$. Now, since $V_1 = \lambda_0I$ and
$$\log\det(V_{K+1}) \le \log\Big(\frac{\mathrm{tr}(V_{K+1})}{d}\Big)^d = d\log\Big(\lambda_0+\frac{\sum_{k=1}^K\mathrm{tr}(\Phi_k)}{d}\Big),$$
the claimed bound follows, which completes the proof of Lemma E.4.

E.1.2 PROOF OF COROLLARY E.2

Let us take a decomposition $x_i = Fv_i$, $\forall i\in\mathcal{I}$, such that $\|v_i\|_\infty\le1$ and $\|F\|_{1\to1}\le\max_i\|x_i\|_1$ (the existence of such a decomposition is guaranteed by Lemma E.5). We define $\tilde{f}_k:\mathbb{R}^d\to\mathbb{R}$ as follows:
$$\tilde{f}_k(v) := f_k(Fv) = \max_r\sum_j\big|\big\langle v,F^\top y_{k,j,r}\big\rangle\big|.$$
By definition, $\tilde{f}_k(v_i) = f_k(x_i)$, and hence our condition becomes $\sum_{t<k}\mathbb{E}_{i\sim q_t}[\tilde{f}_k^2(v_i)]\le\beta_k$ for all $k\in[K]$. Then applying Proposition E.1 (with $M=1$) gives, for all $k\in[K]$,
$$\sum_{t=1}^k1\wedge\mathbb{E}_{i\sim q_t}[f_t(x_i)] = \sum_{t=1}^k1\wedge\mathbb{E}_{i\sim q_t}\big[\tilde{f}_t(v_i)\big] \le \sqrt{2d\Big(k+\sum_{t=1}^k\beta_t\Big)\log\big(1+kd^{-1}R_2^2R_1^2\big)},$$
where $R_2\le\max_i\|v_i\|_2\le\sqrt{d}$, and
$$R_1 = \max_{k,r}\sum_j\big\|F^\top y_{k,j,r}\big\|_2 \le \max_{k,r}\sum_j\big\|F^\top y_{k,j,r}\big\|_1 = \max_{k,r}\sum_j\sum_{m=1}^d\big|e_m^\top F^\top y_{k,j,r}\big| = \max_{k,r}\sum_{m=1}^d\sum_j|\langle Fe_m,y_{k,j,r}\rangle| \le \max_k\sum_{m=1}^df_k(Fe_m) \le \sum_{m=1}^dL\|Fe_m\|_1 \le dL\|F\|_{1\to1} \le dL\max_i\|x_i\|_1.$$
Therefore, we have
$$\log\big(1+kd^{-1}R_1^2R_2^2\big) \le \log\Big(1+kd^2L^2\max_i\|x_i\|_1^2\Big) \le 2\log\Big(1+kdL\max_i\|x_i\|_1\Big),$$
which completes the proof of Corollary E.2.

The following lemma is an immediate consequence of Liu et al. (2022b, Lemma G.3).

Lemma E.5. Assume that a sequence of vectors $\{x_i\}_{i\in\mathcal{I}}\subset\mathbb{R}^n$ satisfies that $\mathrm{span}(x_i: i\in\mathcal{I})$ has dimension at most $d$, and $R = \max_i\|x_i\|_1<\infty$. Then there exist a sequence of vectors $\{v_i\}_{i\in\mathcal{I}}\subset\mathbb{R}^d$ and a matrix $F\in\mathbb{R}^{n\times d}$ such that $x_i = Fv_i$ for all $i\in\mathcal{I}$, with $\|v_i\|_\infty\le1$ and $\|F\|_{1\to1}\le R$.

Proof. Without loss of generality, we assume that $\mathcal{X} = \mathrm{span}(x_i: i\in\mathcal{I})$ has dimension exactly $d$. The closure of $\{x_i: i\in\mathcal{I}\}$ is then a compact subset of the $d$-dimensional subspace $\mathcal{X}\subset\mathbb{R}^n$, and we take a Barycentric spanner $\{w_1,\dots,w_d\}$ of it. By definition, for each $i\in\mathcal{I}$ there exist weights $(\alpha_{ij})_{1\le j\le d}$ such that $\alpha_{ij}\in[-1,1]$ and $x_i = \sum_{j=1}^d\alpha_{ij}w_j$. Therefore, we can take $v_i = [\alpha_{ij}]_{1\le j\le d}^\top\in\mathbb{R}^d$ and $F = [w_1,\dots,w_d]\in\mathbb{R}^{n\times d}$; since each $w_j$ lies in the closure of $\{x_i\}$, we also have $\|F\|_{1\to1} = \max_j\|w_j\|_1\le R$, so they fulfill the statement of Lemma E.5.
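Lemma E.5 can be checked numerically. The sketch below uses the standard fact that a maximum-$|\det|$ subset of $d$ vectors is a Barycentric spanner (found by brute force, so only for tiny instances; all sizes are illustrative), and verifies the three properties in the statement.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, d, N = 6, 3, 12
G = rng.standard_normal((n, d))
X = [G @ rng.uniform(-1, 1, size=d) for _ in range(N)]  # vectors in a d-dim subspace

Q, _ = np.linalg.qr(G)                      # orthonormal basis of the span
C = np.array([Q.T @ x for x in X])          # N x d coordinate vectors

# A max-|det| subset of size d is a Barycentric spanner of {x_i}
best = max(itertools.combinations(range(N), d),
           key=lambda idx: abs(np.linalg.det(C[list(idx)])))
F = np.stack([X[i] for i in best], axis=1)  # columns w_1, ..., w_d
W = C[list(best)].T                         # their coordinates, as columns

R = max(np.abs(x).sum() for x in X)
for x in X:
    v = np.linalg.solve(W, Q.T @ x)         # coefficients with x = F v
    assert np.allclose(F @ v, x)            # exact reconstruction
    assert np.max(np.abs(v)) <= 1 + 1e-9    # ||v_i||_inf <= 1 (spanner property)
assert max(np.abs(F).sum(axis=0)) <= R + 1e-9  # ||F||_{1->1} <= R
```

The spanner property follows from the exchange argument: replacing $w_j$ by $x_i$ multiplies the determinant by the $j$-th coefficient, which therefore cannot exceed $1$ in absolute value at a max-$|\det|$ subset.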

E.2 DECOUPLING ARGUMENT

Proposition E.1 can be regarded as a triangular-to-diagonal decorrelation result. In this section, we present its expectation-to-expectation analog, which is central for bounding the Explorative DEC.

Proposition E.6 (Decoupling argument). Suppose we have vectors and functions
$$\{x_i\}_{i\in\mathcal{I}}\subset\mathbb{R}^n, \qquad \{f_\theta:\mathbb{R}^n\to\mathbb{R}\}_{\theta\in\Theta},$$
where $\Theta,\mathcal{I}$ are arbitrary abstract index sets, with the functions $f_\theta$ given by
$$f_\theta(x) := \max_{r\in\mathcal{R}}\sum_{j=1}^J|\langle x,y_{\theta,j,r}\rangle|, \qquad \forall x\in\mathbb{R}^n,$$
where $\{y_{\theta,j,r}\}_{(\theta,j,r)\in\Theta\times[J]\times\mathcal{R}}\subset\mathbb{R}^n$ is a family of bounded vectors in $\mathbb{R}^n$. Then for any distribution $\mu$ over $\Theta$ and any probability family $\{q_\theta\}_{\theta\in\Theta}\subset\Delta(\mathcal{I})$,
$$\mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_\theta}[f_\theta(x_i)] \le \sqrt{d_{\mathcal{X}}\,\mathbb{E}_{\theta,\theta'\sim\mu}\mathbb{E}_{i\sim q_{\theta'}}\big[f_\theta(x_i)^2\big]},$$
where $d_{\mathcal{X}}$ is the dimension of the subspace of $\mathbb{R}^n$ spanned by $(x_i)_{i\in\mathcal{I}}$.

Proof of Proposition E.6. By the assumption that $\{y_{\theta,j,r}\}_{(\theta,j,r)}$ is a family of bounded vectors in $\mathbb{R}^n$, there exists $R_y<\infty$ such that $\sup_{\theta,r}\sum_{j=1}^J\|y_{\theta,j,r}\|_2\le R_y$. We follow the same two steps as in the proof of Proposition E.1.

First, we reduce the problem. We consider $r^\star(\theta,i) = \arg\max_{r\in\mathcal{R}}\sum_j|\langle x_i,y_{\theta,j,r}\rangle|$, and define the vectors
$$\tilde{y}_{\theta,i,j} = y_{\theta,j,r^\star(\theta,i)}, \qquad \tilde{y}_{\theta,i} = \sum_j\mathrm{sign}\,\langle x_i,\tilde{y}_{\theta,i,j}\rangle\,\tilde{y}_{\theta,i,j}.$$
Then for all $i,i'\in\mathcal{I}$ and $\theta,\theta'\in\Theta$,
$$\langle x_i,\tilde{y}_{\theta,i}\rangle = \sum_j|\langle x_i,\tilde{y}_{\theta,i,j}\rangle| = \sum_j\big|\big\langle x_i,y_{\theta,j,r^\star(\theta,i)}\big\rangle\big| = f_\theta(x_i),$$
$$|\langle x_i,\tilde{y}_{\theta',i'}\rangle| \le \sum_j|\langle x_i,\tilde{y}_{\theta',i',j}\rangle| = \sum_j\big|\big\langle x_i,y_{\theta',j,r^\star(\theta',i')}\big\rangle\big| \le f_{\theta'}(x_i), \qquad (33)$$
and $\|\tilde{y}_{\theta,i}\|_2\le\sum_j\|y_{\theta,j,r^\star(\theta,i)}\|_2\le R_y$. Therefore, it suffices to bound $\mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_\theta}[|\langle x_i,\tilde{y}_{\theta,i}\rangle|]$.

Next, we define $\Phi_\lambda := \lambda I+\Phi_0$ with $\Phi_0 := \mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_\theta}[x_ix_i^\top]$ and $\lambda>0$. Then we can bound the target as
$$\mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_\theta}[|\langle x_i,\tilde{y}_{\theta,i}\rangle|] \le \mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_\theta}\big[\|x_i\|_{\Phi_\lambda^{-1}}\|\tilde{y}_{\theta,i}\|_{\Phi_\lambda}\big] \le \Big[\mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_\theta}\|x_i\|_{\Phi_\lambda^{-1}}^2\Big]^{1/2}\Big[\mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_\theta}\|\tilde{y}_{\theta,i}\|_{\Phi_\lambda}^2\Big]^{1/2}.$$
The first term can be rewritten as
$$\mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_\theta}\big[\|x_i\|_{\Phi_\lambda^{-1}}^2\big] = \mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_\theta}\big[\mathrm{tr}\big(\Phi_\lambda^{-1/2}x_ix_i^\top\Phi_\lambda^{-1/2}\big)\big] = \mathrm{tr}\big(\Phi_\lambda^{-1/2}\Phi_0\Phi_\lambda^{-1/2}\big) \le \mathrm{rank}(\Phi_0) \le d_{\mathcal{X}}.$$
The second term can be bounded as
$$\mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_\theta}\|\tilde{y}_{\theta,i}\|_{\Phi_\lambda}^2 = \mathbb{E}_{\theta'\sim\mu}\mathbb{E}_{i'\sim q_{\theta'}}\|\tilde{y}_{\theta',i'}\|_{\Phi_\lambda}^2 = \mathbb{E}_{\theta'\sim\mu}\mathbb{E}_{i'\sim q_{\theta'}}\Big\{\lambda\|\tilde{y}_{\theta',i'}\|_2^2 + \mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_\theta}|\langle x_i,\tilde{y}_{\theta',i'}\rangle|^2\Big\} \le \mathbb{E}_{\theta'\sim\mu}\mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_\theta}\big[|f_{\theta'}(x_i)|^2\big] + \lambda R_y^2,$$
where the last inequality is due to (33). Letting $\lambda\to0^+$ completes the proof of Proposition E.6.
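Proposition E.6 is a finite inequality, so it can be checked directly on small instances. The sketch below draws a random family in the proposition's notation (all sizes illustrative, with $d_{\mathcal{X}}=3$) and verifies the claimed bound.

```python
import numpy as np

rng = np.random.default_rng(3)
n, dX, J, nTheta, nI, nR = 8, 3, 2, 5, 6, 3

G = rng.standard_normal((n, dX))
xs = [G @ rng.standard_normal(dX) for _ in range(nI)]  # x_i in a dX-dim subspace
ys = rng.standard_normal((nTheta, J, nR, n))           # y_{theta, j, r}

def f(theta, x):
    # f_theta(x) = max_r sum_j |<x, y_{theta,j,r}>|
    return max(sum(abs(x @ ys[theta, j, r]) for j in range(J)) for r in range(nR))

mu = rng.random(nTheta); mu /= mu.sum()                # distribution over Theta
q = rng.random((nTheta, nI)); q /= q.sum(axis=1, keepdims=True)  # q_theta

lhs = sum(mu[t] * q[t, i] * f(t, xs[i])
          for t in range(nTheta) for i in range(nI))
rhs_sq = sum(mu[t] * mu[tp] * q[tp, i] * f(t, xs[i]) ** 2
             for t in range(nTheta) for tp in range(nTheta) for i in range(nI))
assert lhs <= np.sqrt(dX * rhs_sq) + 1e-9              # the decoupling inequality
```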

F STRUCTURAL PROPERTIES OF B-STABLE PSRS

In this section, we present two important propositions that are used in the proofs of all the main theorems (Theorems 9, 10, H.4, and H.6). The first proposition bounds the performance difference of two PSR models by the B-errors. The second bounds the squared B-errors by the Hellinger distance between the observation probabilities of the two models.

F.1 PERFORMANCE DECOMPOSITION

We first present the performance decomposition proposition.

Proposition F.1 (Performance decomposition). Suppose that two PSR models $\theta,\bar\theta$ admit $\{(\mathbf{B}_h^\theta(o_h,a_h))_{h,o_h,a_h},\mathbf{q}_0^\theta\}$ and $\{(\mathbf{B}_h^{\bar\theta}(o_h,a_h))_{h,o_h,a_h},\mathbf{q}_0^{\bar\theta}\}$ as B-representations, respectively, and let $\{\mathbf{B}_{H:h}^\theta\}_{h\in[H]}$ and $\{\mathbf{B}_{H:h}^{\bar\theta}\}_{h\in[H]}$ be the associated B-operators. Define
$$\mathcal{E}_{\theta,h}^{\bar\theta}(\tau_{h-1}) := \frac{1}{2}\max_\pi\sum_{o_h,a_h}\pi(a_h|o_h)\,\Big\|\mathbf{B}_{H:h+1}^\theta\big(\mathbf{B}_h^\theta(o_h,a_h)-\mathbf{B}_h^{\bar\theta}(o_h,a_h)\big)\mathbf{q}^{\bar\theta}(\tau_{h-1})\Big\|_\Pi, \qquad \mathcal{E}_{\theta,0}^{\bar\theta} := \frac{1}{2}\Big\|\mathbf{B}_{H:1}^\theta\big(\mathbf{q}_0^\theta-\mathbf{q}_0^{\bar\theta}\big)\Big\|_\Pi.$$
Then it holds that
$$D_{\rm TV}\big(\mathbb{P}_\theta^\pi,\mathbb{P}_{\bar\theta}^\pi\big) \le \mathcal{E}_{\theta,0}^{\bar\theta} + \sum_{h=1}^H\mathbb{E}_{\bar\theta,\pi}\big[\mathcal{E}_{\theta,h}^{\bar\theta}(\tau_{h-1})\big],$$
where for $h\in[H]$ the expectation $\mathbb{E}_{\bar\theta,\pi}$ is taken over $\tau_{h-1}$ under model $\bar\theta$ and policy $\pi$.

Proof of Proposition F.1. By the definition of B-representation, we have $\mathbb{P}_\theta^\pi(\tau_H) = \pi(\tau_H)\times\mathbf{B}_{H:1}^\theta(\tau_H)\mathbf{q}_0^\theta$ for any PSR model $\theta$. Then for two different PSR models $\theta,\bar\theta$, we have
$$\mathbb{P}_\theta^\pi(\tau_H)-\mathbb{P}_{\bar\theta}^\pi(\tau_H) = \pi(\tau_H)\times\Big[\mathbf{B}_{H:1}^\theta(\tau_{1:H})\mathbf{q}_0^\theta-\mathbf{B}_{H:1}^{\bar\theta}(\tau_{1:H})\mathbf{q}_0^{\bar\theta}\Big]$$
$$= \pi(\tau_H)\times\mathbf{B}_{H:1}^\theta(\tau_{1:H})\big(\mathbf{q}_0^\theta-\mathbf{q}_0^{\bar\theta}\big) + \pi(\tau_H)\times\sum_{h=1}^H\mathbf{B}_{H:h+1}^\theta(\tau_{h+1:H})\big(\mathbf{B}_h^\theta(o_h,a_h)-\mathbf{B}_h^{\bar\theta}(o_h,a_h)\big)\mathbf{B}_{h-1:1}^{\bar\theta}(\tau_{h-1})\mathbf{q}_0^{\bar\theta}$$
$$= \pi(\tau_H)\times\mathbf{B}_{H:1}^\theta(\tau_H)\big(\mathbf{q}_0^\theta-\mathbf{q}_0^{\bar\theta}\big) + \sum_{h=1}^H\pi(\tau_{h:H})\times\mathbf{B}_{H:h+1}^\theta(\tau_{h+1:H})\big(\mathbf{B}_h^\theta(o_h,a_h)-\mathbf{B}_h^{\bar\theta}(o_h,a_h)\big)\mathbf{q}^{\bar\theta}(\tau_{h-1})\times\mathbb{P}_{\bar\theta}^\pi(\tau_{h-1}),$$
where the last equality is due to the definition of B-representation (see e.g. (15)). Therefore, we have
$$\frac{1}{2}\sum_{\tau_H}\big|\mathbb{P}_\theta^\pi(\tau_H)-\mathbb{P}_{\bar\theta}^\pi(\tau_H)\big| \le \frac{1}{2}\sum_{\tau_H}\pi(\tau_H)\times\big|\mathbf{B}_{H:1}^\theta(\tau_H)\big(\mathbf{q}_0^\theta-\mathbf{q}_0^{\bar\theta}\big)\big| + \frac{1}{2}\sum_{\tau_H}\sum_{h=1}^H\pi(\tau_{h:H})\times\big|\mathbf{B}_{H:h+1}^\theta(\tau_{h+1:H})\big(\mathbf{B}_h^\theta(o_h,a_h)-\mathbf{B}_h^{\bar\theta}(o_h,a_h)\big)\mathbf{q}^{\bar\theta}(\tau_{h-1})\big|\times\mathbb{P}_{\bar\theta}^\pi(\tau_{h-1})$$
$$\le \frac{1}{2}\Big\|\mathbf{B}_{H:1}^\theta\big(\mathbf{q}_0^\theta-\mathbf{q}_0^{\bar\theta}\big)\Big\|_\Pi + \frac{1}{2}\sum_{h=1}^H\sum_{\tau_{h-1}}\mathbb{P}_{\bar\theta}^\pi(\tau_{h-1})\times\max_{\pi'}\sum_{o_h,a_h}\pi'(a_h|o_h)\,\Big\|\mathbf{B}_{H:h+1}^\theta\big(\mathbf{B}_h^\theta(o_h,a_h)-\mathbf{B}_h^{\bar\theta}(o_h,a_h)\big)\mathbf{q}^{\bar\theta}(\tau_{h-1})\Big\|_\Pi = \mathcal{E}_{\theta,0}^{\bar\theta} + \sum_{h=1}^H\mathbb{E}_{\bar\theta,\pi}\big[\mathcal{E}_{\theta,h}^{\bar\theta}(\tau_{h-1})\big],$$
where the last inequality is due to the definitions of $\mathbb{E}_{\bar\theta,\pi}$ and $\mathcal{E}_{\theta,h}^{\bar\theta}(\tau_{h-1})$.
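The telescoping decomposition used in the proof of Proposition F.1 above is an instance of the generic identity $\prod_hA_h-\prod_hB_h = \sum_hA_{H:h+1}(A_h-B_h)B_{h-1:1}$ for products of linear maps. A minimal numerical check on random matrices (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
H, dim = 5, 3
A = [rng.standard_normal((dim, dim)) for _ in range(H)]
B = [rng.standard_normal((dim, dim)) for _ in range(H)]

def prod(ms):
    out = np.eye(dim)
    for m in ms:     # apply the first map first: out = M_H ... M_1
        out = m @ out
    return out

lhs = prod(A) - prod(B)
# Telescoping: each summand swaps A_h for B_h once, with A's to the left and B's to the right
rhs = sum(prod(A[h+1:]) @ (A[h] - B[h]) @ prod(B[:h]) for h in range(H))
assert np.allclose(lhs, rhs)
```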
F.2 BOUNDING THE SQUARED B-ERRORS BY HELLINGER DISTANCE

In the following proposition, we show that under B-stability or weak B-stability, the squared B-errors can be bounded by the Hellinger distance between $\mathbb{P}_\theta^{\pi_{h,\exp}}$ and $\mathbb{P}_{\bar\theta}^{\pi_{h,\exp}}$. Here, for a policy $\pi\in\Pi$ and $h\in[H]$, $\pi_{h,\exp}$ is defined as
$$\pi_{h,\exp} := \pi\circ_h\mathrm{Unif}(\mathcal{A})\circ_{h+1}\mathrm{Unif}(\mathcal{U}_{A,h+1}), \qquad (34)$$
which is the policy that follows $\pi$ for the first $h-1$ steps, takes an action from $\mathrm{Unif}(\mathcal{A})$ at step $h$, takes an action sequence sampled from $\mathrm{Unif}(\mathcal{U}_{A,h+1})$ beginning at step $h+1$, and behaves arbitrarily afterwards. This notation is consistent with the exploration policy in the OMLE algorithm (Algorithm 1).

Proposition F.2 (Bounding squared B-errors by squared Hellinger distance). Suppose that the B-representation of $\theta$ is $\Lambda_{\mathsf{B}}$-stable (cf. Definition 4) or weakly $\Lambda_{\mathsf{B}}$-stable (cf. Definition D.4). Then we have, for $h\in[H-1]$,
$$\mathbb{E}_{\bar\theta,\pi}\big[\mathcal{E}_{\theta,h}^{\bar\theta}(\tau_{h-1})^2\big] \le 4\Lambda_{\mathsf{B}}^2AU_A\Big(D_{\rm H}^2\big(\mathbb{P}_\theta^{\pi_{h-1,\exp}},\mathbb{P}_{\bar\theta}^{\pi_{h-1,\exp}}\big)+D_{\rm H}^2\big(\mathbb{P}_\theta^{\pi_{h,\exp}},\mathbb{P}_{\bar\theta}^{\pi_{h,\exp}}\big)\Big),$$
together with the boundary case $\big(\mathcal{E}_{\theta,0}^{\bar\theta}\big)^2\le4\Lambda_{\mathsf{B}}^2U_A\,D_{\rm H}^2\big(\mathbb{P}_\theta^{\pi_{0,\exp}},\mathbb{P}_{\bar\theta}^{\pi_{0,\exp}}\big)$ and an analogous bound at $h=H$, where $\mathcal{E}_{\theta,h}^{\bar\theta}(\tau_{h-1})$ and $\mathcal{E}_{\theta,0}^{\bar\theta}$ are as defined in Proposition F.1.

Proof of Proposition F.2. We first deal with the case $h\in[H]$. By the triangle inequality, we have
$$2\mathcal{E}_{\theta,h}^{\bar\theta}(\tau_{h-1}) = \max_\pi\sum_{\tau_{h:H}}\pi(\tau_{h:H})\times\big|\mathbf{B}_{H:h+1}^\theta(\tau_{h+1:H})\big(\mathbf{B}_h^\theta(o_h,a_h)-\mathbf{B}_h^{\bar\theta}(o_h,a_h)\big)\mathbf{q}^{\bar\theta}(\tau_{h-1})\big|$$
$$\le \max_\pi\sum_{\tau_{h:H}}\pi(\tau_{h:H})\times\big|\mathbf{B}_{H:h}^\theta(\tau_{h:H})\big(\mathbf{q}^\theta(\tau_{h-1})-\mathbf{q}^{\bar\theta}(\tau_{h-1})\big)\big| + \max_\pi\sum_{\tau_{h:H}}\pi(\tau_{h:H})\times\big|\mathbf{B}_{H:h+1}^\theta(\tau_{h+1:H})\big(\mathbf{B}_h^\theta(o_h,a_h)\mathbf{q}^\theta(\tau_{h-1})-\mathbf{B}_h^{\bar\theta}(o_h,a_h)\mathbf{q}^{\bar\theta}(\tau_{h-1})\big)\big|$$
$$= \Big\|\mathbf{B}_{H:h}^\theta\big(\mathbf{q}^\theta(\tau_{h-1})-\mathbf{q}^{\bar\theta}(\tau_{h-1})\big)\Big\|_\Pi + \max_{\pi_h}\sum_{o_h,a_h}\pi_h(a_h|o_h)\,\Big\|\mathbf{B}_{H:h+1}^\theta\big(\mathbf{B}_h^\theta(o_h,a_h)\mathbf{q}^\theta(\tau_{h-1})-\mathbf{B}_h^{\bar\theta}(o_h,a_h)\mathbf{q}^{\bar\theta}(\tau_{h-1})\big)\Big\|_\Pi.$$
We now introduce several notations for the convenience of the proof.

1. For an action sequence $\mathbf{a}$ of length $\ell(\mathbf{a})$, $\mathbb{P}(\cdot|\tau_{h-1},\mathrm{do}(\mathbf{a}))$ stands for the distribution of $o_{h:h+\ell(\mathbf{a})}$ conditional on $\tau_{h-1}$ and taking the actions $\mathbf{a}$ from step $h$ to step $h+\ell(\mathbf{a})-1$.
2. Given a set $\mathscr{A}$ of action sequences (possibly of different lengths), $\mathbb{P}^{\mathrm{Unif}(\mathscr{A})}(\cdot|\tau_{h-1})$ stands for the distribution of the observations generated as follows: conditional on $\tau_{h-1}$, first sample $\mathbf{a}\sim\mathrm{Unif}(\mathscr{A})$, then take $\mathbf{a}$ and observe $\mathbf{o}$ (of length $\ell(\mathbf{a})+1$).

By the definition of Hellinger distances and the notations above, we have
$$D_{\rm H}^2\Big(\mathbb{P}_\theta^{\mathrm{Unif}(\mathscr{A})}(\cdot|\tau_{h-1}),\mathbb{P}_{\bar\theta}^{\mathrm{Unif}(\mathscr{A})}(\cdot|\tau_{h-1})\Big) = \frac{1}{|\mathscr{A}|}\sum_{\mathbf{a}\in\mathscr{A}}D_{\rm H}^2\big(\mathbb{P}_\theta(\cdot|\tau_{h-1},\mathrm{do}(\mathbf{a})),\mathbb{P}_{\bar\theta}(\cdot|\tau_{h-1},\mathrm{do}(\mathbf{a}))\big).$$
Next, we present two lemmas whose proofs will be deferred until after the proof of the proposition.

Lemma F.3. Suppose that $\mathbf{B}$ is weakly $\Lambda_{\mathsf{B}}$-stable ($\Lambda_{\mathsf{B}}$-stability is a sufficient condition). Then it holds that
$$\Big\|\mathbf{B}_{H:h}^\theta\big(\mathbf{q}^\theta(\tau_{h-1})-\mathbf{q}^{\bar\theta}(\tau_{h-1})\big)\Big\|_\Pi \le 2\Lambda_{\mathsf{B}}\sqrt{|\mathcal{U}_{A,h}|}\,D_{\rm H}\Big(\mathbb{P}_\theta^{\mathrm{Unif}(\mathcal{U}_{A,h})}(\cdot|\tau_{h-1}),\mathbb{P}_{\bar\theta}^{\mathrm{Unif}(\mathcal{U}_{A,h})}(\cdot|\tau_{h-1})\Big).$$

Lemma F.4. Suppose that $\mathbf{B}$ is weakly $\Lambda_{\mathsf{B}}$-stable ($\Lambda_{\mathsf{B}}$-stability is a sufficient condition). Then it holds that
$$\max_{\pi_h}\sum_{o_h,a_h}\pi_h(a_h|o_h)\,\Big\|\mathbf{B}_{H:h+1}^\theta\big(\mathbf{B}_h^\theta(o_h,a_h)\mathbf{q}^\theta(\tau_{h-1})-\mathbf{B}_h^{\bar\theta}(o_h,a_h)\mathbf{q}^{\bar\theta}(\tau_{h-1})\big)\Big\|_\Pi \le 2\Lambda_{\mathsf{B}}\sqrt{A|\mathcal{U}_{A,h+1}|}\,D_{\rm H}\Big(\mathbb{P}_\theta^{\mathrm{Unif}(\mathcal{A})\circ\mathrm{Unif}(\mathcal{U}_{A,h+1})}(\cdot|\tau_{h-1}),\mathbb{P}_{\bar\theta}^{\mathrm{Unif}(\mathcal{A})\circ\mathrm{Unif}(\mathcal{U}_{A,h+1})}(\cdot|\tau_{h-1})\Big).$$

We first consider the case $h\in[H-1]$. Applying Lemma F.3 and taking the expectation with respect to $\tau_{h-1}$, we obtain
$$\mathbb{E}_{\bar\theta,\pi}\Big[\big\|\mathbf{B}_{H:h}^\theta\big(\mathbf{q}^\theta(\tau_{h-1})-\mathbf{q}^{\bar\theta}(\tau_{h-1})\big)\big\|_\Pi^2\Big] \le 4\Lambda_{\mathsf{B}}^2|\mathcal{U}_{A,h}|\,\mathbb{E}_{\bar\theta,\pi}\Big[D_{\rm H}^2\Big(\mathbb{P}_\theta^{\mathrm{Unif}(\mathcal{U}_{A,h})}(\cdot|\tau_{h-1}),\mathbb{P}_{\bar\theta}^{\mathrm{Unif}(\mathcal{U}_{A,h})}(\cdot|\tau_{h-1})\Big)\Big]$$
$$\le 8\Lambda_{\mathsf{B}}^2|\mathcal{U}_{A,h}|\,D_{\rm H}^2\Big(\mathbb{P}_\theta^{\pi\circ_h\mathrm{Unif}(\mathcal{U}_{A,h})},\mathbb{P}_{\bar\theta}^{\pi\circ_h\mathrm{Unif}(\mathcal{U}_{A,h})}\Big) \le 8\Lambda_{\mathsf{B}}^2A|\mathcal{U}_{A,h}|\,D_{\rm H}^2\big(\mathbb{P}_\theta^{\pi_{h-1,\exp}},\mathbb{P}_{\bar\theta}^{\pi_{h-1,\exp}}\big),$$
where the second inequality is due to Lemma C.1, and the last inequality is due to importance sampling. Similarly, applying Lemma F.4 and taking the expectation with respect to $\tau_{h-1}$, we have
$$\mathbb{E}_{\bar\theta,\pi}\Bigg[\Big(\max_{\pi_h}\sum_{o_h,a_h}\pi_h(a_h|o_h)\,\big\|\mathbf{B}_{H:h+1}^\theta\big(\mathbf{B}_h^\theta(o_h,a_h)\mathbf{q}^\theta(\tau_{h-1})-\mathbf{B}_h^{\bar\theta}(o_h,a_h)\mathbf{q}^{\bar\theta}(\tau_{h-1})\big)\big\|_\Pi\Big)^2\Bigg]$$
$$\le 8\Lambda_{\mathsf{B}}^2A|\mathcal{U}_{A,h+1}|\,D_{\rm H}^2\Big(\mathbb{P}_\theta^{\pi\circ_h\mathrm{Unif}(\mathcal{A})\circ\mathrm{Unif}(\mathcal{U}_{A,h+1})},\mathbb{P}_{\bar\theta}^{\pi\circ_h\mathrm{Unif}(\mathcal{A})\circ\mathrm{Unif}(\mathcal{U}_{A,h+1})}\Big) = 8\Lambda_{\mathsf{B}}^2A|\mathcal{U}_{A,h+1}|\,D_{\rm H}^2\big(\mathbb{P}_\theta^{\pi_{h,\exp}},\mathbb{P}_{\bar\theta}^{\pi_{h,\exp}}\big).$$
The proof for $h\in[H-1]$ is completed by noting that $(x+y)^2\le2x^2+2y^2$ and $U_A = \max_h|\mathcal{U}_{A,h}|$. For $h=H$, Lemma F.3 gives
$$\Big\|\mathbf{B}_{H:H}^\theta\big(\mathbf{q}^\theta(\tau_{H-1})-\mathbf{q}^{\bar\theta}(\tau_{H-1})\big)\Big\|_\Pi \le 2\Lambda_{\mathsf{B}}\sqrt{|\mathcal{U}_{A,H}|}\,D_{\rm H}\Big(\mathbb{P}_\theta^{\mathrm{Unif}(\mathcal{U}_{A,H})}(\cdot|\tau_{H-1}),\mathbb{P}_{\bar\theta}^{\mathrm{Unif}(\mathcal{U}_{A,H})}(\cdot|\tau_{H-1})\Big) = 2\Lambda_{\mathsf{B}}\,D_{\rm H}\big(\mathbb{P}_\theta(\cdot|\tau_{H-1}),\mathbb{P}_{\bar\theta}(\cdot|\tau_{H-1})\big),$$
where the equality is because $\mathcal{U}_{A,H}$ contains only the null action sequence. Therefore, $\mathcal{E}_{\theta,H}^{\bar\theta}(\tau_{H-1})\le(\Lambda_{\mathsf{B}}+1)\,D_{\rm H}\big(\mathbb{P}_\theta^{\pi_{H-1,\exp}},\mathbb{P}_{\bar\theta}^{\pi_{H-1,\exp}}\big)$, and applying Lemma C.1 completes the proof of the case $h=H$. The case $h=0$ is directly implied by Lemma F.3:
$$\Big\|\mathbf{B}_{H:1}^\theta\big(\mathbf{q}_0^\theta-\mathbf{q}_0^{\bar\theta}\big)\Big\|_\Pi^2 \le 4\Lambda_{\mathsf{B}}^2|\mathcal{U}_{A,1}|\,D_{\rm H}^2\Big(\mathbb{P}_\theta^{\pi\circ_1\mathrm{Unif}(\mathcal{U}_{A,1})},\mathbb{P}_{\bar\theta}^{\pi\circ_1\mathrm{Unif}(\mathcal{U}_{A,1})}\Big) = 4\Lambda_{\mathsf{B}}^2|\mathcal{U}_{A,1}|\,D_{\rm H}^2\big(\mathbb{P}_\theta^{\pi_{0,\exp}},\mathbb{P}_{\bar\theta}^{\pi_{0,\exp}}\big).$$
Combining all these cases finishes the proof of Proposition F.2.

We next prove Lemmas F.3 and F.4, which were used in the proof of Proposition F.2.

Proof of Lemma F.3. By weak B-stability as in Definition D.4 (B-stability is also sufficient; see Eq. (19)), we have
$$\Big\|\mathbf{B}_{H:h}^\theta\big(\mathbf{q}^\theta(\tau_{h-1})-\mathbf{q}^{\bar\theta}(\tau_{h-1})\big)\Big\|_\Pi^2 \le 2\Lambda_{\mathsf{B}}^2\Big(\big\|\mathbf{q}^\theta(\tau_{h-1})\big\|_\Pi+\big\|\mathbf{q}^{\bar\theta}(\tau_{h-1})\big\|_\Pi\Big)\Big\|\sqrt{\mathbf{q}^\theta(\tau_{h-1})}-\sqrt{\mathbf{q}^{\bar\theta}(\tau_{h-1})}\Big\|_2^2,$$
where $\|\cdot\|_\Pi$ is defined in Definition D.3. By the definition of $\mathbf{q}^\theta(\tau_{h-1})$, for $t_h = (\mathbf{o},\mathbf{a})\in\mathcal{U}_h$ we have $\mathbf{q}^\theta(\tau_{h-1})(\mathbf{o},\mathbf{a}) = \mathbb{P}_\theta(t_h|\tau_{h-1}) = \mathbb{P}_\theta(o_{h:h+\ell(\mathbf{a})} = \mathbf{o}\,|\,\tau_{h-1},\mathrm{do}(\mathbf{a}))$. Hence, we have
$$\big\|\mathbf{q}^\theta(\tau_{h-1})\big\|_\Pi = \max_{\mathcal{T}'\subset\mathcal{U}_h}\max_\pi\sum_{(\mathbf{o},\mathbf{a})\in\mathcal{T}'}\pi(\mathbf{o},\mathbf{a})\times\mathbb{P}_\theta(o_{h:h+\ell(\mathbf{a})} = \mathbf{o}\,|\,\tau_{h-1},\mathrm{do}(\mathbf{a})) = \max_{\mathcal{T}'\subset\mathcal{U}_h}\max_\pi\mathbb{P}_\theta^\pi(\mathcal{T}'|\tau_{h-1}) \le 1,$$
where $\mathbb{P}_\theta^\pi(\mathcal{T}'|\tau_{h-1})$ stands for the probability that some test in $\mathcal{T}'$ is observed under $\theta,\pi$, conditional on $\tau_{h-1}$. Similarly, we have $\big\|\mathbf{q}^{\bar\theta}(\tau_{h-1})\big\|_\Pi\le1$.
Therefore, we have
$$\frac{1}{4}\Lambda_{\mathsf{B}}^{-2}\Big\|\mathbf{B}_{H:h}^\theta\big(\mathbf{q}^\theta(\tau_{h-1})-\mathbf{q}^{\bar\theta}(\tau_{h-1})\big)\Big\|_\Pi^2 \le \Big\|\sqrt{\mathbf{q}^\theta(\tau_{h-1})}-\sqrt{\mathbf{q}^{\bar\theta}(\tau_{h-1})}\Big\|_2^2 \overset{(i)}{\le} \sum_{\mathbf{a}\in\mathcal{U}_{A,h}}\sum_{\mathbf{o}}\Big|\Big[\sqrt{\mathbb{P}_\theta}-\sqrt{\mathbb{P}_{\bar\theta}}\Big](\mathbf{o}|\tau_{h-1},\mathrm{do}(\mathbf{a}))\Big|^2 \overset{(ii),(iii)}{=} |\mathcal{U}_{A,h}|\,D_{\rm H}^2\Big(\mathbb{P}_\theta^{\mathrm{Unif}(\mathcal{U}_{A,h})}(\cdot|\tau_{h-1}),\mathbb{P}_{\bar\theta}^{\mathrm{Unif}(\mathcal{U}_{A,h})}(\cdot|\tau_{h-1})\Big),$$
where in $(i)$ we also include in the summation those $\mathbf{o}$ such that $(\mathbf{o},\mathbf{a})$ may not belong to $\mathcal{U}_h$, $(ii)$ is due to the definition of $\mathbb{P}(\cdot|\tau_{h-1},\mathrm{do}(\mathbf{a}))$, and $(iii)$ follows from importance sampling (35). This completes the proof of Lemma F.3.

Proof of Lemma F.4. Similar to the proof of Lemma F.3, we only need to work under the weak B-stability condition. By Corollary D.2, for $t_{h+1} = (\mathbf{o},\mathbf{a})\in\mathcal{U}_{h+1}$, it holds that
$$\big[\mathbf{B}_h^\theta(o,a)\mathbf{q}^\theta(\tau_{h-1})\big](\mathbf{o},\mathbf{a}) = \mathbb{P}_\theta(t_{h+1}|\tau_{h-1},o,a)\times\mathbb{P}_\theta(o|\tau_{h-1}) = \mathbb{P}_\theta(o,a,t_{h+1}|\tau_{h-1}),$$
and hence
$$\big\|\mathbf{B}_h^\theta(o,a)\mathbf{q}^\theta(\tau_{h-1})\big\|_\Pi = \max_{\mathcal{T}'\subset\mathcal{U}_{h+1}}\max_\pi\sum_{t_{h+1}\in\mathcal{T}'}\pi(t_{h+1})\times\mathbb{P}_\theta(t_{h+1}|\tau_{h-1},o,a)\times\mathbb{P}_\theta(o|\tau_{h-1}) = \max_{\mathcal{T}'\subset\mathcal{U}_{h+1}}\max_\pi\mathbb{P}_\theta^\pi(\mathcal{T}'|\tau_{h-1},o,a)\times\mathbb{P}_\theta(o|\tau_{h-1}) \le \mathbb{P}_\theta(o|\tau_{h-1}),$$
where $\mathbb{P}_\theta^\pi(\mathcal{T}'|\tau_{h-1},o,a)$ stands for the probability that some test in $\mathcal{T}'$ is observed under $\theta,\pi$, conditional on observing $\tau_h = (\tau_{h-1},o,a)$. Similarly, we have $\big\|\mathbf{B}_h^{\bar\theta}(o,a)\mathbf{q}^{\bar\theta}(\tau_{h-1})\big\|_\Pi\le\mathbb{P}_{\bar\theta}(o|\tau_{h-1})$. Therefore, by weak B-stability as in Definition D.4 combined with the inequalities above, it holds that
$$\Big\|\mathbf{B}_{H:h+1}^\theta\big(\mathbf{B}_h^\theta(o,a)\mathbf{q}^\theta(\tau_{h-1})-\mathbf{B}_h^{\bar\theta}(o,a)\mathbf{q}^{\bar\theta}(\tau_{h-1})\big)\Big\|_\Pi \le \Lambda_{\mathsf{B}}\sqrt{2\,[\mathbb{P}_\theta+\mathbb{P}_{\bar\theta}](o_h=o|\tau_{h-1})}\cdot\Big[\sum_{t\in\mathcal{U}_{h+1}}\Big|\Big[\sqrt{\mathbb{P}_\theta}-\sqrt{\mathbb{P}_{\bar\theta}}\Big](o,a,t|\tau_{h-1})\Big|^2\Big]^{1/2}.$$

G PROOF OF THEOREM 9

Proof of Theorem G.1. Step 1. By Proposition G.2, it holds that $\theta^\star\in\Theta^k$. Therefore $V_{\theta^k}(\pi^k)\ge V^\star$, and by Proposition F.1 we have
$$\sum_{t=1}^k\big(V^\star-V_{\theta^\star}(\pi^t)\big) \le \sum_{t=1}^k\big(V_{\theta^t}(\pi^t)-V_{\theta^\star}(\pi^t)\big) \le \sum_{t=1}^kD_{\rm TV}\big(\mathbb{P}_{\theta^t}^{\pi^t},\mathbb{P}_{\theta^\star}^{\pi^t}\big) \le \sum_{t=1}^k1\wedge\Big(\mathcal{E}_{t,0}^\star+\sum_{h=1}^H\mathbb{E}_{\pi^t}\big[\mathcal{E}_{t,h}^\star(\tau_{h-1})\big]\Big) \le \sum_{t=1}^k\Big(1\wedge\mathcal{E}_{t,0}^\star+\sum_{h=1}^H1\wedge\mathbb{E}_{\pi^t}\big[\mathcal{E}_{t,h}^\star(\tau_{h-1})\big]\Big). \qquad (38)$$
On the other hand, by Proposition F.2, we have
$$\big(\mathcal{E}_{t,0}^\star\big)^2 + \sum_{h=1}^H\mathbb{E}_{\pi^t}\big[\mathcal{E}_{t,h}^\star(\tau_{h-1})^2\big] \le 12\Lambda_{\mathsf{B}}^2AU_A\sum_{h=0}^{H-1}D_{\rm H}^2\Big(\mathbb{P}_{\theta^t}^{\pi^t_{h,\exp}},\mathbb{P}_{\theta^\star}^{\pi^t_{h,\exp}}\Big).$$
Furthermore, by Proposition G.2, we have
$$\sum_{t=1}^{k-1}\sum_{h=0}^{H-1}D_{\rm H}^2\Big(\mathbb{P}_{\theta^k}^{\pi^t_{h,\exp}},\mathbb{P}_{\theta^\star}^{\pi^t_{h,\exp}}\Big) \le 2\beta.$$
Therefore, defining $\beta_{k,h} := \sum_{t<k}\mathbb{E}_{\pi^t}[\mathcal{E}_{k,h}^\star(\tau_{h-1})^2]$, combining the two displays above gives
$$\sum_{h=0}^H\beta_{k,h} = \sum_{h=0}^H\sum_{t<k}\mathbb{E}_{\pi^t}\big[\mathcal{E}_{k,h}^\star(\tau_{h-1})^2\big] \le 24\Lambda_{\mathsf{B}}^2AU_A\beta, \qquad \forall k\in[K]. \qquad (39)$$

Step 2. We bridge the performance decomposition (38) and the squared B-error bound (39) using the generalized $\ell_2$-Eluder argument, considering separately the cases $h=0$ and $h\in[H]$.

Case 1: $h=0$. This case follows directly from the Cauchy–Schwarz inequality:
$$\sum_{t=1}^k1\wedge\mathcal{E}_{t,0}^\star \le \Big(k\sum_{t=1}^k1\wedge\big(\mathcal{E}_{t,0}^\star\big)^2\Big)^{1/2} \le \sqrt{k(\beta_{k,0}+1)}. \qquad (40)$$

Case 2: $h\in[H]$. We invoke the generalized $\ell_2$-Eluder argument (more precisely, its corollary) from Appendix E.1, restated here for convenience.

Corollary E.2. Suppose we have a sequence of functions $\{f_k:\mathbb{R}^n\to\mathbb{R}\}_{k\in[K]}$, $f_k(x) := \max_{r\in\mathcal{R}}\sum_{j=1}^J|\langle x,y_{k,j,r}\rangle|$, given by the family of vectors $\{y_{k,j,r}\}_{(k,j,r)\in[K]\times[J]\times\mathcal{R}}\subset\mathbb{R}^n$. Further assume that there exists $L>0$ such that $f_k(x)\le L\|x\|_1$. Consider further a sequence of vectors $(x_i)_{i\in\mathcal{I}}$ satisfying $\sum_{t=1}^{k-1}\mathbb{E}_{i\sim q_t}[f_k^2(x_i)]\le\beta_k$ for all $k\in[K]$, such that the subspace spanned by $(x_i)_{i\in\mathcal{I}}$ has dimension at most $d$. Then it holds that
$$\sum_{t=1}^k1\wedge\mathbb{E}_{i\sim q_t}[f_t(x_i)] \le \sqrt{4d\Big(k+\sum_{t=1}^k\beta_t\Big)\log\Big(1+kdL\max_i\|x_i\|_1\Big)}, \qquad \forall k\in[K].$$
We make the following three preparations in order to apply Corollary E.2.

1. Recall the definition of $\mathcal{E}_{t,h}^\star(\tau_{h-1})$ from Proposition F.1 (in short, $\mathcal{E}_{k,h}^\star(\tau_{h-1}) := \mathcal{E}_{\theta^k,h}^{\theta^\star}(\tau_{h-1})$):
$$\mathcal{E}_{k,h}^\star(\tau_{h-1}) = \frac{1}{2}\max_\pi\sum_{\tau_{h:H}}\pi(\tau_{h:H})\times\Big|\mathbf{B}_{H:h+1}^k(\tau_{h+1:H})\big(\mathbf{B}_h^k(o_h,a_h)-\mathbf{B}_h^\star(o_h,a_h)\big)\mathbf{q}^\star(\tau_{h-1})\Big|,$$
where we replace the superscript $\theta^k$ of $\mathbf{B}$ by $k$ for simplicity.
Let us define
$$y_{k,j,\pi} := \frac{1}{2}\pi(\tau^j_{h:H})\times\Big[\mathbf{B}_{H:h+1}^k(\tau^j_{h+1:H})\big(\mathbf{B}_h^k(o_h^j,a_h^j)-\mathbf{B}_h^\star(o_h^j,a_h^j)\big)\Big]^\top\in\mathbb{R}^{|\mathcal{U}_h|},$$
where $\{\tau^j_{h:H} = (o_h,a_h,\dots,o_H,a_H)\}_{j=1}^n$ is an enumeration of all possible $\tau_{h:H}$ (and hence $n = (OA)^{H-h+1}$), and $\pi$ is any policy that starts at step $h$. We then define $f_k(x) = \max_\pi\sum_j|\langle y_{k,j,\pi},x\rangle|$ for $x\in\mathbb{R}^{\mathcal{U}_h}$. It follows from the definitions that $\mathcal{E}_{k,h}^\star(\tau_{h-1}) = f_k(\mathbf{q}^\star(\tau_{h-1}))$.

2. We define $x_i = \mathbf{q}^\star(\tau^i_{h-1})\in\mathbb{R}^{|\mathcal{U}_h|}$, where $\{\tau^i_{h-1}\}_i$ is an enumeration of all possible $\tau_{h-1}\in(\mathcal{O}\times\mathcal{A})^{h-1}$. Then, by the assumption that $\theta^\star$ has PSR rank at most $d$, we have $\dim\mathrm{span}(x_i: i\in\mathcal{I})\le d$. Furthermore, we have $\|x_i\|_1\le U_A$ by definition.

3. It remains to verify that $f_k$ is Lipschitz with respect to the $1$-norm. We only need to verify this under the weak $\Lambda_{\mathsf{B}}$-stability condition. We have
$$f_k(\mathbf{q}) \le \frac{1}{2}\Big[\big\|\mathbf{B}_{H:h}^k\mathbf{q}\big\|_\Pi + \max_\pi\sum_{o,a}\pi(a|o)\big\|\mathbf{B}_{H:h+1}^k\mathbf{B}_h^\star(o,a)\mathbf{q}\big\|_\Pi\Big] \le 2\Lambda_{\mathsf{B}}\|\mathbf{q}\|_1 + 2\Lambda_{\mathsf{B}}\max_\pi\sum_{o,a}\pi(a|o)\,\|\mathbf{B}_h^\star(o,a)\mathbf{q}\|_1 \le 2\Lambda_{\mathsf{B}}\|\mathbf{q}\|_1 + 2\Lambda_{\mathsf{B}}\sum_{o,a}\|\mathbf{B}_h^\star(o,a)\|_1\|\mathbf{q}\|_1 \le 2\Lambda_{\mathsf{B}}(R_{\mathsf{B}}+1)\|\mathbf{q}\|_1,$$
where the first inequality follows the same argument as (35), the second inequality is due to B-stability (or weak B-stability and (21)), and the last inequality is due to the definition of $R_{\mathsf{B}}$. Hence we can take $L = 2\Lambda_{\mathsf{B}}(R_{\mathsf{B}}+1)$ to ensure that $f_k(x)\le L\|x\|_1$.

Therefore, applying Corollary E.2 yields
$$\sum_{t=1}^k1\wedge\mathbb{E}_{\pi^t}\big[\mathcal{E}_{t,h}^\star(\tau_{h-1})\big] \le \sqrt{4\iota\Big(kd+d\sum_{t=1}^k\beta_{t,h}\Big)}, \qquad (41)$$
where $\iota = \log(1+2kdU_A\Lambda_{\mathsf{B}}(R_{\mathsf{B}}+1))$. This completes Case 2.

Combining these two cases, we obtain
$$\sum_{t=1}^k\big(V^\star-V_{\theta^\star}(\pi^t)\big) \overset{(i)}{\le} \sum_{t=1}^k1\wedge\mathcal{E}_{t,0}^\star + \sum_{h=1}^H\sum_{t=1}^k1\wedge\mathbb{E}_{\pi^t}\big[\mathcal{E}_{t,h}^\star(\tau_{h-1})\big] \overset{(ii)}{\le} \sqrt{k(\beta_{k,0}+1)} + 2\sqrt{\iota}\sum_{h=1}^H\Big(kd+d\sum_{t=1}^k\beta_{t,h}\Big)^{1/2} \le \sqrt{(4H\iota+1)\Big(k(Hd+1)+d\sum_{t=1}^k\sum_{h=0}^H\beta_{t,h}\Big)} \overset{(iii)}{=} \mathcal{O}\Big(\sqrt{\Lambda_{\mathsf{B}}^2dAU_AH\cdot k\beta\iota}\Big),$$
where $(i)$ used (38), $(ii)$ used the two cases (40) and (41), and $(iii)$ used (39). As a consequence, whenever $k\ge\mathcal{O}(\Lambda_{\mathsf{B}}^2dAU_AH\cdot\beta\iota/\varepsilon^2)$, we have $\frac{1}{k}\sum_{t=1}^k\big(V^\star-V_{\theta^\star}(\pi^t)\big)\le\varepsilon$.
This completes the proof of Theorem G.1 (and hence Theorem 9).

…which gives Proposition G.2(2). In the following, we establish (42). Let us fix a $1/T$-optimistic covering $(\tilde{\mathbb{P}},\Theta_0)$ of $\Theta$, with $n := |\Theta_0| = N_\Theta(1/T)$, and label $(\tilde{\mathbb{P}}_{\theta_0})_{\theta_0\in\Theta_0}$ as $\tilde{\mathbb{P}}_1,\dots,\tilde{\mathbb{P}}_n$. By the definition of optimistic covering, it is clear that for any $\theta\in\Theta$ there exists $i\in[n]$ such that, for all $\pi,\tau$, it holds that $\tilde{\mathbb{P}}_i^\pi(\tau)\ge\mathbb{P}_\theta^\pi(\tau)$ and $\|\tilde{\mathbb{P}}_i^\pi(\cdot)-\mathbb{P}_\theta^\pi(\cdot)\|_1\le1/T^2$. We say that $\theta$ is covered by this $i\in[n]$. Then, we consider
$$\ell_i^t = \log\frac{\mathbb{P}_{\theta^\star}^{\pi^t}(\tau^t)}{\tilde{\mathbb{P}}_i^{\pi^t}(\tau^t)}, \qquad t\in[T],\ i\in[n].$$
By Lemma C.3, the following holds with probability at least $1-\delta$: for all $t\in[T]$, $i\in[n]$,
$$\frac{1}{2}\sum_{s=1}^{t-1}\ell_i^s + \log(n/\delta) \ge \sum_{s=1}^{t-1}-\log\mathbb{E}_s\Big[\exp\Big(-\frac{1}{2}\ell_i^s\Big)\Big],$$
where $\mathbb{E}_s$ denotes the conditional expectation over all randomness after $\pi^s$ has been determined. By definition,
$$\mathbb{E}_t\Big[\exp\Big(-\frac{1}{2}\ell_i^t\Big)\Big] = \mathbb{E}_t\Bigg[\sqrt{\frac{\tilde{\mathbb{P}}_i^{\pi^t}(\tau^t)}{\mathbb{P}_{\theta^\star}^{\pi^t}(\tau^t)}}\Bigg] = \mathbb{E}_{\tau\sim\pi^t}\Bigg[\sqrt{\frac{\tilde{\mathbb{P}}_i^{\pi^t}(\tau)}{\mathbb{P}_{\theta^\star}^{\pi^t}(\tau)}}\Bigg] = \sum_\tau\sqrt{\mathbb{P}_{\theta^\star}^{\pi^t}(\tau)\,\tilde{\mathbb{P}}_i^{\pi^t}(\tau)}.$$
Therefore, for any $\theta\in\Theta$ that is covered by $i\in[n]$, we have
$$-\log\mathbb{E}_t\Big[\exp\Big(-\frac{1}{2}\ell_i^t\Big)\Big] \ge 1-\sum_\tau\sqrt{\mathbb{P}_{\theta^\star}^{\pi^t}(\tau)\,\tilde{\mathbb{P}}_i^{\pi^t}(\tau)} = 1-\sum_\tau\sqrt{\mathbb{P}_{\theta^\star}^{\pi^t}(\tau)\,\mathbb{P}_\theta^{\pi^t}(\tau)} - \sum_\tau\sqrt{\mathbb{P}_{\theta^\star}^{\pi^t}(\tau)}\Big(\sqrt{\tilde{\mathbb{P}}_i^{\pi^t}(\tau)}-\sqrt{\mathbb{P}_\theta^{\pi^t}(\tau)}\Big)$$
$$\ge \frac{1}{2}D_{\rm H}^2\big(\mathbb{P}_\theta^{\pi^t},\mathbb{P}_{\theta^\star}^{\pi^t}\big) - \Big(\sum_\tau\Big|\sqrt{\tilde{\mathbb{P}}_i^{\pi^t}(\tau)}-\sqrt{\mathbb{P}_\theta^{\pi^t}(\tau)}\Big|^2\Big)^{1/2} \ge \frac{1}{2}D_{\rm H}^2\big(\mathbb{P}_\theta^{\pi^t},\mathbb{P}_{\theta^\star}^{\pi^t}\big) - \big\|\tilde{\mathbb{P}}_i^{\pi^t}(\cdot)-\mathbb{P}_\theta^{\pi^t}(\cdot)\big\|_1^{1/2} \ge \frac{1}{2}D_{\rm H}^2\big(\mathbb{P}_\theta^{\pi^t},\mathbb{P}_{\theta^\star}^{\pi^t}\big) - \frac{1}{T},$$
where the first inequality is due to $-\log x\ge1-x$; in the second inequality we use the definition of the Hellinger distance and the Cauchy–Schwarz inequality; the third inequality is because $(\sqrt{x}-\sqrt{y})^2\le|x-y|$ for all $x,y\ge0$; and the last inequality is due to our assumption that $\theta$ is covered by $i$.

Notice that every $\theta\in\Theta$ is covered by some $i\in[n]$, and for such $i$, $\sum_{s=1}^{t-1}\ell_i^s\le L_t(\theta^\star)-L_t(\theta)$; therefore, it holds with probability at least $1-\delta$ that, for all $\theta\in\Theta$ and $t\in[T]$,
$$\frac{1}{2}\big(L_t(\theta^\star)-L_t(\theta)\big) + \log(n/\delta) + \frac{t-1}{T} \ge \frac{1}{2}\sum_{s=1}^{t-1}D_{\rm H}^2\big(\mathbb{P}_\theta^{\pi^s},\mathbb{P}_{\theta^\star}^{\pi^s}\big).$$
Plugging in $n = N_\Theta(1/T)$ and scaling the above inequality by $2$ gives (42).

Algorithm 2 EXPLORATIVE E2D (Chen et al., 2022)
Input: Model class $\Theta$, parameters $\gamma>0$, $\eta\in(0,1/2)$; a $1/T$-optimistic cover $(\tilde{\mathbb{P}},\Theta_0)$.
1: Initialize $\mu^1 = \mathrm{Unif}(\Theta_0)$.
2: for $t = 1,\dots,T$ do
3: Set $(p_{\exp}^t,p_{\rm out}^t) = \arg\min_{(p_{\exp},p_{\rm out})\in\Delta(\Pi)^2}\hat{V}_\gamma^{\mu^t}(p_{\exp},p_{\rm out})$, where $\hat{V}_\gamma^{\mu^t}$ is defined by
$$\hat{V}_\gamma^{\mu^t}(p_{\exp},p_{\rm out}) := \sup_{\theta\in\Theta}\mathbb{E}_{\pi\sim p_{\rm out}}\big[V_\theta(\pi_\theta)-V_\theta(\pi)\big] - \gamma\,\mathbb{E}_{\pi\sim p_{\exp}}\mathbb{E}_{\theta^t\sim\mu^t}\big[D_{\rm H}^2\big(\mathbb{P}_\theta^\pi,\mathbb{P}_{\theta^t}^\pi\big)\big].$$
4: Sample $\pi^t\sim p_{\exp}^t$. Execute $\pi^t$ and observe $\tau^t$.
5: Compute $\mu^{t+1}\in\Delta(\Theta_0)$ by $\mu^{t+1}(\theta)\propto_\theta\mu^t(\theta)\cdot\exp\big(\eta\log\tilde{\mathbb{P}}_\theta^{\pi^t}(\tau^t)\big)$.
Output: Policy $\hat{\pi}_{\rm out} := \frac{1}{T}\sum_{t=1}^Tp_{\rm out}^t$.
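Line 5 of Algorithm 2 is a tempered-Bayes (exponential-weights) update: $\mu^{t+1}(\theta)\propto\mu^t(\theta)\,\tilde{\mathbb{P}}_\theta^{\pi^t}(\tau^t)^\eta$. A minimal sketch on a toy two-model class with a single Bernoulli observation per episode (all names and numbers are illustrative; the exploration-policy optimization in line 3 is omitted):

```python
import numpy as np

rng = np.random.default_rng(5)
eta, T = 1 / 3, 300
# Two candidate models of a single coin observation: P(obs = 1)
models = {"theta_a": 0.8, "theta_b": 0.3}
names = list(models)
mu = np.full(len(names), 0.5)            # mu^1 = Unif(Theta_0)

true_p = models["theta_a"]
for _ in range(T):
    obs = rng.random() < true_p          # "trajectory" generated by theta_a
    lik = np.array([models[m] if obs else 1 - models[m] for m in names])
    mu = mu * lik ** eta                 # mu^{t+1} ∝ mu^t * exp(eta * log-likelihood)
    mu /= mu.sum()

assert abs(mu.sum() - 1) < 1e-9
assert mu[0] > 0.99                      # posterior concentrates on the true model
```

The tempering exponent $\eta<1$ slows the update relative to exact Bayes, which is what the analysis of Theorem H.1 requires.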

H EXPLORATIVE E2D, ALL-POLICY MODEL-ESTIMATION E2D, AND MOPS

In this section, we present the detailed algorithms of EXPLORATIVE E2D, ALL-POLICY MODEL-ESTIMATION E2D, and MOPS introduced in Section 4. We also state the theorems giving their sample complexity bounds for learning an ε-optimal policy in B-stable PSRs.

H.1 EXPLORATIVE E2D ALGORITHM

In this section, we provide more details about the EXPLORATIVE E2D algorithm as discussed in Section 4.2. The full algorithm of EXPLORATIVE E2D is given in Algorithm 2, which is equivalent to Chen et al. (2022, Algorithm 2) in the known-reward setting ($D^2_{\mathrm{RL}}$ becomes $D_H^2$, since we assume the reward is deterministic and known, so that the contribution of the reward distance to $D^2_{\mathrm{RL}}$ is zero). Chen et al. (2022, Theorem F.1) showed that EXPLORATIVE E2D achieves the following estimation bound.

Theorem H.1 (Chen et al. (2022), Theorem F.1). Given a $1/T$-optimistic cover $(\tilde{\mathbb{P}}, \Theta_0)$ (c.f. Definition C.4) of the model class $\Theta$, Algorithm 2 with $\eta = 1/3$ achieves the following with probability at least $1-\delta$:
$$V^\star - V_{\theta^\star}(\hat{\pi}_{\mathrm{out}}) \le \mathrm{edec}_{\gamma}(\Theta) + \frac{10\gamma}{T}\big[\log|\Theta_0| + 2\log(1/\delta) + 3\big],$$
where $\mathrm{edec}_{\gamma}$ is the Explorative DEC as defined in Section 4.2.

Theorem H.2 (Restatement of Theorem 10). Suppose $\Theta$ is a PSR class with the same core test sets $\{\mathcal{U}_h\}_{h\in[H]}$, and each $\theta \in \Theta$ admits a B-representation that is $\Lambda_B$-stable (c.f. Definition 4) or weakly $\Lambda_B$-stable (c.f. Definition D.4), and has PSR rank $d_{\mathrm{PSR}} \le d$. Then
$$\mathrm{edec}_{\gamma}(\Theta) \le 9 d A_{\mathcal{U}_A} \Lambda_B^2 H^2 / \gamma.$$
Therefore, we can choose a suitable parameter $\gamma$ and a $1/T$-optimistic cover $(\tilde{\mathbb{P}}, \Theta_0)$, such that with probability at least $1-\delta$, Algorithm 2 outputs a policy $\hat{\pi}_{\mathrm{out}} \in \Delta(\Pi)$ such that $V^\star - V_{\theta^\star}(\hat{\pi}_{\mathrm{out}}) \le \varepsilon$, as long as the number of episodes
$$T \ge \mathcal{O}\big(d A_{\mathcal{U}_A} \Lambda_B^2 H^2 \log(\mathcal{N}_{\Theta}(1/T)/\delta)/\varepsilon^2\big).$$
The proof of Theorem H.2 (and hence Theorem 10) is contained in Appendix I.2.

Algorithm 3 ALL-POLICY MODEL-ESTIMATION E2D (Chen et al., 2022)
1: Input: Model class $\Theta$, parameters $\gamma > 0$, $\eta \in (0, 1/2]$. A $1/T$-optimistic cover $(\tilde{\mathbb{P}}, \Theta_0)$.
2: Initialize $\mu^1 = \mathrm{Unif}(\Theta_0)$.
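The posterior update in line 5 of Algorithm 2 (shared by Algorithm 3) is a Tempered Aggregation / exponential-weights step over the finite cover $\Theta_0$. A minimal sketch, computed in log-space for numerical stability (the three-model cover and likelihood values below are illustrative stand-ins, not the paper's):

```python
import numpy as np

def tempered_update(log_mu, log_lik, eta):
    """One Tempered Aggregation step:
    mu^{t+1}(theta) ∝ mu^t(theta) * P̃_theta(tau^t)^eta, in log-space."""
    logw = log_mu + eta * log_lik
    logw -= logw.max()            # stabilize before exponentiating
    w = np.exp(logw)
    return np.log(w / w.sum())

# Toy run: cover of 3 models; model index 2 assigns the highest likelihood
# to every observed trajectory, so its posterior mass should grow.
log_mu = np.log(np.full(3, 1 / 3))            # mu^1 = Unif(Theta_0)
per_step_loglik = np.log(np.array([0.1, 0.2, 0.6]))
for _ in range(5):
    log_mu = tempered_update(log_mu, per_step_loglik, eta=1 / 3)
mu = np.exp(log_mu)
assert abs(mu.sum() - 1) < 1e-9 and mu.argmax() == 2
```

The tempering exponent $\eta < 1$ is what allows the fast-rate (squared-Hellinger) analysis behind Theorem H.1, as opposed to vanilla Bayes ($\eta = 1$).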

H.2 ALL-POLICY MODEL-ESTIMATION E2D

In this section, we provide more details about model-estimation learning in PSRs as discussed in Section 4.2. In reward-free RL (Jin et al., 2020b), the goal is to optimally explore the environment without observing reward information, so that after the exploration phase, a near-optimal policy for any given reward can be computed using the collected trajectory data alone, without further interaction with the environment.

The ALL-POLICY MODEL-ESTIMATION E2D algorithm (Algorithm 3) for a PSR class $\Theta$ proceeds as follows: in each episode $t \in [T]$, we maintain a distribution $\mu^t \in \Delta(\Theta_0)$ over a $1/T$-optimistic cover $(\tilde{\mathbb{P}}, \Theta_0)$ of $\Theta$ (c.f. Definition C.4), which we use to compute an exploration policy distribution $p^t_{\mathrm{exp}}$ by minimizing a model-estimation risk (the analogue of $\hat{V}^{\mu^t}_{\gamma}$ in Algorithm 2).

Theorem H.3. Given a $1/T$-optimistic cover $(\tilde{\mathbb{P}}, \Theta_0)$ (c.f. Definition C.4) of the class of transition dynamics $\Theta$, Algorithm 3 with $\eta = 1/2$ achieves the following with probability at least $1-\delta$:
$$\sup_{\pi} D_{\mathrm{TV}}\big(\mathbb{P}^{\pi}_{\hat{\theta}}, \mathbb{P}^{\pi}_{\theta^\star}\big) \le 6\,\mathrm{amdec}_{\gamma}(\Theta) + \frac{60\gamma}{T}\big[\log|\Theta_0| + 2\log(1/\delta) + 3\big],$$
where $\mathrm{amdec}_{\gamma}$ is the All-policy Model-Estimation DEC as defined in (43).

We provide a sharp bound on the AMDEC for B-stable PSRs, which implies that ALL-POLICY MODEL-ESTIMATION E2D can also learn them sample-efficiently in a model-estimation manner.

Theorem H.4. Suppose $\Theta$ is a PSR class with the same core test sets $\{\mathcal{U}_h\}_{h\in[H]}$, and each $\theta \in \Theta$ admits a B-representation that is $\Lambda_B$-stable (c.f. Definition 4) or weakly $\Lambda_B$-stable (c.f. Definition D.4), and has PSR rank $d_{\mathrm{PSR}} \le d$. Then
$$\mathrm{amdec}_{\gamma}(\Theta) \le 6 d A_{\mathcal{U}_A} \Lambda_B^2 H^2 / \gamma.$$
Therefore, we can choose a suitable parameter $\gamma$ and a $1/T$-optimistic cover $(\tilde{\mathbb{P}}, \Theta_0)$, such that with probability at least $1-\delta$, Algorithm 3 outputs a model $\hat{\theta} \in \Theta$ such that $\sup_{\pi} D_{\mathrm{TV}}(\mathbb{P}^{\pi}_{\hat{\theta}}, \mathbb{P}^{\pi}_{\theta^\star}) \le \varepsilon$, as long as the number of episodes
$$T \ge \mathcal{O}\big(d A_{\mathcal{U}_A} \Lambda_B^2 H^2 \log(\mathcal{N}_{\Theta}(1/T)/\delta)/\varepsilon^2\big).$$
The proof of Theorem H.4 is contained in Appendix I.3.
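Theorem H.3 controls the all-policy total-variation error, while the algorithm's exploration risk is measured in squared Hellinger distance. The transfer between the two uses the standard comparison, which in the paper's convention $D_H^2(p,q) = \sum_x(\sqrt{p(x)}-\sqrt{q(x)})^2$ reads $\frac{1}{2}D_H^2 \le D_{\mathrm{TV}} \le D_H$. A quick numerical check on random distributions (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

def d_hellinger(p, q):
    # Paper's convention: D_H^2 = sum (sqrt p - sqrt q)^2, so D_H ∈ [0, sqrt(2)].
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

for _ in range(100):
    p = rng.random(20); p /= p.sum()
    q = rng.random(20); q /= q.sum()
    dh, dtv = d_hellinger(p, q), tv(p, q)
    # 0.5 * D_H^2 <= D_TV <= D_H: Hellinger control yields TV control and vice versa.
    assert 0.5 * dh**2 - 1e-12 <= dtv <= dh + 1e-12
```

This is why a squared-Hellinger exploration bonus suffices for the TV-distance model-estimation guarantee of Theorem H.3.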

H.3 MODEL-BASED OPTIMISTIC POSTERIOR SAMPLING (MOPS)

In this section, we provide more details about the MOPS algorithm as discussed in Section 4.3. We consider the following version of the MOPS algorithm of Agarwal & Zhang (2022); Chen et al. (2022). Similar to EXPLORATIVE E2D, MOPS maintains a posterior $\mu^t \in \Delta(\Theta_0)$ over a $1/T$-optimistic cover $(\tilde{\mathbb{P}}, \Theta_0)$, initialized at a suitable prior $\mu^1$. The exploration policy in the $t$-th episode is obtained by posterior sampling:
$$\pi^t = \pi_{\theta^t} \circ_{h^t} \mathrm{Unif}(\mathcal{A}) \circ_{h^t+1} \mathrm{Unif}(\mathcal{U}_{A,h^t+1}),$$
where $\theta^t \sim \mu^t$ and $h^t \sim \mathrm{Unif}(\{0, 1, \dots, H-1\})$. After executing $\pi^t$ and observing $\tau^t$, the algorithm updates the posterior as
$$\mu^{t+1}(\theta) \propto_{\theta} \mu^1(\theta)\exp\Big(\sum_{s=1}^{t}\big(\gamma^{-1} V_{\theta}(\pi_{\theta}) + \eta \log \tilde{\mathbb{P}}^{\pi^s}_{\theta}(\tau^s)\big)\Big).$$
Finally, the algorithm outputs $\hat{\pi}_{\mathrm{out}} := \frac{1}{T}\sum_{t=1}^T p_{\mathrm{out}}(\mu^t)$, where $p_{\mathrm{out}}(\mu) \in \Delta(\Pi)$ is defined as
$$p_{\mathrm{out}}(\mu)(\pi) = \mu(\{\theta : \pi_{\theta} = \pi\}), \qquad \forall \pi \in \Pi. \tag{45}$$

Algorithm 4 MODEL-BASED OPTIMISTIC POSTERIOR SAMPLING (Agarwal & Zhang, 2022)
1: Input: Parameters $\gamma > 0$, $\eta \in (0, 1/2)$. A $1/T$-optimistic cover $(\tilde{\mathbb{P}}, \Theta_0)$.
2: Initialize: $\mu^1 = \mathrm{Unif}(\Theta_0)$.
3: for $t = 1, \dots, T$ do
4:   Sample $\theta^t \sim \mu^t$ and $h^t \sim \mathrm{Unif}(\{0, 1, \cdots, H-1\})$.
5:   Set $\pi^t = \pi_{\theta^t} \circ_{h^t} \mathrm{Unif}(\mathcal{A}) \circ_{h^t+1} \mathrm{Unif}(\mathcal{U}_{A,h^t+1})$, execute $\pi^t$ and observe $\tau^t$.
6:   Compute $\mu^{t+1} \in \Delta(\Theta_0)$ by $\mu^{t+1}(\theta) \propto_{\theta} \mu^1(\theta)\exp\big(\sum_{s=1}^t(\gamma^{-1}V_{\theta}(\pi_{\theta}) + \eta\log\tilde{\mathbb{P}}^{\pi^s}_{\theta}(\tau^s))\big)$.
Output: Policy $\hat{\pi}_{\mathrm{out}} := \frac{1}{T}\sum_{t=1}^T p_{\mathrm{out}}(\mu^t)$, where $p_{\mathrm{out}}(\cdot)$ is defined in (45).

We further consider the following Explorative PSC (EPSC), which is a modification of the PSC proposed in Chen et al. (2022, Definition 4):
$$\mathrm{psc}^{\mathrm{est}}_{\gamma}(\Theta, \bar{\theta}) = \sup_{\mu \in \Delta_0(\Theta)} \mathbb{E}_{\theta \sim \mu}\big[V_{\theta}(\pi_{\theta}) - V_{\bar{\theta}}(\pi_{\theta})\big] - \gamma\, \mathbb{E}_{\pi \sim \mu}\big[D_H^2\big(\mathbb{P}^{\pi_{\mathrm{exp}}}_{\theta}, \mathbb{P}^{\pi_{\mathrm{exp}}}_{\bar{\theta}}\big)\big], \tag{46}$$
where $\Delta_0(\Theta)$ is the set of all finitely supported distributions on $\Theta$, $\pi_{\mathrm{exp}}$ is defined as $\pi_{\mathrm{exp}} = \frac{1}{H}\sum_{h=0}^{H-1} \pi \circ_h \mathrm{Unif}(\mathcal{A}) \circ_{h+1} \mathrm{Unif}(\mathcal{U}_{A,h+1})$, and we abbreviate $\pi \sim p_{\mathrm{out}}(\mu)$ to $\pi \sim \mu$. Adapting the proof for the MOPS algorithm in Chen et al. (2022, Corollary D.3 & Theorem D.1) to the explorative version, we can show that the output policy $\hat{\pi}_{\mathrm{out}}$ of MOPS has a sub-optimality gap that scales with $\mathrm{psc}^{\mathrm{est}}$.

Theorem H.5. Given a $1/T$-optimistic cover $(\tilde{\mathbb{P}}, \Theta_0)$ (c.f. Definition C.4) of the class of PSR models $\Theta$, Algorithm 4 with $\eta = 1/6$ and $\gamma \ge 1$ achieves the following with probability at least $1-\delta$:
$$V^\star - V_{\theta^\star}(\hat{\pi}_{\mathrm{out}}) \le \mathrm{psc}^{\mathrm{est}}_{\gamma/6}(\Theta, \theta^\star) + \frac{2}{\gamma} + \frac{\gamma}{T}\big[\log|\Theta_0| + 2\log(1/\delta) + 5\big],$$
where $\mathrm{psc}^{\mathrm{est}}_{\gamma}$ is the Explorative PSC as defined in (46).

We provide a sharp bound on the EPSC for B-stable PSRs, which implies that MOPS can also learn them sample-efficiently.

Theorem H.6. Suppose $\Theta$ is a PSR class with the same core test sets $\{\mathcal{U}_h\}_{h\in[H]}$, each $\theta \in \Theta$ admits a B-representation that is $\Lambda_B$-stable (c.f. Definition 4) or weakly $\Lambda_B$-stable (c.f. Definition D.4), and the ground truth model $\theta^\star$ has PSR rank at most $d$. Then
$$\mathrm{psc}^{\mathrm{est}}_{\gamma}(\Theta, \theta^\star) \le 6\Lambda_B^2 d A_{\mathcal{U}_A} H^2/\gamma.$$
Therefore, we can choose a suitable parameter $\gamma$ and a $1/T$-optimistic cover $(\tilde{\mathbb{P}}, \Theta_0)$, such that with probability at least $1-\delta$, Algorithm 4 outputs a policy $\hat{\pi}_{\mathrm{out}} \in \Delta(\Pi)$ such that $V^\star - V_{\theta^\star}(\hat{\pi}_{\mathrm{out}}) \le \varepsilon$, as long as the number of episodes
$$T \ge \mathcal{O}\big(d A_{\mathcal{U}_A}\Lambda_B^2 H^2 \log(\mathcal{N}_{\Theta}(1/T)/\delta)/\varepsilon^2\big).$$
The proof of Theorem H.6 is contained in Appendix I.2. We remark here that EPSC provides an upper bound on EDEC (c.f. Eq. (55)), so Theorem H.2 (and hence Theorem 10) directly follows from Theorem H.6.
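The MOPS posterior combines likelihood weights with the optimism bonus $\gamma^{-1}V_\theta(\pi_\theta)$ accumulated over episodes. A minimal sketch over a finite cover (the values and log-likelihoods below are illustrative stand-ins, not the paper's):

```python
import numpy as np

def mops_posterior(log_mu1, values, loglik_sum, t, gamma, eta):
    # mu^{t+1}(theta) ∝ mu^1(theta) * exp( (t/gamma) * V_theta(pi_theta)
    #                                      + eta * sum_{s<=t} log P̃_theta(tau^s) )
    logw = log_mu1 + (t / gamma) * values + eta * loglik_sum
    logw -= logw.max()                 # log-space stabilization
    w = np.exp(logw)
    return w / w.sum()

# Toy illustration: two models fit the data equally well, but model 1 promises
# a higher optimal value, so the optimistic posterior prefers it; this is what
# drives exploration toward optimistic models.
log_mu1 = np.log(np.array([0.5, 0.5]))   # uniform prior over the cover
values = np.array([0.2, 0.8])            # V_theta(pi_theta), assumed in [0, 1]
loglik_sum = np.array([-10.0, -10.0])    # identical cumulative log-likelihoods
mu = mops_posterior(log_mu1, values, loglik_sum, t=10, gamma=5.0, eta=1 / 6)
assert mu[1] > mu[0] and abs(mu.sum() - 1) < 1e-9
```

As data accrues, the likelihood term dominates the fixed-size optimism bonus, so over-optimistic but wrong models are eventually down-weighted.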

I PROOFS FOR APPENDIX H

For clarity of the discussion, we introduce the following notation in this section: for a policy $\pi$, we denote by $\varphi_h$ the policy modification such that $\varphi_h \diamond \pi = \pi \circ_h \mathrm{Unif}(\mathcal{A}) \circ_{h+1} \mathrm{Unif}(\mathcal{U}_{A,h+1})$. Here, $\varphi_h \diamond \pi$ means that we follow $\pi$ for the first $h-1$ steps, take an action sampled from $\mathrm{Unif}(\mathcal{A})$ at step $h$, take an action sequence sampled from $\mathrm{Unif}(\mathcal{U}_{A,h+1})$ at step $h+1$, and behave arbitrarily afterwards. This definition agrees with (47). We further define the policy modification $\varphi$ as
$$\varphi \diamond \pi = \frac{1}{H}\sum_{h=0}^{H-1} \varphi_h \diamond \pi = \frac{1}{H}\sum_{h=0}^{H-1} \pi \circ_h \mathrm{Unif}(\mathcal{A}) \circ_{h+1} \mathrm{Unif}(\mathcal{U}_{A,h+1}). \tag{47}$$
We call $\varphi \diamond \pi$ the exploration policy of $\pi$.
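Operationally, sampling from $\varphi \diamond \pi$ reads: draw $h$ uniformly, roll in with $\pi$, randomize at step $h$, then play a uniformly drawn core test action sequence. A hypothetical sketch with toy stand-ins for $\pi$ and $\mathcal{U}_{A,h+1}$ (all names here are illustrative, not the paper's):

```python
import random

def sample_exploration_actions(pi, H, actions, U_A):
    """Sample an action sequence from phi ⋄ pi: follow pi for steps 0..h-1,
    play Unif(actions) at step h, then a sequence from Unif(U_A[h+1]),
    where h ~ Unif({0, ..., H-1})."""
    h = random.randrange(H)
    seq = [pi(step) for step in range(h)]             # roll-in with pi
    seq.append(random.choice(actions))                # Unif(A) at step h
    seq.extend(random.choice(U_A.get(h + 1, [()])))   # Unif(U_{A,h+1}) test sequence
    return seq[:H]                                    # behavior afterwards is arbitrary

random.seed(0)
H, actions = 4, [0, 1]
U_A = {h: [(0,), (1,)] for h in range(1, H + 1)}      # toy core test action sets
traj = sample_exploration_actions(lambda step: 0, H, actions, U_A)
assert len(traj) <= H and all(a in actions for a in traj)
```

Averaging over $h$ is what lets a single mixture policy expose the model's behavior at every step simultaneously, which the proofs below exploit.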

I.1 PROOF OF THEOREM H.6

To prove Theorem H.6, by Theorem H.5 we only need to bound the coefficient $\mathrm{psc}^{\mathrm{est}}_{\gamma}(\Theta, \theta^\star)$. By its definition, we have
$$\mathrm{psc}^{\mathrm{est}}_{\gamma}(\Theta, \theta^\star) = \sup_{\mu \in \Delta_0(\Theta)} \mathbb{E}_{\theta \sim \mu}\big[V_{\theta}(\pi_{\theta}) - V_{\theta^\star}(\pi_{\theta})\big] - \gamma\, \mathbb{E}_{\pi \sim \mu}\big[D_H^2\big(\mathbb{P}^{\varphi \diamond \pi}_{\theta}, \mathbb{P}^{\varphi \diamond \pi}_{\theta^\star}\big)\big],$$
where $\varphi \diamond \pi$, defined in (47), is the exploration policy of $\pi$.

Proof of Proposition I.2. In the following, we abbreviate $\mathbf{E} = \mathbf{E}^{\theta^\star}$ and $\mathbf{q}^\star = \mathbf{q}_{\theta^\star}$. Then, by Proposition F.1, we obtain the decomposition (49). For the term $\mathbb{E}_{\theta \sim \mu}[\mathbf{E}_{\theta,0}]$, we have
$$\mathbb{E}_{\theta \sim \mu}\big[\mathbf{E}_{\theta,0}\big] \le \sqrt{\mathbb{E}_{\theta \sim \mu}\big[\mathbf{E}^2_{\theta,0}\big]}. \tag{50}$$
We next consider the case $h \in [H]$, and upper bound the corresponding terms on the right-hand side of (49) using the decoupling argument introduced in Appendix E.2 (Proposition E.6), restated as follows for convenience: for functions $f_{\theta}(x) = \max_{r}\sum_{j}|\langle x, y_{\theta,j,r}\rangle|$, $\forall x \in \mathbb{R}^n$, where $\{y_{\theta,j,r}\}_{(\theta,j,r)\in\Theta\times[J]\times R} \subset \mathbb{R}^n$ is a family of bounded vectors, any distribution $\mu$ over $\Theta$, and any probability family $\{q_{\theta}\}_{\theta\in\Theta} \subset \Delta(I)$,
$$\mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_{\theta}}\big[f_{\theta}(x_i)\big] \le \sqrt{d_X\,\mathbb{E}_{\theta,\theta'\sim\mu}\mathbb{E}_{i\sim q_{\theta'}}\big[f_{\theta}(x_i)^2\big]},$$
where $d_X$ is the dimension of the subspace of $\mathbb{R}^n$ spanned by $(x_i)_{i\in I}$. We have the following three preparation steps to apply Proposition E.6:

1. Recall that $\mathbf{E} = \mathbf{E}^{\theta^\star}$ is defined in Proposition F.1. Define
$$y_{\theta,j,\pi} := \frac{1}{2}\,\pi(\tau^j_{h:H})\Big[\mathbf{B}^{\theta}_{H:h+1}(\tau^j_{h+1:H})\big(\mathbf{B}^{\theta}_h(o^j_h, a^j_h) - \mathbf{B}^{\theta^\star}_h(o^j_h, a^j_h)\big)\Big]^{\top} \in \mathbb{R}^{|\mathcal{U}_h|},$$
where $\{\tau^j_{h:H} = (o^j_h, a^j_h, \cdots, o^j_H, a^j_H)\}_{j=1}^{n_y}$ is an ordering of all possible $\tau_{h:H}$ (hence $n_y = (OA)^{H-h+1}$), and $\pi$ is any policy (that starts at step $h$). We then define $f_{\theta}(x) = \max_{\pi}\sum_j |\langle y_{\theta,j,\pi}, x\rangle|$ for $x \in \mathbb{R}^{|\mathcal{U}_h|}$. It then follows from the definition (c.f. Proposition F.1) that $\mathbf{E}_{\theta,h}(\tau_{h-1}) = f_{\theta}(\mathbf{q}^\star(\tau_{h-1}))$.

2. We define $x_i = \mathbf{q}^\star(\tau^i_{h-1}) \in \mathbb{R}^{|\mathcal{U}_h|}$ for $i \in I = (\mathcal{O}\times\mathcal{A})^{h-1}$, where $\{\tau^i_{h-1}\}_{i\in I}$ is an ordering of all possible $\tau_{h-1} \in (\mathcal{O}\times\mathcal{A})^{h-1}$. Then by our definition of PSR rank (c.f. Definition 3), the subspace of $\mathbb{R}^{|\mathcal{U}_h|}$ spanned by $\{x_i\}_{i\in I}$ has dimension at most $d_{\theta^\star}$.

3. We take $q_{\theta} \in \Delta(I)$ as
$$q_{\theta}(i) = \mathbb{E}_{\pi\sim\mu(\cdot|\theta)}\big[\mathbb{P}^{\pi}_{\theta^\star}(\tau_{h-1} = \tau^i_{h-1})\big], \qquad i \in I = (\mathcal{O}\times\mathcal{A})^{h-1}. \tag{51}$$

Therefore, applying Proposition E.6 to the function family $\{f_{\theta}\}_{\theta\in\Theta}$, vector family $\{x_i\}_{i\in I}$, and distribution family $\{q_{\theta}\}_{\theta\in\Theta}$ gives
$$\mathbb{E}_{(\theta,\pi)\sim\mu}\big[\mathbb{E}^{\theta^\star,\pi}\big[\mathbf{E}_{\theta,h}(\tau_{h-1})\big]\big] = \mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_{\theta}}\big[f_{\theta}(x_i)\big] \le \sqrt{d_{\theta^\star}\,\mathbb{E}_{\theta,\theta'\sim\mu}\mathbb{E}_{i\sim q_{\theta'}}\big[f_{\theta}(x_i)^2\big]} = \sqrt{d_{\theta^\star}\,\mathbb{E}_{\theta\sim\mu}\mathbb{E}_{\pi\sim\mu}\big[\mathbb{E}^{\theta^\star,\pi}\big[\mathbf{E}^2_{\theta,h}(\tau_{h-1})\big]\big]}. \tag{52}$$
Combining Eqs. (50), (52), and (49) yields the desired bound, where (in the resulting chain) the third inequality is due to the Cauchy-Schwarz inequality and the fourth inequality is due to Proposition F.2. This completes the proof of Proposition I.1.

I.2 PROOF OF THEOREM H.2 (THEOREM 10)

According to Theorem H.1, in order to prove Theorem H.2 (Theorem 10), we only need to bound the coefficient $\mathrm{edec}_{\gamma}(\Theta)$ for $\gamma > 0$. In the following, we bound edec by $\mathrm{psc}^{\mathrm{est}}$ using the idea of Chen et al. (2022, Proposition 6). Recall that edec is defined in Section 4.2. By strong duality (c.f. Theorem C.2), we have
$$\mathrm{edec}_{\gamma}(\Theta, \mu) := \inf_{\substack{p_{\mathrm{exp}} \in \Delta(\Pi) \\ p_{\mathrm{out}} \in \Delta(\Pi)}} \sup_{\theta \in \Theta}\, \mathbb{E}_{\pi\sim p_{\mathrm{out}}}\big[V_{\theta}(\pi_{\theta}) - V_{\theta}(\pi)\big] - \gamma\, \mathbb{E}_{\bar{\theta}\sim\mu}\mathbb{E}_{\pi\sim p_{\mathrm{exp}}}\big[D_H^2(\mathbb{P}^{\pi}_{\theta}, \mathbb{P}^{\pi}_{\bar{\theta}})\big]$$
$$= \sup_{\mu' \in \Delta_0(\Theta)} \inf_{\substack{p_{\mathrm{exp}} \in \Delta(\Pi) \\ p_{\mathrm{out}} \in \Delta(\Pi)}} \mathbb{E}_{\theta\sim\mu'}\mathbb{E}_{\pi\sim p_{\mathrm{out}}}\big[V_{\theta}(\pi_{\theta}) - V_{\theta}(\pi)\big] - \gamma\, \mathbb{E}_{\theta\sim\mu'}\mathbb{E}_{\bar{\theta}\sim\mu}\mathbb{E}_{\pi\sim p_{\mathrm{exp}}}\big[D_H^2(\mathbb{P}^{\pi}_{\theta}, \mathbb{P}^{\pi}_{\bar{\theta}})\big]. \tag{53}$$
Note that $|V_{\theta}(\pi) - V_{\bar{\theta}}(\pi)| \le D_{\mathrm{TV}}(\mathbb{P}^{\pi}_{\theta}, \mathbb{P}^{\pi}_{\bar{\theta}}) \le D_H(\mathbb{P}^{\pi}_{\theta}, \mathbb{P}^{\pi}_{\bar{\theta}})$. Therefore, we can take $p_{\mathrm{out}} = p_{\mu}$, where $p_{\mu}$ is defined as $p_{\mu}(\pi) = \mu(\{\theta : \pi_{\theta} = \pi\})$. Then for a fixed $\alpha \in (0,1)$, we have
$$\mathbb{E}_{\theta\sim\mu}\mathbb{E}_{\pi\sim p_{\mu}}\big[V_{\theta}(\pi_{\theta}) - V_{\theta}(\pi)\big] \le \mathbb{E}_{\theta\sim\mu}\mathbb{E}_{\bar{\theta}\sim\mu}\mathbb{E}_{\pi\sim p_{\mu}}\big[D_H(\mathbb{P}^{\pi}_{\theta}, \mathbb{P}^{\pi}_{\bar{\theta}})\big] + \mathbb{E}_{\theta\sim\mu}\mathbb{E}_{\bar{\theta}\sim\mu}\mathbb{E}_{\pi\sim p_{\mu}}\big[V_{\theta}(\pi_{\theta}) - V_{\bar{\theta}}(\pi)\big]$$
$$= \mathbb{E}_{\theta\sim\mu}\mathbb{E}_{\bar{\theta}\sim\mu}\mathbb{E}_{\pi\sim p_{\mu}}\big[D_H(\mathbb{P}^{\pi}_{\theta}, \mathbb{P}^{\pi}_{\bar{\theta}})\big] + \mathbb{E}_{\theta\sim\mu}\mathbb{E}_{\bar{\theta}\sim\mu}\big[V_{\theta}(\pi_{\theta}) - V_{\bar{\theta}}(\pi_{\theta})\big]$$
$$\le \frac{1}{4(1-\alpha)\gamma} + (1-\alpha)\gamma\, \mathbb{E}_{\theta\sim\mu}\mathbb{E}_{\bar{\theta}\sim\mu}\mathbb{E}_{\pi\sim p_{\mu}}\big[D_H^2(\mathbb{P}^{\pi}_{\theta}, \mathbb{P}^{\pi}_{\bar{\theta}})\big] + \mathbb{E}_{\theta\sim\mu}\mathbb{E}_{\bar{\theta}\sim\mu}\big[V_{\theta}(\pi_{\theta}) - V_{\bar{\theta}}(\pi_{\theta})\big], \tag{54}$$
where the equality is due to our choice of $p_{\mu}$:
$$\mathbb{E}_{\theta\sim\mu}\mathbb{E}_{\pi\sim p_{\mu}}\big[V_{\bar{\theta}}(\pi)\big] = \mathbb{E}_{\pi\sim p_{\mu}}\big[V_{\bar{\theta}}(\pi)\big] = \mathbb{E}_{\theta\sim\mu}\big[V_{\bar{\theta}}(\pi_{\theta})\big],$$
and the last inequality is due to the AM-GM inequality. Therefore, we can take $p_{\mathrm{exp}} = (1-\alpha)p_{\mu} + \alpha p_e \in \Delta(\Pi)$, where $p_e$ is given by $p_e(\pi) = \mu(\{\theta : \varphi \diamond \pi_{\theta} = \pi\})$; using this choice of $p_{\mathrm{exp}}$ and $p_{\mathrm{out}}$ in Eq. (53) and using Eq. (54),
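The sup-inf swap in Eq. (53) relies on a minimax (strong duality) theorem. For intuition, strong duality for a finite zero-sum game can be checked numerically by solving both orderings as linear programs; the sketch below is a generic toy game, not the edec objective itself (scipy is assumed available):

```python
import numpy as np
from scipy.optimize import linprog

def game_value_min_player(G):
    """Value of min_{p in Δ} max_j (p^T G)_j via LP: minimize v s.t. G^T p <= v·1."""
    m, n = G.shape
    c = np.r_[np.zeros(m), 1.0]              # variables: [p_1..p_m, v]; minimize v
    A_ub = np.c_[G.T, -np.ones(n)]           # (G^T p)_j - v <= 0 for every column j
    b_ub = np.zeros(n)
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)   # sum_i p_i = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)], method="highs")
    return res.fun

rng = np.random.default_rng(2)
G = rng.random((4, 5))
inf_sup = game_value_min_player(G)           # inf_p sup_q p^T G q
sup_inf = -game_value_min_player(-G.T)       # sup_q inf_p p^T G q
assert abs(inf_sup - sup_inf) < 1e-7         # strong duality: the two values coincide
```

In the paper's setting the same swap is justified by Theorem C.2 for the (bilinear-in-distributions) edec objective over $\Delta(\Pi)$ and $\Delta_0(\Theta)$.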



Footnotes (collected):
- For ν-future-sufficient POMDPs, Wang et al. (2022)'s sample complexity depends on γ, an additional l-step past-sufficiency parameter that they require.
- This definition can be generalized to continuous $\mathcal{U}_h$, where $\mathbf{B}_h(o_h, a_h) \in \mathcal{L}(L^1(\mathcal{U}_h), L^1(\mathcal{U}_{h+1}))$ are linear operators instead of (finite-dimensional) matrices.
- This definition using matrix ranks may be further relaxed, e.g. by considering the effective dimension.
- For the m-step versions of our structural conditions, we allow an exponential dependence on m but not on H. Such a dependence is necessary, e.g. in m-step decodable POMDPs (Efroni et al., 2022).
- The $\Pi$-norm is in general a semi-norm.
- It is straightforward to generalize this example to the case where $\mathcal{S}$ and $\mathcal{O}$ are infinite, by replacing vectors with $L^1$-integrable functions and matrices with linear operators between these spaces.
- Named CRANE in Zhan et al. (2022).
- Uehara et al. (2022b) achieve an $A^M\sigma_1^{-2}$ dependence for learning the optimal memory-M policy in (their) $\sigma_1$-revealing POMDPs, which is however easier than learning the globally optimal policy considered here.
- The log-factor ι contains an additional parameter $R_B$ that is not always controlled by $\Lambda_B$; this quantity also appears in Zhan et al. (2022); Liu et al. (2022b), but is controlled by their $\alpha_{\mathrm{psr}}^{-1}$ or $\gamma^{-1}$, respectively. Nevertheless, for all of our POMDP instantiations, $R_B$ is polynomially bounded by other problem parameters, so that ι is a mild log-factor. Further, our next algorithm EXPLORATIVE E2D avoids the dependence on $R_B$ (Theorem 10).
- Here we introduce the constant 2 in the square root in order for weak B-stability to be weaker than B-stability (Definition 4).
- Note that under such a formulation, $M_1$ has deterministic rewards.
- The terminal state $s_{H+1}$ is a dummy state.
- A latent MDP $M$ has transition rank $d$ if each $M_m$ has rank $d$ as a linear MDP (Jin et al., 2020c).
- For simplicity, we write $\mathbb{T}_{h,a} := \mathbb{T}_h(\cdot|\cdot, a) \in \mathbb{R}^{S\times S}$ for the transition matrix of action $a \in \mathcal{A}$.
Footnotes (continued):
- For clarity of presentation, in this section we adopt the following notation: for operators $(L_n)_{n\in\mathbb{N}}$, we write $\prod_{h=n}^{m} L_h = L_m \circ \cdots \circ L_n$.
- Here, for $h = 0$, we take $\Theta_{0;t} = \{\mu^{\theta}_1\}_{\theta\in\Theta}$.
- The optimistic covers of the emission matrices $\Theta_{h;o}$ and transitions $\Theta_{h;t}$ are defined as in Chen et al. (2022, Definition C.5), with the context π being $s$ and $(s,a)$, and the output being $o$ and $s$, respectively.
- It is worth noting that the $(\mathbb{P}_h)$ we define is exactly the transition dynamics of the associated mega-state MDP (Efroni et al., 2022).
- For $h = H$, we understand $\mathbf{B}_H(o, a) = [\mathbb{1}(t = o)]_{t\in\mathcal{U}_H}$, because $o_{H+1} = o_{\mathrm{dum}}$ always.
- Liu et al. (2022b) only assert a polynomial rate without spelling out the concrete powers of the problem parameters. This rate is extracted from Liu et al. (2022b, Proposition C.5 & Lemma C.6).
- The boundedness of $\{y_{\theta,j,\pi}\}$ is trivially satisfied, because $\mu_0$ is finitely supported.
- Here, $p_e$ is technically a distribution over the set of mixed policies $\Delta(\Pi)$, and can be identified with a mixed policy in $\Delta(\Pi)$.



Farina et al. (2021); Kozuno et al. (2021); Bai et al. (2022a;b); Song et al. (2022), where the sample complexity scales polynomially with the size of the game tree (typically exponential in the horizon). This line of results is in general incomparable to ours, as their tree-structure assumption is different from B-stability.

Learning PSRs. PSRs were proposed in Littman & Sutton (2001); Singh et al. (2012); Rosencrantz et al. (2004); Boots et al. (

) only requires an ℓ₁ bound. In comparison, Liu et al. (2022a); Zhan et al. (2022) only obtain a precondition in ℓ₁, and thus have to perform an ℓ₁-Eluder argument, which results in an additional factor of $d$ in the final sample complexity.

Combining (10), (13) (summed over $k \in [K]$) and (14) completes the proof of Theorem 9.

C TECHNICAL TOOLS

C.1 TECHNICAL TOOLS

D PROOFS FOR SECTION 3

D.1 BASIC PROPERTY OF B-REPRESENTATION

Proposition D.1 (Equivalence between PSR definition and B-representation).

Proposition E.6 (Decoupling argument). Suppose we have vectors and functions $\{x_i\}_{i\in I} \subset \mathbb{R}^n$, $\{f_{\theta} : \mathbb{R}^n \to \mathbb{R}\}_{\theta\in\Theta}$, where $\Theta, I$ are arbitrary abstract index sets, with the functions $f_{\theta}$ given by $f_{\theta}(x) := \max_{r}\sum_{j}|\langle x, y_{\theta,j,r}\rangle|$, $\forall x \in \mathbb{R}^n$, where $\{y_{\theta,j,r}\}_{(\theta,j,r)\in\Theta\times[J]\times R} \subset \mathbb{R}^n$ is a family of bounded vectors. Then for any distribution $\mu$ over $\Theta$ and any probability family $\{q_{\theta}\}_{\theta\in\Theta} \subset \Delta(I)$,
$$\mathbb{E}_{\theta\sim\mu}\mathbb{E}_{i\sim q_{\theta}}\big[f_{\theta}(x_i)\big] \le \sqrt{d_X\,\mathbb{E}_{\theta,\theta'\sim\mu}\mathbb{E}_{i\sim q_{\theta'}}\big[f_{\theta}(x_i)^2\big]},$$
where $d_X$ is the dimension of the subspace of $\mathbb{R}^n$ spanned by $(x_i)_{i\in I}$.
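The decoupling inequality can be checked numerically on random instances. The sketch below (our own construction, with arbitrary random data) instantiates $f_\theta(x) = \max_r \sum_j |\langle x, y_{\theta,j,r}\rangle|$ and compares both sides:

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_models, n_vecs, J, R = 6, 5, 30, 4, 3

x = rng.normal(size=(n_vecs, n))              # vectors x_i in R^n
y = rng.normal(size=(n_models, J, R, n))      # family y_{theta, j, r}
mu = rng.random(n_models); mu /= mu.sum()     # distribution mu over Theta
q = rng.random((n_models, n_vecs))
q /= q.sum(axis=1, keepdims=True)             # probability family q_theta in Δ(I)

def f(theta, xi):
    # f_theta(x) = max_r sum_j |<x, y_{theta, j, r}>|
    return np.abs(y[theta] @ xi).sum(axis=0).max()

F = np.array([[f(t, x[i]) for i in range(n_vecs)] for t in range(n_models)])
d_X = np.linalg.matrix_rank(x)                # dim span{x_i}

# LHS: E_{theta ~ mu} E_{i ~ q_theta} [f_theta(x_i)]
lhs = sum(mu[t] * q[t] @ F[t] for t in range(n_models))
# RHS: sqrt(d_X * E_{theta, theta' ~ mu} E_{i ~ q_{theta'}} [f_theta(x_i)^2])
rhs = np.sqrt(d_X * sum(mu[t] * mu[s] * (q[s] @ F[t] ** 2)
                        for t in range(n_models) for s in range(n_models)))
assert lhs <= rhs + 1e-9
```

The $\sqrt{d_X}$ factor is tight: taking $d_X$ models, each supported on its own basis vector, makes the two sides equal.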

respectively. All results are scaled to the setting with total reward in $[0,1]$. [Table residue: sample-complexity comparison; recoverable entries include $\tilde{\mathcal{O}}(H^2(\log N_\Theta)^2\,\nu^4\gamma^2/\varepsilon^2)$ and $\tilde{\mathcal{O}}(d_{\mathrm{trans}}A^{2m-1}H^2\log N_\Theta\cdot\nu^2/\varepsilon^2)$ for future-sufficient POMDPs, and, for decodable rank-$d_{\mathrm{trans}}$ POMDPs, $\tilde{\mathcal{O}}(d_{\mathrm{trans}}A^m H^2\log N_{\mathcal{G}}/\varepsilon^2)$ versus $\tilde{\mathcal{O}}(d_{\mathrm{trans}}A^m H^2\log N_\Theta/\varepsilon^2)$.]

$d_{\mathrm{PSR}} \le S$, and $\mathcal{U}_A = \mathcal{A}^{m-1}$ (Example 5). Therefore, both Theorems 9 & 10 achieve sample complexity $\tilde{\mathcal{O}}(S^2 A^m H^2 \log N_{\Theta}/(\alpha_{\mathrm{rev}}^2\varepsilon^2))$. This improves substantially over the current best result $\tilde{\mathcal{O}}(S^4 A^{6m-4} H^6 \log N_{\Theta}/(\alpha_{\mathrm{rev}}^4\varepsilon^2))$ of Liu et al. (2022a, Theorem 24). For tabular POMDPs, we further have $\log N_{\Theta} \le \tilde{\mathcal{O}}(H(S^2 A + SO))$. $\tilde{\mathcal{O}}(d_{\mathrm{trans}} A^m H^2 \log N_{\Theta}/\varepsilon^2)$. Compared with the sample complexity upper bound $\tilde{\mathcal{O}}(d_{\mathrm{trans}} A^m H^2 \log N_{\mathcal{G}}/\varepsilon^2)$ of Efroni et al. (
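To make the improvement over the prior rate concrete, the polynomial factors of the two bounds can be compared directly (log terms dropped; the parameter values below are our own illustrative choices, not the paper's):

```python
# Polynomial factors for m-step revealing POMDPs (log terms dropped).
S, A, m, H, alpha, eps = 10, 4, 2, 20, 0.5, 0.1

ours = S**2 * A**m * H**2 / (alpha**2 * eps**2)             # Theorems 9 & 10
prior = S**4 * A**(6 * m - 4) * H**6 / (alpha**4 * eps**2)  # Liu et al. (2022a, Thm 24)
improvement = prior / ours                                  # = S^2 A^{5m-4} H^4 / alpha^2
print(f"ours={ours:.3g}, prior={prior:.3g}, improvement={improvement:.3g}x")
assert improvement > 1
```

The improvement factor $S^2 A^{5m-4} H^4 / \alpha_{\mathrm{rev}}^2$ grows quickly with every parameter, which is why the new rates are described as substantial.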

Matthew Rosencrantz, Geoff Gordon, and Sebastian Thrun. Learning low dimensional predictive representations. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 88, 2004.

Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayes-adaptive POMDPs. Advances in Neural Information Processing Systems, 20, 2007.

Satinder Singh, Michael James, and Matthew Rudary. Predictive state representations: A new theory for modeling dynamical systems. arXiv preprint arXiv:1207.4167, 2012.

Ziang Song, Song Mei, and Yu Bai. Sample-efficient learning of correlated equilibria in extensive-form games. arXiv preprint arXiv:2205.07223, 2022.

Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pp. 2898-2933. PMLR, 2019.

, where the first inequality is due to the Cauchy-Schwarz inequality; the second inequality is due to the AM-GM inequality; the last inequality is because $\max_{a\in\mathcal{U}_{A,h}}\sum_{o:(o,a)\in\mathcal{U}_h} v(o,a) \le \|v\|_{\Pi}$. Next, we have


Linear POMDPs (Zhan et al., 2022) are a subclass of low-rank POMDPs where the latent transition and emission dynamics are linear in certain known feature maps. In the following, we present a slightly more general version of the linear POMDP definition in Zhan et al. (2022, Definition 5). Definition D.14 (Linear POMDP). A POMDP is linear with respect to a given set $\Psi$ of feature maps $(\psi_h$

, if $u$ induces an emission dynamic $\mathbb{O}_{h;u}$, then $\tilde{\mathbb{O}}_{h;u}(o|s) \ge \mathbb{O}_{h;u}(o|s)$, and

D.3.6 REGULAR PSRS

Proposition D.18 (Regular PSRs are B-stable). Any $\alpha_{\mathrm{psr}}$-regular PSR admits a B-representation $\{\mathbf{B}\}$ such that for all $1 \le h \le H$, $\|\mathbf{B}_{H:h}x\|_{\Pi} \le \|\mathbf{K}_h^{\dagger}x\|_1$, where $\mathbf{K}_h$ is any core matrix of $\mathbb{D}_h$ (cf. Example 8). Hence, any $\alpha_{\mathrm{psr}}$-regular PSR is B-stable with $\Lambda_B \le \sqrt{U_A}\,\alpha_{\mathrm{psr}}^{-1}$. As a byproduct, we show that the B-representation also has $R_B \le \alpha_{\mathrm{psr}}^{-1} A U_A$.

the case $h = H$, note that by Corollary D.2,
$$\pi(a_H|o_H)\,\big|\mathbf{B}^{\theta}_H(o_H, a_H)\mathbf{q}_{\theta}(\tau_{H-1}) - \mathbf{B}^{\bar{\theta}}_H(o_H, a_H)\mathbf{q}_{\bar{\theta}}(\tau_{H-1})\big| \;\ldots\; \big|\mathbb{P}_{\theta}(o_H|\tau_{H-1}) - \mathbb{P}_{\bar{\theta}}(o_H|\tau_{H-1})\big| \le 2 D_H\big(\mathbb{P}_{\theta}(\cdot|\tau_{H-1}), \mathbb{P}_{\bar{\theta}}(\cdot|\tau_{H-1})\big),$$

3: for $t = 1, \dots, T$ do

As we can see from the theorem above, as long as we can bound $\mathrm{edec}_{\gamma}(\Theta)$, we obtain a sample complexity bound for the EXPLORATIVE E2D algorithm. This gives Theorem 10 in the main text, which we restate below.

Then, we execute policy $\pi^t \sim p^t_{\mathrm{exp}}$, collect trajectory $\tau^t$, and update the model distribution using the same Tempered Aggregation scheme as in EXPLORATIVE E2D. After $T$ episodes, we output the empirical model $\hat{\theta}$ by computing $\mu_{\mathrm{out}} = \frac{1}{T}\sum_{t=1}^{T}\mu^t$


we get $\mathrm{edec}_{\gamma}(\Theta, \mu) \le \frac{1}{4(1-\alpha)\gamma} + \sup_{\bar{\theta}\in\Theta}\mathrm{psc}^{\mathrm{est}}_{\alpha\gamma}(\Theta, \bar{\theta})$. Recall that $\mathrm{psc}^{\mathrm{est}}_{\gamma}$ has been bounded in Theorem H.6. Taking $\alpha = 3/4$ yields
$$\mathrm{edec}_{\gamma}(\Theta, \mu) \le \big(8\Lambda_B^2 d A_{\mathcal{U}_A} H^2 + 1\big)/\gamma.$$
This completes the proof of Theorem H.2.

I.3 PROOF OF THEOREM H.4

To prove Theorem H.4, by Theorem H.3, we only need to bound the coefficients $\mathrm{amdec}_{\gamma}(\Theta, \hat{\mu})$ for all $\hat{\mu} \in \Delta(\Theta)$. By strong duality (c.f. Theorem C.2), we have


Hence, we have
where (i) is due to the fact that $\max_{\pi\in\Delta(\mathcal{A})}\sum_{a\in\mathcal{A}}\pi(a)x(a) \le (\sum_a x(a)^2)^{1/2}$, (ii) is due to the Cauchy-Schwarz inequality, in (iii) we include into the summation those $o$ such that $(o,a)$ may not belong to $\mathcal{U}_{h+1}$, and (iv) is due to (35): $\mathrm{Unif}(\mathcal{A}) \circ \mathrm{Unif}(\mathcal{U}_{A,h+1})$ is simply the uniform policy over $\mathcal{A}\times\mathcal{U}_{A,h+1}$. This concludes the proof of Lemma F.4.

G PROOF OF THEOREM 9

We first restate Theorem 9 as follows in terms of the (more relaxed) weak B-stability condition.

Theorem G.1 (Restatement of Theorem 9). Suppose every $\theta \in \Theta$ is $\Lambda_B$-stable (Definition 4) or weakly $\Lambda_B$-stable (Definition D.4), and the true model $\theta^\star \in \Theta$ has PSR rank $d_{\mathrm{PSR}} \le d$. Then, choosing $\beta = C\log(\mathcal{N}_{\Theta}(1/(KH))/\delta)$ for some absolute constant $C > 0$, with probability at least $1-\delta$, Algorithm 1 outputs a policy $\hat{\pi}_{\mathrm{out}} \in \Delta(\Pi)$ such that $V^\star - V_{\theta^\star}(\hat{\pi}_{\mathrm{out}}) \le \varepsilon$, as long as the number of episodes satisfies the bound in Theorem 9, where $\iota := \log(1 + K d U_A \Lambda_B R_B)$ with $R_B := \max_h\{1, \max_{\|v\|_1 = 1}\sum_{o,a}\|\mathbf{B}_h(o,a)v\|_1\}$.

The proof of Theorem G.1 uses the following fast-rate guarantee for the OMLE algorithm, which is standard (e.g. Van de Geer (2000); Agarwal et al. (2020)). For completeness, we present its proof in Appendix G.1.

Proposition G.2 (Guarantee of MLE). Suppose that we choose $\beta \ge 2\log\mathcal{N}_{\Theta}(1/T) + 2\log(1/\delta) + 2$ in Algorithm 1. Then with probability at least $1-\delta$, the following holds:

We next prove Theorem G.1. We adopt the definitions of $\mathbf{E}^{\bar{\theta}}_{\theta,h}(\tau_{h-1})$ as in Proposition F.1 and abbreviate $\mathbf{E}^{\star}_{k,h} = \mathbf{E}^{\theta^\star}_{\theta^k,h}$. We also condition on the success of the event in Proposition G.2.
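Proposition G.2 underlies the OMLE confidence set $\Theta^k = \{\theta : L^k(\theta) \ge \max_{\theta'} L^k(\theta') - \beta\}$. A toy sketch over a finite cover with categorical models (the models, data sizes, and δ below are illustrative stand-ins, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(4)

# Finite cover of 4 categorical models over 5 outcomes; model 0 is the truth.
models = rng.random((4, 5)) + 0.1
models /= models.sum(axis=1, keepdims=True)
data = rng.choice(5, size=500, p=models[0])        # trajectories drawn from theta*

log_lik = np.log(models[:, data]).sum(axis=1)      # L^k(theta) for each cover element
# Threshold in the style of Proposition G.2, with delta = 0.1.
beta = 2 * np.log(len(models)) + 2 * np.log(10) + 2
conf_set = np.flatnonzero(log_lik >= log_lik.max() - beta)

# With high probability, the true model survives the likelihood cut.
assert 0 in conf_set
```

The optimism in OMLE then comes from planning with the best (highest-value) model inside this confidence set, rather than with the MLE point estimate alone.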

