PAC REINFORCEMENT LEARNING FOR PREDICTIVE STATE REPRESENTATIONS

Abstract

In this paper we study online Reinforcement Learning (RL) in partially observable dynamical systems. We focus on the Predictive State Representation (PSR) model, an expressive model that captures other well-known models such as Partially Observable Markov Decision Processes (POMDPs). A PSR represents the state as a set of predictions of future observations and is defined entirely in terms of observable quantities. We develop a novel model-based algorithm for PSRs that can learn a near-optimal policy with sample complexity scaling polynomially in all the relevant parameters of the system. Our algorithm naturally works with function approximation and thus extends to systems with potentially large state and observation spaces. We show that given a realizable model class, the sample complexity of learning a near-optimal policy scales polynomially with the statistical complexity of the model class, without any explicit polynomial dependence on the size of the state and observation spaces. Notably, ours is the first work to show polynomial sample complexity for competing with the globally optimal policy in PSRs. Finally, we demonstrate how our general theorem can be directly used to derive sample complexity bounds for special models, including m-step weakly-revealing and m-step decodable tabular POMDPs, POMDPs with low-rank latent transition, and POMDPs with linear emission and latent transition.

Tests and Linear PSRs. A test is a sequence of future observations and actions. For a test t_h = (o_{h:h+W-1}, a_{h:h+W-2}) of length W ∈ N_+, we define the probability of test t_h being successful conditioned on a reachable history τ_{h-1} as P(t_h | τ_{h-1}) := P(o_{h:h+W-1} | τ_{h-1}; do(a_{h:h+W-2})), i.e., the probability of observing o_{h:h+W-1} by actively executing actions a_{h:h+W-2} conditioned on history τ_{h-1}.
When the history τ_{h-1} is unreachable, i.e., P^π(τ_{h-1}) = 0 for every policy π, we define the conditional probability P(t_h | τ_{h-1}) to be 0.

1. The do operator means that P(o_{h:h+W-1} | τ_{h-1}; do(a_{h:h+W-2})) = ∏_{t=h}^{h+W-1} P(o_t | τ_{h-1}, o_{h:t-1}, a_{h:t-1}). We remark that the conditional probability of o_{h:h+W-1} given τ_{h-1} is specified not only by the dynamics but also by the policy. Given a policy π, the conditional probability of o_{h:h+W-1} given τ_{h-1} under π is P(o_{h:h+W-1} | τ_{h-1}; a_{h:h+W-2} ∼ π) ∝ ∏_{t=h}^{h+W-1} P(o_t | τ_{t-1}) π_t(a_t | τ_{t-1}, o_t). The do(a_{h:h+W-2}) operator can then be understood as a policy that deterministically picks action a_t for h ≤ t ≤ h+W-2, i.e., π_t(A_t = · | τ_{t-1}, o_t) = δ_{a_t}.

1. INTRODUCTION

Efficient exploration strategies in reinforcement learning have been well investigated for many models, from tabular models [25, 2] to models with general function approximation [10, 27, 30, 16, 42]. These works focus on fully observable Markov decision processes (MDPs); however, their algorithms do not yield statistically efficient algorithms for partially observable Markov decision processes (POMDPs). Since the Markovian property of the dynamics is often questionable in practice, POMDPs are useful models that capture many real-life environments. While strategic exploration in POMDPs was long under-investigated due to its difficulty, it has been actively studied in the past few years [20, 3, 29]. In this work, we consider Predictive State Representations (PSRs) [36, 41, 24], a more general model of controlled dynamical systems than POMDPs. A PSR is specified by the probabilities of sequences of future observations/actions (referred to as tests) conditioned on the past history. Unlike the POMDP model, a PSR directly predicts the future given the past without modeling latent states or dynamics. PSRs can model every POMDP, and potentially yield much more compact representations; there are dynamical systems that have finite PSR rank but cannot be modeled by any POMDP with finitely many latent states [36, 24]. PSRs are not only general but also amenable to learning and scalable. First, PSRs can be efficiently learned from exploratory data using a spectral learning algorithm [6] motivated by the method of moments [23]. This learning approach enables fast closed-form sequential filtering, unlike the EM-type algorithms that would be the most natural choice from the POMDP perspective. Second, while the original PSRs were defined in the tabular setting, PSRs also support rich functional forms through kernel mean embeddings [4]. Variants of PSRs equipped with neural networks have been proposed as well [43, 9, 46, 49].
In spite of the abovementioned advances in PSR research over the past two decades, strategic exploration without exploratory data has been barely investigated. To make PSRs more practical, it is of significant importance to understand how to perform efficient strategic exploration.

                     m-step weakly-revealing POMDPs   m-step decodable POMDPs   PSRs
Efroni et al. [13]                                               +
Liu et al. [37]                    +
Jiang et al. [26]                                                                •
Uehara et al. [45]                 •                             +               •
Our work                           +                             +               +

Table 1: Comparison of our work with existing works. + means that the algorithm can learn a near globally optimal policy with polynomial sample complexity. Our work is the only one with the desired guarantee on all three models. For m-step weakly-revealing POMDPs, the • for Uehara et al. [45] means the sample complexity is quasi-polynomial but not polynomial. For m-step decodable POMDPs, all of the works have certain caveats. More specifically, for Efroni et al. [13] and Uehara et al. [45], it is unclear whether they can avoid poly(|O|^m). In contrast, our result can, perhaps surprisingly, avoid poly(|O|^m), though we need a regularity assumption; for more details, refer to Section 5. For PSRs, the • for Jiang et al. [26] means the guarantee is limited to reactive PSRs, where the optimal value function depends only on the current observation. Similarly, the • for Uehara et al. [45] means the algorithm can compete with short-memory policies but not with near globally optimal policies.

To the best of the authors' knowledge, only Jiang et al. [26] and Uehara et al. [45] tackle this challenge; however, they fail to show polynomial sample complexity for competing with the globally optimal policy. We aim to obtain algorithms that compete with the globally optimal policy with polynomial sample complexity. Another desideratum is to permit general function approximation, which is important for enjoying the scalability of PSRs.
In summary, the key question we wish to address in this work is: Can we design provably efficient RL algorithms for learning PSRs with function approximation?

Contributions. Our main contributions, summarized in Table 1, are as follows.
1. We develop the first PAC learning algorithm for PSRs that can compete with the globally optimal policy, and identify the PSR rank d_PSR as the key structural quantity of PSR systems. Given a realizable model class, our algorithm learns a near-optimal policy with sample complexity scaling polynomially in d_PSR and the statistical complexity of the model class (its log bracket number), without any explicit polynomial dependence on the size of the state and observation spaces. Thus, our approach can be applied to large-scale partially observable systems.
2. We demonstrate how our general result can be seamlessly applied to existing POMDP models with function approximation. These models include tabular m-step weakly-revealing POMDPs [37] and tabular m-step decodable POMDPs [13]. In particular, ours is the first work that simultaneously ensures PAC guarantees with polynomial sample complexity for both m-step weakly-revealing POMDPs and m-step decodable POMDPs. We further derive sample complexity results when these two types of POMDPs have two additional kinds of structure that permit large state/observation spaces: low-rank latent transitions, and linear latent transitions and observation distributions, both of which have d_PSR much smaller than |S|.

Notations. We use [n] to denote the set {1, 2, ..., n} and [n]_+ to denote the set {0, 1, 2, ..., n} for any positive integer n. For any set C, we use |C| to denote its cardinality and [x_c]_{c∈C} to denote the vector whose entries are x_c for c ∈ C. We use Δ_C to denote the set of all probability distributions over C. For any vector x, we use ‖x‖_1, ‖x‖_2 and ‖x‖_∞ to denote its ℓ_1, ℓ_2 and ℓ_∞ norms.
For any matrix M, we use (M)_{i,j} to denote the (i, j)-th entry of M and M^† to denote the pseudo-inverse of M. We use ‖M‖_{∞,∞} to denote max_{i,j} |(M)_{i,j}| and ‖M‖_{1→1} to denote the induced ℓ_1 norm sup_{‖x‖_1=1} ‖Mx‖_1. In addition, we use σ_min(M) to denote the minimum nonzero singular value of M and σ_n(M) to denote the n-th largest singular value of M.

Related works. Our work is most closely related to the literature on provable online RL algorithms for PSRs without offline exploratory data. Although there is a growing body of literature on efficient online learning for POMDPs under various structures [3, 20, 34, 35, 40, 7], there are few works [26, 45] that study strategic exploration in PSRs, and none of them obtain polynomial sample complexity results for learning globally optimal policies. See Appendix B for details.

Figure 1: Illustration of the key concepts in PSRs using the system dynamics matrix D_h, indexed by all tests and all histories. Denote by d_PSR,h the rank of D_h. A core test set U_{h+1} is a subset of tests such that the submatrix D̄_h of D_h, whose rows are indexed by the tests in U_{h+1}, has rank d_PSR,h. Thus any row of D_h can be written as a linear combination of the rows of D̄_h. A core test set U_{h+1} whose size is exactly d_PSR,h is called a minimum core test set. The minimum core history set is a size-d_PSR,h subset of histories such that the submatrix K_h of D̄_h, whose columns are indexed by the histories in the minimum core history set, has rank d_PSR,h. Any column of D̄_h can be written as a linear combination of the columns indexed by histories in the minimum core history set.
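The objects in Figure 1 are easy to mimic numerically. Below is a small sketch (a synthetic low-rank matrix with made-up sizes, not the paper's construction) that computes d_PSR,h and greedily extracts a minimum core test set:

```python
import numpy as np

# Synthetic system dynamics matrix D_h (rows = tests, columns = histories)
# with planted rank 3; only the low-rank structure matters here.
rng = np.random.default_rng(0)
D = rng.random((8, 3)) @ rng.random((3, 10))

rank = np.linalg.matrix_rank(D)          # d_PSR,h

# Greedily pick rows that increase the rank. The selected tests form a
# minimum core test set: the submatrix has rank d_PSR,h using exactly
# d_PSR,h rows.
core_tests = []
for i in range(D.shape[0]):
    if np.linalg.matrix_rank(D[core_tests + [i], :]) > len(core_tests):
        core_tests.append(i)
    if len(core_tests) == rank:
        break

assert np.linalg.matrix_rank(D[core_tests, :]) == rank
```

Every row of D is then a linear combination of the selected rows, which is exactly the property Lemma 1 formalizes.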

2. PRELIMINARIES

In this section, we introduce the definition and key properties of PSRs. After that, we state our learning objective for PSRs with function approximation.

2.1. PREDICTIVE STATE REPRESENTATIONS

We consider an episodic sequential decision-making process P = {O, A, P, {r_h}_{h=1}^H, H}, where O is the observation space, A is the action space, P is the system dynamics, r_h is the reward function at step h, and H is the length of each episode. We suppose the reward r_h at step h is a deterministic function of (o_h, a_h), where τ_h = (o_1, a_1, ..., o_h, a_h) denotes the history. We assume the initial observation o_1 of each episode follows a fixed distribution μ_1 ∈ Δ_O. At step h ∈ [H], the agent observes o_h and takes action a_h based on the whole history (τ_{h-1}, o_h). After that, the agent receives its reward r_h(o_h, a_h) and the environment generates o_{h+1} ∼ P(· | τ_h). After the agent takes a_H, we suppose the environment only generates dummy observations o_dummy, no matter what actions the agent takes afterward.

Policy and value. A policy π = {π_h : (O × A)^{h-1} × O → Δ_A}_{h=1}^H specifies the action-selection probability at each step conditioned on the history (τ_{h-1}, o_h). Given any policy π, its value V^π is the expected cumulative reward, V^π := E_π[Σ_{h=1}^H r_h], where the expectation is with respect to the distribution of trajectories induced by executing π in the environment. We use P^π(τ) to denote the probability of trajectory τ when executing policy π in the environment.

Now, we define the one-step system dynamics matrix D_h, whose rows are indexed by tests and whose columns are indexed by histories; the entry corresponding to the test-history pair (t_{h+1}, τ_h) equals P(t_{h+1} | τ_h) (see Figure 1 for an illustration). Denote d_PSR,h = rank(D_h). Linear PSRs are then defined to be systems with low-rank one-step system dynamics matrices:

Definition 1. A partially observable system is called a linear PSR with rank d_PSR if max_h rank(D_h) = d_PSR.

Core tests. For time step h, consider a set of tests U_{h+1} ⊂ ∪_{C∈N_+} O^C × A^{C-1}. If the submatrix D̄_h of D_h (see the matrix inside the green box in Fig. 1), whose rows are indexed by the tests in U_{h+1} and whose columns are indexed by all histories, has rank d_PSR,h, then we call U_{h+1} a core test set. The key property of a core test set is that, by linear algebra, any row of D_h can be expressed as a linear combination of the rows of D̄_h. This is formalized as follows.

Lemma 1 (Core test sets in linear PSRs). For any h ∈ [H-1]_+, a set U_{h+1} ⊂ ∪_{C∈N_+} O^C × A^{C-1} is a core test set at step h+1 if and only if for any W ∈ N_+, any possible future (i.e., test) t_{h+1} = (o_{h+1:h+W}, a_{h+1:h+W-1}) ∈ O^W × A^{W-1}, and any history τ_h, there exists m_{t_{h+1},h+1} ∈ R^{|U_{h+1}|} such that P(t_{h+1} | τ_h) = ⟨m_{t_{h+1},h+1}, [P(u | τ_h)]_{u∈U_{h+1}}⟩.
The vector [P(u | τ_h)]_{u∈U_{h+1}} is referred to as the predictive state at step h+1. Throughout this work, we use q_{τ_h} to denote [P(u | τ_h)]_{u∈U_{h+1}} and q_0 to denote the initial predictive state [P(u)]_{u∈U_1}. We are particularly interested in the set of all action sequences appearing in U_h, which we denote by U_{A,h}. A core test set with the smallest number of tests is called a minimum core test set, denoted by Q_h. Note that by the definition of rank, |Q_{h+1}| = d_PSR,h. To simplify notation, we further define |U| := max_{h∈[H]} |U_h| and |U_A| := max_{h∈[H]} |U_{A,h}|. In this paper we assume a core test set U_h is given (we will see that this is a natural assumption for models such as POMDPs) while Q_h is unknown. This setting is standard in the PSR literature [6].

Minimum core histories. Similar to the minimum core test set, we can define the minimum core history set. Consider the matrix D̄_h in Figure 1. Recall that the columns of D̄_h are indexed by all possible length-h histories and each column is q_{τ_h}. Since D̄_h has rank d_PSR,h, there must exist d_PSR,h histories τ_h^1, ..., τ_h^{d_PSR,h} such that any column of D̄_h is a linear combination of the columns corresponding to τ_h^1, ..., τ_h^{d_PSR,h}. In other words, for any length-h history τ_h, there exists a vector v_{τ_h} ∈ R^{d_PSR,h} which satisfies q_{τ_h} = K_h v_{τ_h}, (2) where K_h ∈ R^{|U_{h+1}| × d_PSR,h} is a full-rank matrix whose i-th column is q_{τ_h^i}. We call {τ_h^1, ..., τ_h^{d_PSR,h}} the minimum core histories at step h and K_h the core matrix; see Figure 1 for an illustration of K_h. In particular, when h = 0, we have K_0 = q_0. Note that (2) shows that all length-h histories are captured by the core histories, in the sense that the predictive state given any history can be expressed as a linear combination of the predictive states corresponding to the minimum core histories.
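A quick numerical sketch of the decomposition q_τ = K_h v_τ in (2): on a synthetic rank-2 matrix of predictive states (columns playing the role of histories; sizes are made up), pick linearly independent columns as core histories and express every column in their span.

```python
import numpy as np

# Synthetic matrix of predictive states: |U_{h+1}| x (#histories), planted rank 2.
rng = np.random.default_rng(4)
Dbar = rng.random((5, 2)) @ rng.random((2, 8))

# Greedily pick columns (histories) that increase the rank: a minimum core
# history set; the chosen columns form the core matrix K_h.
core_hist = []
for j in range(Dbar.shape[1]):
    if np.linalg.matrix_rank(Dbar[:, core_hist + [j]]) > len(core_hist):
        core_hist.append(j)
K = Dbar[:, core_hist]

# Every predictive state q_tau is a linear combination of the core columns:
# solve K v = q_tau by least squares and verify the residual is zero.
for j in range(Dbar.shape[1]):
    v, *_ = np.linalg.lstsq(K, Dbar[:, j], rcond=None)
    assert np.allclose(K @ v, Dbar[:, j])
```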
The minimum core histories and the core matrix need not be unique given the core test set. We define K_h to be the core matrix with the smallest ‖K_h^†‖_{1→1} to facilitate our subsequent analysis.

PSRs vs. POMDPs. PSRs have much stronger expressivity than POMDPs. Every POMDP can be expressed as a PSR whose minimum core test set has size at most |S|, while PSRs are not necessarily compact POMDPs [36]. In Appendix D we construct a sequential decision-making process such that any POMDP formulation of it requires a number of states exponentially larger than the core test set size of the PSR formulation. The key intuition behind the construction is simple: the non-negative rank of a matrix can be exponentially larger than its rank. The literature [41] also contains other concrete instances, such as the probability clock, which no POMDP with finitely many latent states can model but a finite-rank PSR can.

In the following, we explain why POMDPs are PSRs. Consider an episodic POMDP (S, O, A, {T_h}_{h=1}^H, {O_h}_{h=1}^H, {r_h}_{h=1}^H, H, μ_1), where S is the state space, O is the observation space, A is the action space, O_h is the emission matrix at step h, r_h is the reward function at step h, μ_1 is the initial state distribution, and T_h is the transition matrix at step h, with (T_{h,a})_{s',s} = P_h(s' | s, a). The following lemma shows that any POMDP is a PSR whose minimum core test set has size at most |S|; the proof is deferred to Appendix E.

Lemma 2. Every POMDP satisfies the definition of a PSR (Definition 1) with d_PSR ≤ |S|.

Having shown that any POMDP is a linear PSR with rank at most |S|, we now demonstrate under what conditions we can find a core test set. We focus here on 1-step weakly-revealing POMDPs [29, 37], i.e., rank(O_h) = |S| for all h, in which case O is a core test set.

Lemma 3. When rank(O_h) = |S| for all h ∈ [H], the POMDP is a PSR with core test set U_h = O for all h ∈ [H]. We defer the proof and other examples, including m-step weakly-revealing POMDPs [37], latent MDPs [34], m-step decodable POMDPs [13], and low-rank POMDPs, to Appendix E and Section 5.
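A quick numerical sanity check of Lemmas 2 and 3 on randomly drawn POMDP quantities (all sizes made up): the dynamics matrix restricted to one-step tests factors through the |S|-dimensional state distribution, so its rank is at most |S|; and when the emission matrix has full column rank, the single observations already determine that rank.

```python
import numpy as np

rng = np.random.default_rng(1)
S, num_obs, n_hist = 3, 6, 20
Om = rng.dirichlet(np.ones(num_obs), size=S)   # emission: Om[s, o] = P(o | s)
B = rng.dirichlet(np.ones(S), size=n_hist).T   # B[:, tau] = P(s_h = . | tau), per history

# One-step dynamics matrix on single-observation tests:
# D[o, tau] = P(o | tau) = sum_s P(o | s) P(s | tau) = (Om^T B)[o, tau].
D = Om.T @ B
assert np.linalg.matrix_rank(D) <= S           # Lemma 2: rank is at most |S|

# 1-step weakly-revealing: rank(Om) = |S| (holds almost surely here), and then
# rank(D) = rank(B), matching Lemma 3's claim that U = O is a core test set.
assert np.linalg.matrix_rank(Om) == S
assert np.linalg.matrix_rank(D) == np.linalg.matrix_rank(B)
```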

2.1.2. SYSTEM DYNAMICS OF PSRS

Predictive states evolve just like beliefs in POMDPs, which means we can efficiently characterize the system dynamics of a PSR via predictive states. In particular, for any o ∈ O, a ∈ A, h ∈ [H], let M_{o,a,h} ∈ R^{|U_{h+1}| × |U_h|} denote the matrix whose rows are m_{(o,a,u),h} (defined in Lemma 1) for u ∈ U_{h+1} (note that (o, a, u) can be understood as a test that starts with o, a, followed by u). Then the probability of an arbitrary trajectory can be expressed as a product of M_{o,a,h}, m_{o_H,H}, and q_0, as shown in the following lemma.

Lemma 4. For any trajectory τ_H and policy π, we have P^π(τ_H) = m_{o_H,H}^⊤ (∏_{h=1}^{H-1} M_{o_h,a_h,h}) q_0 × π(τ_H), (3) where π(τ_H) := ∏_{h=1}^H π(a_h | τ_{h-1}, o_h) is the probability of the actions chosen in the trajectory.

More generally, for any h ∈ [H] and trajectory τ_h, letting b_{τ_h} := (∏_{l=1}^h M_{o_l,a_l,l}) q_0, we have [P(u | τ_h) P^π(τ_h)]_{u∈U_{h+1}} = b_{τ_h} × π(τ_h). The proof is deferred to Appendix F. Lemma 4 shows that the parameters {M_{o,a,h}, m_{o_H,H}, q_0}_{o∈O, a∈A, h∈[H-1]} suffice to characterize a PSR. We call M_{o,a,h} the predictive operator matrix. Recall that in POMDPs the same decomposition holds, since we can represent {M_{o,a,h}, m_{o_H,H}, q_0} using {T_h, O_h}, as we see in the proof of Lemma 3. However, as emphasized in Singh et al. [41], the main reason PSRs are more expressive is that {M_{o,a,h}, m_{o_H,H}, q_0} are not constrained to be non-negative.
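To make Lemma 4 concrete, the following sketch builds predictive operator matrices from a random POMDP whose emission matrix has full column rank (so the single observations form a core test set, as in Lemma 3), and checks that the operator-product formula reproduces the trajectory probability obtained by forward filtering. The sizes and the specific operator construction M_{o,a} = Om^⊤ T_a^⊤ diag(Om[:, o]) (Om^⊤)^† are illustrative assumptions, not the paper's derivation.

```python
import numpy as np

rng = np.random.default_rng(2)
S, num_obs = 3, 4                            # |O| > |S|: emission full column rank a.s.
T = rng.dirichlet(np.ones(S), size=(2, S))   # T[a][s] = P(. | s, a), 2 actions
Om = rng.dirichlet(np.ones(num_obs), size=S) # Om[s, o] = P(o | s)
mu = rng.dirichlet(np.ones(S))               # initial state distribution

# With core tests U = O, the predictive state is Om^T s for the unnormalized
# state vector s, and the operator below realizes Lemma 4's recursion.
Jp = np.linalg.pinv(Om.T)
def M(o, a):
    return Om.T @ T[a].T @ np.diag(Om[:, o]) @ Jp

q0 = Om.T @ mu                               # initial predictive state [P(o_1)]_o

traj_o, traj_a = [0, 2, 1], [1, 0]           # (o_1, a_1, o_2, a_2, o_3), actions fixed
v = q0
for o, a in zip(traj_o[:-1], traj_a):
    v = M(o, a) @ v
p_psr = v[traj_o[-1]]                        # m_{o_H} is one-hot since o_H is a core test

# Ground truth by forward filtering in the POMDP.
s = mu
for o, a in zip(traj_o[:-1], traj_a):
    s = T[a].T @ (s * Om[:, o])
p_true = (s * Om[:, traj_o[-1]]).sum()
```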

2.1.3. LEARNING OBJECTIVE

In this paper, we study online learning in PSRs with the goal of finding a near-optimal policy. Letting π̂ denote the output policy, our goal is to find an ε-optimal policy using a polynomial number of samples, i.e., V* − V^{π̂} ≤ ε, where V* := V^{π*} = sup_π V^π and π* is the optimal policy.

2.2. FUNCTION APPROXIMATION

To deal with potentially large observation and action spaces, we consider learning with function approximation. We use a function class F to approximate the true model and let P^π_f(τ_H) denote the probability of trajectory τ_H under policy π and model f. We assume that the models in F are all valid PSRs with core test sets {U_h}_{h∈[H]}, which implies that for each f ∈ F we can compute its corresponding predictive operator matrices, initial predictive state, and core matrices, denoted M_{o,a,h;f}, q_{0;f}, K_{h;f}, respectively. We define V^π_f to be the value of policy π under model f, and use f* to denote the true model. Generally, we parametrize models by {M_{o,a,h;f}, q_{0;f}}, since this is the most natural parametrization for PSRs. When we have more prior knowledge, e.g., the models are POMDPs, we can instead parametrize by {T_h, O_h, μ_1}. To measure the size of F, we use its cardinality |F| when F is finite. For infinite function classes, we introduce the ε-bracket number, defined as follows.

Definition 2 (ε-bracket and ε-bracket number). A size-N ε-bracket is a collection of pairs {g_1^i, g_2^i}_{i=1}^N, where g_1^i and g_2^i map any policy π and trajectory τ to R, such that for all i ∈ [N], ‖g_1^i(π, ·) − g_2^i(π, ·)‖_1 ≤ ε for any policy π, and for any f ∈ F there exists i ∈ [N] such that g_1^i(π, τ_H) ≤ P^π_f(τ_H) ≤ g_2^i(π, τ_H) for all τ_H, π. The ε-bracket number of F, denoted N_F(ε), is the minimum size of such an ε-bracket.

Although P^π_f is an (|O||A|)^H-dimensional vector, its log ε-bracket number does not scale exponentially with H, because P^π_f is Lipschitz continuous with respect to {M_{o,a,h;f}, q_{0;f}}, whose dimension scales only polynomially with H.
In Appendix G we show that the bracket number of F can be upper bounded by the covering number of {M_{o,a,h;f}, q_{0;f}} in linear PSRs, and we provide exact upper bounds for tabular PSRs and various POMDPs.

3. ALGORITHM: CRANE

The statistical hardness of learning POMDPs due to partial observability is well known in the literature [32], and it carries over to PSR learning since PSRs are a more general model. In addition, existing algorithms [29, 13, 37] for learning subclasses of POMDPs require the existence of latent states, since they parametrize models directly by T and O; thus, their methods are not applicable to PSRs. That said, the existence of predictive states implies the low-rank linear structure of PSRs, and the trajectory probability decomposition (3) suggests that we can capture a PSR completely as long as we can efficiently learn the predictive operator matrices {M_{o,a,h}}_{o∈O, a∈A, h∈[H-1]} and the initial predictive state q_0. Therefore, inspired by the success of maximum log-likelihood estimation (MLE) in learning weakly-revealing POMDPs [37], we propose a new MLE-based PSR learning algorithm to learn these parameters.

CRANE. Intuitively, our algorithm is an iterative MLE algorithm with optimism: in each iteration we use MLE to estimate the model parameters from the previously collected trajectories and choose an optimistic policy to execute. We call it OptimistiC PSR leArniNg with MLE (CRANE). CRANE consists of three main steps, detailed in Algorithm 1:

• Optimism: Since we consider the online learning problem, the unknown model dynamics force us to deal with the exploration-exploitation tradeoff. We utilize the Optimism in the Face of Uncertainty principle and choose an optimistic estimate f^k of the model parameters from the constructed confidence set B^k. Our policy π^k is the optimal policy under f^k, ensuring that V^{π^k}_{f^k} ≥ V* with high probability.
• Trajectory collection: For each step h ∈ [H−1]_+ and each action sequence u_{a,h+1} ∈ U_{A,h+1}, we collect a trajectory τ_H^{k,u_{a,h+1},h} by executing the policy π^{k,u_{a,h+1},h} = π^k_{1:h−1} ∘ Unif(A) ∘ u_{a,h+1} (and the uniform policy afterwards if the episode has not ended). This gives us the information required to estimate each predictive operator matrix M_{o,a,h} and the initial predictive state q_0.

• Parameter estimation with MLE: Finally, we update the confidence set with the newly collected trajectories. We do so by running MLE on all collected trajectories with slackness β:

B^{k+1} ← { f ∈ F : Σ_{(π,τ_H)∈D} log P^π_f(τ_H) ≥ max_{f'∈F} Σ_{(π,τ_H)∈D} log P^π_{f'}(τ_H) − β }. (5)

For example, the likelihood P^π_f(τ_H) is given by (3) if we model {M_{o,a,h}, q_0} as f. For POMDPs, if we model {T_h, O_h, μ_1} as f, the likelihood is obtained by marginalizing over latent states. The slackness β is chosen so that the true model satisfies f* ∈ B^{k+1} with high probability, which in turn guarantees optimism in the first step.

Comparison with Liu et al. [37]. The main difference is that our algorithm allows more general models. For example, for PSRs we can take {M_{o,a,h}, q_0}, which depends only on observable quantities, as a model f. In contrast, Liu et al. [37] parametrize models by {T_h, O_h}, which involve latent states. The practical benefit of modeling {M_{o,a,h}, q_0} is that we need not specify the latent space, while Liu et al. [37] must. Since we often lack good prior knowledge about latent states, our algorithm is more practical in such scenarios. Due to this generality, our algorithm captures more models, such as m-step decodable POMDPs and low-rank POMDPs, as we will see in the following sections.

Algorithm 1 CRANE

Input: confidence parameter β.
Initialize B^1 ← F, D ← ∅.
for k = 1, ..., T do
    Optimistic planning: (f^k, π^k) ← argmax_{f∈B^k, π} V^π_f.
    Collect samples: for h ∈ [H−1]_+, u_{a,h+1} ∈ U_{A,h+1} do
        Execute π^{k,u_{a,h+1},h} = π^k_{1:h−1} ∘ Unif(A) ∘ u_{a,h+1} and add the collected trajectory to D.
    end for
    Update confidence set: compute B^{k+1} via (5).
end for
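The confidence-set update (5) is the heart of the loop. Below is a minimal stand-alone sketch on a toy model class; biased coins stand in for PSR models, and all names are hypothetical.

```python
import numpy as np

# Keep every model whose total log-likelihood on the collected data is within
# beta of the maximum, mirroring the update (5).
def update_confidence_set(models, dataset, loglik, beta):
    """dataset: list of (policy, trajectory) pairs; loglik(f, pi, tau) = log P^pi_f(tau)."""
    scores = {f: sum(loglik(f, pi, tau) for pi, tau in dataset) for f in models}
    best = max(scores.values())
    return [f for f in models if scores[f] >= best - beta]

models = [0.3, 0.5, 0.7]                        # candidate values of P(heads)
data = [(None, h) for h in [1, 1, 0, 1, 1, 1]]  # the "policy" is irrelevant here
ll = lambda p, _pi, h: np.log(p if h else 1 - p)
survivors = update_confidence_set(models, data, ll, beta=1.5)
```

With five heads in six flips, the near-maximum-likelihood models survive while poorly fitting ones are eliminated; shrinking β tightens the set, exactly the role the slackness plays in CRANE.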

4. MAIN RESULT

Next, we present the analysis of CRANE. We will utilize the fact that the core matrix K_h is full rank. However, matrix rank is vulnerable to estimation error, since small perturbations can change the rank drastically. We therefore assume that the induced ℓ_1 norm of K_h^† is upper bounded, which is a more robust assumption than K_h merely being full rank. A similar assumption is often imposed in the PSR literature [27, Appendix B.4].

Assumption 1 (α-regularity of PSRs). There exists α > 0 such that for any h ∈ [H−1]_+, we have ‖K_h^†‖_{1→1} ≤ 1/α.

Remark 1. ‖K_h^†‖_{1→1} can be upper bounded by d_PSR,h / σ_min(K_h). In the POMDP literature, many works [29, 13, 37] assume a similar condition called the α-weakly-revealing condition: the minimum singular value of the observation matrix or the multi-step observation matrix is lower bounded by α. Assumption 1 can be regarded as a generalization of the weakly-revealing condition to PSRs, viewing core histories as the "states" and the tests u ∈ U_h as the "observations".

In addition, to simplify the analysis, we assume all one-step observations o ∈ O belong to U_H. This does not harm the generality of our model, since augmenting the core test set is always feasible and adding all one-step observations increases |U_A| by at most one.

Assumption 2. For all o ∈ O, we assume that o ∈ U_H.

This assumption implies we may take m_{o,H} = e_{o,H}, the one-hot vector indexing observation o in U_H. To see this, note that o ∈ U_H implies the predictive state q_{τ_{H−1}} contains the probability P(o | τ_{H−1}); thus, with m_{o,H} = e_{o,H}, we have m_{o,H}^⊤ q_{τ_{H−1}} = P(o | τ_{H−1}). Therefore, when Assumption 2 holds, we can assume without loss of generality that m_{o,H;f} = e_{o,H} for all o and all models induced by F. Furthermore, we impose the following constraints on the function class F.

Assumption 3. The function class F satisfies: (1) Realizability: f* ∈ F. (2) Regularity: For all f ∈ F and h ∈ [H−1]_+, ‖K_{h;f}^†‖_{1→1} ≤ 1/α. (3) Validity: For all f ∈ F, the dynamics induced by f form a valid PSR with core test sets {U_h}_{h∈[H]}, i.e., the trajectory probability P^π_f is a valid distribution for any policy π and satisfies the definition of PSRs.

The last two constraints (2), (3) in Assumption 3 can easily be satisfied by eliminating functions that violate regularity or validity. Notice that the system dynamics in (3) only use the inner products of m_{(o,a,u),h;f} with q_{τ_{h−1};f}, and q_{τ_{h−1};f} lives in the column space of K_{h−1;f} (by (2)), which implies there is redundancy in the choice of m_{(o,a,u),h;f} given the model P^π_f. Among these possible choices, we can always find one lying in the column space of K_{h−1;f}: if we replace any m_{(o,a,u),h;f} with its projection onto the span of {q_{τ_{h−1};f}}_{τ_{h−1}} (which is exactly the column space of K_{h−1;f}), the resulting model dynamics remain the same. Formally, given any P^π_f, there exists a set {m_{(o,a,u),h;f}}_{o∈O, a∈A, u∈U_{h+1}, h∈[H−1]} such that each m_{(o,a,u),h;f} belongs to the column space of K_{h−1;f} (the proof is deferred to Appendix H). Therefore, in the following, we let M_{o,a,h;f} consist of such m_{(o,a,u),h;f} without loss of generality.
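The regularity quantity of Assumption 1 is easy to compute: for any matrix, the induced norm ‖M‖_{1→1} = sup_{‖x‖_1=1} ‖Mx‖_1 equals the largest column ℓ_1-norm. A sketch with a made-up core matrix, which also checks the bound from Remark 1:

```python
import numpy as np

# Induced 1->1 norm: the maximum column l1-norm.
def one_to_one_norm(M):
    return np.abs(M).sum(axis=0).max()

K = np.array([[0.6, 0.1],
              [0.3, 0.2],
              [0.1, 0.7]])                       # made-up |U_{h+1}| x d_PSR,h core matrix
alpha_inv = one_to_one_norm(np.linalg.pinv(K))   # Assumption 1: this is <= 1/alpha

# Remark 1's bound: ||K^+||_{1->1} <= d_PSR,h / sigma_min(K).
bound = K.shape[1] / np.linalg.svd(K, compute_uv=False).min()
```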

With the above assumptions, we have the following theorem, which characterizes the sample complexity of CRANE for learning a near-optimal policy; the proof is deferred to Appendix I. (Table 2 summarizes, for each of our example models, the core test set U_h, the PSR rank d_PSR, and the log bracket number log N_F(ε); e.g., for tabular m-step weakly-revealing POMDPs, U_h = (O × A)^{m−1} × O and d_PSR ≤ |S|.)

Theorem 1 (Sample complexity). Under Assumptions 1, 2, and 3, for any δ ∈ (0, 1] and ε > 0, if we choose T = 1/ε² × poly(d_PSR, |U_A|, 1/α, log N_F(ε_b), H, |A|, log |O|, log(1/δ)) and set β = c log(N_F(ε_b) T H |U_A| / δ), then with probability at least 1 − δ we have V^{π̄} ≥ V* − ε, where π̄ is the uniform mixture of the output policies, i.e., π̄ = Unif({π^k}_{k=1}^T).

Theorem 1 shows that the sample complexity of CRANE depends only polynomially on the PSR rank d_PSR, the size of U_{A,h}, the bound 1/α on the induced ℓ_1 norm of the pseudo-inverse of the core matrix, the log bracket number log N_F(ε_b) of the function class, H, and |A|. CRANE avoids direct dependence on poly(|O|), and our sample complexity remains the same even if the observation parts of the core test sets U_h are large. Via the relationship between POMDPs and PSRs, Theorem 1 applies to m-step weakly-revealing tabular POMDPs (including undercomplete POMDPs [29] and overcomplete POMDPs [37]), m-step weakly-revealing low-rank POMDPs [47], m-step weakly-revealing linear POMDPs, and m-step decodable POMDPs [13], which we elaborate on in Section 5 and Appendix A. The dependence of Theorem 1 on 1/α is unavoidable in the worst case; we state the lower bound formally in Appendix K.

Proof techniques of Theorem 1. The existing analyses for POMDPs do not apply to PSRs, since PSRs assume no latent states, let alone emission and transition matrices. In our proof, we exploit the linear structure of PSRs and leverage the core matrix K_h to bound the error propagation induced by the product of predictive operator matrices ∏_{h=1}^{H−1} M_{o_h,a_h,h} in (3).
This key step enables us to bound the difference between the model dynamics P^π_f and P^π (i.e., P^π_{f*}) by the estimation errors of M_{o_h,a_h,h} and q_0, and thus obtain a polynomial bound on the total suboptimality.

Comparison with existing works on PSRs. As far as we know, only two prior works tackle provably efficient RL for PSRs. Jiang et al. [26] show a polynomial sample complexity result for reactive PSRs, where the optimal value function depends only on the current observation. Later, Uehara et al. [45] show a favorable sample complexity result without this assumption. However, their result is agnostic-type and depends on (|O||A|)^M when competing with M-memory policies; thus, for competing with the globally optimal policy, their results do not imply a polynomial sample complexity bound.

5. EXAMPLES

In this section, we illustrate the sample complexity of CRANE for learning m-step weakly-revealing/decodable tabular POMDPs and low-rank POMDPs. We defer further details (including the concrete function classes used to satisfy Assumption 3 and comparisons with existing works) and other examples, including tabular PSRs and m-step weakly-revealing/decodable linear POMDPs, to Appendix A. We identify the minimum core test set size and the bracketing number of the related models, summarized in Table 2; the proofs are deferred to Appendices E and G.

5.1. m-STEP WEAKLY-REVEALING TABULAR POMDPS

We first focus on m-step weakly-revealing tabular POMDPs [37], defined as follows.

Definition 3 (m-step weakly-revealing tabular POMDPs). Define the m-step emission matrix O_{h,m} ∈ R^{|A|^{m−1}|O|^m × |S|} for any h ∈ [H−m+1] as follows: (O_{h,m})_{(a,o),s} := P(o_{h:h+m−1} = o | s_h = s, a_{h:h+m−2} = a) for all (a, o) ∈ A^{m−1} × O^m and s ∈ S. When rank(O_{h,m}) = |S|, the POMDP is called m-step weakly-revealing.

This assumption means that the observations leak at least some information about the states, so that the POMDP can be learned efficiently. Substituting the results in Table 2 into Theorem 1, we obtain the following sample complexity for learning m-step weakly-revealing tabular POMDPs.

Corollary 1 (Sample complexity for m-step weakly-revealing tabular POMDPs). Suppose the POMDP is m-step weakly-revealing and we execute CRANE with β = c log(N_F(ε_b) T H |U_A| / δ) up to step H − m, where U_h and N_F(ε_b) are specified in Table 2. Then under Assumption 1, for any δ ∈ (0, 1] and ε > 0, if we choose T = 1/ε² × poly(d_PSR, |A|^m, 1/α, |O|, |S|, H, log(1/δ)), then with probability at least 1 − δ we have V^{π̄} ≥ V* − ε.

5.2. m-STEP WEAKLY-REVEALING LOW-RANK POMDPS

Next, we consider m-step weakly-revealing low-rank POMDPs. We first define low-rank POMDPs as a special subclass of POMDPs.

Definition 4 (Low-rank POMDPs). Suppose the transition kernel T_h has the low-rank form T_h(s' | s, a) = ψ_h(s')^⊤ φ_h(s, a) for all h ∈ [H], where ψ_h and φ_h map into R^{d_trans}.

For m-step weakly-revealing low-rank POMDPs, if we choose T = 1/ε² × poly(d_trans, |A|^m, 1/α, log |F|, H, log |O|, log(1/δ)), then with probability at least 1 − δ we have V^{π̄} ≥ V* − ε. Notice that the sample complexity depends only on d_trans rather than |S| for low-rank POMDPs.
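Definition 3's m-step emission matrix can be assembled directly by forward filtering over fixed action sequences. The sketch below (a random POMDP with made-up sizes) builds O_{h,m} and checks the weakly-revealing rank condition:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
S, num_obs, A, m = 3, 2, 2, 3
T = rng.dirichlet(np.ones(S), size=(A, S))   # T[a][s] = P(. | s, a)
Om = rng.dirichlet(np.ones(num_obs), size=S) # Om[s, o] = P(o | s)

def emit_prob(s, obs, acts):
    """P(o_{h:h+m-1} = obs | s_h = s, a_{h:h+m-2} = acts) by forward filtering."""
    p = np.zeros(S); p[s] = 1.0
    for t, o in enumerate(obs):
        p = p * Om[:, o]                     # emit o_t
        if t < len(acts):
            p = T[acts[t]].T @ p             # transition under a_t
    return p.sum()

# Rows indexed by (a, o) in A^{m-1} x O^m, columns by states s.
O_m = np.array([[emit_prob(s, obs, acts) for s in range(S)]
                for acts in product(range(A), repeat=m - 1)
                for obs in product(range(num_obs), repeat=m)])

# Sanity check: each fixed action sequence yields a distribution over
# observation sequences, so every |O|^m-row block sums to one per state.
for i in range(A ** (m - 1)):
    block = O_m[i * num_obs**m:(i + 1) * num_obs**m]
    assert np.allclose(block.sum(axis=0), 1.0)

rank = np.linalg.matrix_rank(O_m)            # m-step weakly-revealing iff rank == |S|
```

For generic (randomly drawn) parameters the matrix has full column rank, i.e., the POMDP is m-step weakly-revealing almost surely.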

5.3. m-STEP DECODABLE TABULAR/LOW-RANK POMDPS

Next, we instantiate our result on m-step decodable POMDPs [13], defined as follows.

Definition 5 (m-step decodable POMDPs). There exist unknown decoders {φ_dec,h}_{m≤h≤H} such that for every reachable trajectory τ_H, we have s_h = φ_dec,h(z_h) for all m ≤ h ≤ H, where z_h = ((o, a)_{h−m+1:h−1}, o_h).

This definition says that we can decode the current state from an m-step history. Surprisingly, Table 2 shows that m-step decodable POMDPs can be formulated as PSRs just like weakly-revealing POMDPs, which leads to the following corollary:

Corollary 3 (Sample complexity for m-step decodable POMDPs).
• In m-step decodable tabular POMDPs, the same statement as in Corollary 1 holds.
• In m-step decodable low-rank POMDPs, the same statement as in Corollary 4 holds.

Remark 2 (Tabular PSRs and linear POMDPs). We can also instantiate our result on tabular PSRs. The sample complexity is polynomial in all parameters: d_PSR, |O|, |A|, |U|, H, 1/ε, 1/α, log(1/δ). Here, we leverage the observation about covering numbers after Definition 2. We also consider linear POMDPs, where latent transitions and emissions follow linear structures. While similar models have been considered [7, 47], our result is still new in that our model is more general. The details are deferred to Appendix A.

6. CONCLUSION

We consider PAC learning in PSRs, which represent states as a vector of predictions about future events conditioned on histories. We propose CRANE and show polynomial sample complexities for competing with the globally optimal policy.

A.1 m-STEP WEAKLY-REVEALING TABULAR POMDPS

Remarks about Assumption 2. For any trajectory τ_H, we have P^π_f(τ_H) = e_{(o_{H−m+1:H}, a_{H−m+1:H−1}), H−m+1}^⊤ ∏_{l=1}^{H−m} M_{o_l,a_l,l} q_0 × π(τ_H). Note that here U_H does not contain the observation space. Nevertheless, we can replace (17) with V^{π^k}_{f^k} − V^{π^k} ≤ H Σ_{τ_{H−m}} ‖∏_{h=1}^{H−m} M^k_{o_h,a_h,h} · q^k_0 − ∏_{h=1}^{H−m} M_{o_h,a_h,h} · q_0‖_1 × π^k(τ_{H−m}), and follow the same proof to show that Theorem 1 still holds even though Assumption 2 is not satisfied.

Remarks about Corollary 1. By executing CRANE up to step H − m we mean that when collecting trajectories, we only execute π^{k,u_{a,h+1},h} and collect τ^{k,u_{a,h+1},h}_H for h ∈ [H − m]^+ and u_{a,h+1} ∈ U_{A,h+1}. Since d_PSR ≤ |S|, the sample complexity from Corollary 1 is no larger than poly(|S|, H, |A|^m, 1/α, 1/ε, |O|, log(1/δ)). This indicates that CRANE achieves polynomial sample complexity for m-step weakly-revealing tabular POMDPs.

Comparison with [37]. In m-step weakly-revealing tabular POMDPs, CRANE is similar to the algorithm OMLE proposed in [37], and their analysis leads to a sample complexity similar to Corollary 1. However, their algorithm has a pre-processing step on the emission matrix O_{h,m}, while our pre-processing step formulates POMDPs into PSRs, so the algorithms are still different. Further, they assume an upper bound on ‖O†_{h,m}‖_{1→1}, while we assume ‖K†_h‖_{1→1} ≤ 1/α in Assumption 1. For tabular POMDPs, our assumption is slightly stronger since we have K_{h−1} = O_{h,m} S_h, where (S_h)_{s,τ^l_{h−1}} = P(s|τ^l_{h−1}), and thus σ_min(K_{h−1}) ≤ d_PSR σ_min(O_{h,m}). That said, the analysis and algorithm in [37] are specially tailored to m-step weakly-revealing POMDPs and rely on the existence of latent states.
In contrast, our algorithm and analysis apply to any PSR model, including m-step decodable POMDPs.

Comparison with [34]. [34] deals with latent MDPs but requires either proper initialization or other assumptions, including Sufficient Tests, Sufficient Histories, strong separation of the MDPs, and reachability of the states. In contrast, we show in Appendix E that an LMDP with Sufficient Tests can be formulated as an (l + 1)-step weakly-revealing POMDP; therefore CRANE is capable of tackling LMDPs with sample complexity 1/ε² × poly(M, |S|, |A|^l, 1/α, H, log(1/δ)) under only Sufficient Tests and Assumption 1. In addition, the sample complexity in [34, Theorem 3.5] scales with the initialization error, while CRANE circumvents such dependence completely.

A.2 m-STEP WEAKLY-REVEALING LOW-RANK POMDPS

Next, we supplement the details about m-step weakly-revealing low-rank POMDPs.

Core test sets and function classes. For weakly-revealing low-rank POMDPs, we can still choose U_h to be the set of all m-step futures (O × A)^{m−1} × O due to the weakly-revealing property. For the function class F, we let it model the feature vectors, emission matrix, and initial state distribution, i.e., {Φ_f, Ψ_f, O_f, μ_f}_{f∈F}, where Φ_f : S × A × [H] → R^{d_trans}, Ψ_f : S × [H] → R^{d_trans}, O_f : S × O × [H] → [0, 1], μ_f : S → [0, 1], such that φ_{h;f}(s, a) = Φ_f(s, a, h), ψ_{h;f}(s) = Ψ_f(s, h), O_{h;f}(o|s) = O_f(s, o, h), μ_{1;f}(s) = μ_f(s). Here F can be infinite. Denote the ℓ_∞-norm covering numbers of Φ_f, Ψ_f, O_f, μ_f by Y_Φ(ε), Y_Ψ(ε), Y_O(ε), Y_μ(ε). Then we have log N_F(ε) ≤ log Y_Φ(ε_LR/d_trans) + log Y_Ψ(ε_LR/d_trans) + log Y_O(ε_LR) + log Y_μ(ε_LR), where ε_LR := O(ε/(|O|^{H+2}|A|^H)). The proof is deferred to Appendix G. To make Assumption 3 hold, we only need to assume that the feature vector classes satisfy realizability:

Assumption 4.
Suppose that there exists f* ∈ F such that for all s ∈ S, a ∈ A, h ∈ [H] we have φ_h(s, a) = Φ_{f*}(s, a, h) and ψ_h(s) = Ψ_{f*}(s, h).

Then we can lift low-rank POMDPs to the PSR formulation and pre-process them to satisfy Assumption 3.

Corollary 4 (Sample complexity for m-step weakly-revealing low-rank POMDPs). Then under Assumptions 1 and 4, for any δ ∈ (0, 1], ε > 0, if we choose


T = 1/ε² × poly(d_trans, |A|^m, 1/α, log Y_Φ(ε_LR/d_trans), log Y_Ψ(ε_LR/d_trans), log Y_O(ε_LR), log Y_μ(ε_LR), H, log |O|, log(1/δ)), then with probability at least 1 − δ we have V^π ≥ V* − ε.

Comparison with [47]. In Corollary 4, we do not specify the function class and keep the bracketing number to facilitate the comparison with [47]. [47] also tackles the online learning problem of m-step weakly-revealing low-rank POMDPs, and our sample complexity only has an additional log |O| factor compared to theirs. However, the model they consider is less general than ours in the sense that they require the feature vectors φ_h(s, a) to be d_trans-dimensional probability distributions to guarantee the existence of certain bottleneck variables. Besides, their analysis depends on some possibly complicated assumptions to recover the bottleneck variable, such as "past sufficiency". In contrast, CRANE only requires ‖K†_h‖_{1→1} to be upper bounded and does not assume the existence of bottleneck variables.

Comparison with [45]. They show favorable sample complexity results in weakly-revealing low-rank POMDPs. However, their sample complexity results are quasi-polynomial. Our results, on the other hand, are polynomial, at the cost of an additional log |O| factor.

A.3 TABULAR PSRS

Notice that in Theorem 1 the log bracketing number log N_F(ε_b) is abstract. Here we consider tabular PSRs as a special case to provide intuition for how large the bracketing number is in general. In tabular PSRs we directly use {M_{o,a,h}, q_0}_{o∈O, a∈A, h∈[H−1]} as the parameters of F and assume that for all f ∈ F we have max_{o∈O, a∈A, h∈[H−1], u∈U_{h+1}} ‖m_{(o,a,u),h;f}‖_∞ ≤ 1 and ‖q_{0;f}‖_∞ ≤ 1. Then following the arguments in Appendix G, we have log N_F(ε) ≤ O(|U|² |O| |A| H² log(H|O||A||U_A||U|/(αε))). Substituting the above results into Theorem 1, we have the following corollary, which characterizes the sample complexity of learning tabular PSRs:

Corollary 5 (Sample complexity for tabular PSRs). Execute CRANE with β = c log(N_F(ε_b) T H |U_A| / δ), where N_F(ε_b) is specified in (6). Then under Assumptions 1, 2 and 3, for any δ ∈ (0, 1], ε > 0, if we choose T = 1/ε² × poly(d_PSR, |U_A|, 1/α, |U|, |O|, |A|, H, log(1/δ)), then with probability at least 1 − δ we have V^π ≥ V* − ε.

From Corollary 5 we can see that CRANE is capable of learning tabular PSRs efficiently, with sample complexity polynomial in all relevant parameters. Although we have a poly(|U|) dependency in learning PSRs (since our model parameters M_{o,a} have degrees of freedom scaling as poly(|U|)), we will show that we do not incur poly(|U|) in the log bracketing number when the PSRs are POMDPs. This is because we can directly model the latent transition and emission distributions once we know it is a POMDP.

A.4 m-STEP WEAKLY-REVEALING LINEAR POMDPS

In low-rank POMDPs the bracketing number is still somewhat abstract because we do not specify the function class {Φ_f, Ψ_f, O_f, μ_f}. Next we consider linear POMDPs and illustrate a more concrete result. Here we assume that linear POMDPs possess a linear structure in both the transition kernel and the emission matrix. More formally, we can generalize the linear MDPs in [48] and define linear POMDPs as follows:

Definition 6 (Linear POMDPs). A POMDP is linear with respect to given feature vectors {φ_h(s, a) ∈ R^{d_1}, ψ_h(s') ∈ R^{d_2}, φ_h(s) ∈ R^{d_3}, ψ_h(o) ∈ R^{d_4}, φ(s) ∈ R^{d_5}}_{s∈S, a∈A, o∈O, h∈[H]}, where ‖φ_h(s, a)‖_∞ ≤ 1, ‖ψ_h(s')‖_∞ ≤ 1, ‖φ_h(s)‖_∞ ≤ 1, ‖ψ_h(o)‖_∞ ≤ 1, ‖φ(s)‖_∞ ≤ 1 for all s ∈ S, a ∈ A, o ∈ O, h ∈ [H], if there exists a set of matrices {B*_{h,1} ∈ R^{d_2×d_1}, B*_{h,2} ∈ R^{d_4×d_3}}_{h∈[H]} with ‖B*_{h,1}‖_{∞,∞} ≤ 1, ‖B*_{h,2}‖_{∞,∞} ≤ 1 for all h ∈ [H], and θ* ∈ R^{d_5} with ‖θ*‖_∞ ≤ 1, such that for any s, s' ∈ S, a ∈ A, o ∈ O, h ∈ [H] we have T_h(s'|s, a) = (ψ_h(s'))^⊤ B*_{h,1} φ_h(s, a), O_h(o|s) = (ψ_h(o))^⊤ B*_{h,2} φ_h(s), μ_1(s) = (θ*)^⊤ φ(s).

Denote d_lin = max{d_1, d_2, d_3, d_4, d_5}. We further define the function class to be {f = (B_{h,1} ∈ R^{d_2×d_1}, B_{h,2} ∈ R^{d_4×d_3}, θ ∈ R^{d_5}) : ‖B_{h,1}‖_{∞,∞} ≤ 1, ‖B_{h,2}‖_{∞,∞} ≤ 1, ‖θ‖_∞ ≤ 1, ∀h ∈ [H]}, such that for any o ∈ O, a ∈ A, h ∈ [H − m], T_{h;f}(s'|s, a) = (ψ_h(s'))^⊤ B_{h,1} φ_h(s, a), O_{h;f}(o|s) = (ψ_h(o))^⊤ B_{h,2} φ_h(s), μ_{1;f}(s) = θ^⊤ φ(s). This enables us to bound N_F(ε) as in Table 2. Note that this function class satisfies realizability, and we can pre-process it to make Assumption 3 hold. Finally, we assume the m-step weakly-revealing property; in this case, we can still choose the same U_h as in tabular POMDPs. Using the above models and formulating the POMDPs into PSRs in the pre-processing step to satisfy Assumption 3, we can run CRANE. The sample complexity for learning linear POMDPs then scales with d_lin rather than poly(|O|, |S|), as follows.
Corollary 6 (Sample complexity of m-step weakly-revealing linear POMDPs). Suppose the linear POMDP is m-step weakly-revealing and we execute CRANE with β = c log(N_F(ε_b) T H |U_A| / δ) up to step H − m, where U_h and N_F(ε_b) are specified in Table 2. Then under Assumption 1, for any δ ∈ (0, 1], ε > 0, if we choose T = 1/ε² × poly(d_lin, |A|^m, 1/α, H, log |O|, log(1/δ)), then with probability at least 1 − δ we have V^π ≥ V* − ε.

Comparison with [7]. From Corollary 6, we can see that the linear structure helps us circumvent the polynomial scaling with |O| and |S|.

Core test sets and function classes. Like m-step weakly-revealing POMDPs, m-step decodable POMDPs can be formulated as PSRs where the core tests are m-step futures and the PSR rank is |S| in the tabular case and d_trans in low-rank POMDPs. Intuitively, this is proved via the observation that m-step futures can decode the latent state m steps ahead, i.e., s_{h+m}, by treating the "histories" in the definition as "futures". Appendix E contains a more detailed discussion. We also utilize the same function classes as for m-step weakly-revealing POMDPs.

Remarks about Corollary 3. Note that the discussion about m-step weakly-revealing linear POMDPs also holds for m-step decodable POMDPs; therefore we can extend Corollary 3 to the following corollary:

Corollary 7 (Sample complexity for m-step decodable POMDPs).
• In m-step decodable tabular POMDPs, the same statement as in Corollary 1 holds.
• In m-step decodable low-rank POMDPs, the same statement as in Corollary 4 holds.
• In m-step decodable linear POMDPs, the same statement as in Corollary 6 holds.

Comparison with [13]. [13] works on m-step decodable tabular POMDPs and shows sample complexity polynomial in |S|, H, |A|^m, 1/ε, log(1/δ), and the log covering number of a value function class. They also provide a result on m-step decodable low-rank POMDPs where the sample complexity scales with d_trans rather than |S|.
However, there are some differences between their results and Corollary 7. First, the log covering number of the value function class in their results typically scales with poly(|O|^m). Our results, on the other hand, only scale with poly(|O|), since the log bracketing number of our function classes only scales with poly(|O|). Second, the analysis in [13] does not require the regularity-type assumption (Assumption 1). This is because their algorithm is tailored to m-step decodable POMDPs. The lower bound in Theorem 3 shows that the scaling with the regularity parameter 1/α is inevitable in general PSRs, highlighting the necessity of such regularity in general.

B RELATED WORKS

We discuss works related to our paper in this section.

PSRs and their learning algorithms. PSRs represent states as a vector of predictions about future events [36, 41, 38, 21, 44, 19]. Importantly, compared to well-known models of dynamical systems like HMMs that postulate latent state variables that are never observed, PSRs do not need to refer to latent state variables, and every definition relies on observable quantities. While PSRs were originally introduced in the tabular setting, they can be extended to the non-tabular setting using conditional mean embeddings [4]. Using data obtained by exploratory open-loop policies such as uniform policies, Boots et al. [6, 4] and Zhang et al. [49] proposed learning algorithms for the dynamics by leveraging spectral learning [33, 23, 28]. Later, Hefny et al. [22] pointed out an insightful connection between spectral learning and supervised learning (more specifically, instrumental variable regression, where histories serve as instruments). Based on this viewpoint, Hefny et al. [22] proposed a two-stage-regression learning algorithm. Compared to these settings, ours is significantly more challenging: their goal is to learn the system dynamics from exploratory offline data, while we want to learn the optimal policy without access to such exploratory data.

Provably efficient RL for POMDPs and PSRs.

Seminal works [31, 14] obtained |A|^H-type sample complexity bounds for POMDPs. We can avoid this exponential dependence with more structural assumptions. Recently, there is a growing body of literature that discusses provably efficient RL in the online setting under various structures. In the tabular setting, one of the most standard structural assumptions is an observability (i.e., weakly-revealing) assumption, which implies that observations retain information about latent states. Under observability and various additional assumptions, Azizzadenesheli et al. [3], Guo et al. [20], and Kwon et al. [34] obtained favorable polynomial sample complexities by leveraging the spectral learning technique [23]. Later, Jin et al. [29] and Liu et al. [37] improved these results and obtained polynomial sample complexity results under only observability assumptions. Golowich et al. [17, 18] developed algorithms with quasi-polynomial sample and computational complexity under observability properties. In the non-tabular POMDP setting, several positive results have been obtained. One of the most investigated models is the linear quadratic Gaussian (LQG), which is a partially observable version of the LQR. Lale et al. [35] and Simchowitz et al. [40] proposed sub-linear regret algorithms. Polynomial sample complexities have been obtained for various other POMDP models, such as m-step decodable POMDPs [13], where we can decode the latent state from the m-step history (when m = 1, this is a Block MDP); weakly-revealing linear-mixture-type POMDPs [7], where emission and transition are modeled by linear mixture models; and weakly-revealing low-rank POMDPs [45], where latent transitions have low-rank structure. Our proposed algorithm can capture all of the above-mentioned models except for LQG. There are few works that discuss strategic exploration in PSRs, and none of them obtain polynomial sample complexity results for learning approximately globally optimal policies [26, 45]. For details, refer to Section 4.

C NOTATIONS

We sum up the notations in PSRs in Table 3 .

D EXPRESSIVITY OF PSRS

In this section, we construct a sequential decision-making process to illustrate the superior expressivity of PSRs relative to POMDPs. In short, we show that if we formulate the process as a POMDP, the number of latent states we need can be exponentially larger than the core test set size in the PSR formulation. The construction leverages existing results on the perfect matching polytope and largely follows the arguments in [1]. First, let n be even and let K_n be the complete graph on n vertices. Consider a vector x ∈ R^{n(n−1)/2} that associates a weight to each edge, and denote its entries by x_{u,v}, where u ≠ v ∈ [n] are vertices. Let 1_M ∈ R^{n(n−1)/2} denote the edge-indicator vector of a subset of edges M. Then [12] shows that the convex hull of all edge-indicator vectors corresponding to perfect matchings, called the perfect matching polytope, can be expressed by a set of constraints as follows: P_n := conv{1_M ∈ R^{n(n−1)/2} : M is a perfect matching in K_n} = {x ∈ R^{n(n−1)/2} : x ≥ 0; ∀v : Σ_u x_{u,v} = 1; ∀U ⊂ [n] with |U| odd : Σ_{v∉U} Σ_{u∈U} x_{u,v} ≥ 1}. There are V := n!/(2^{n/2}(n/2)!) vertices in P_n and the number of constraints is C := 2^{Ω(n)}. We denote the vertices by {v_1, …, v_V} and the constraints by c_1, …, c_C. We further add another dimension to each v_i (i ∈ [V]) to account for the offsets in the constraints and obtain vectors v̄_i ∈ R^{n(n−1)/2+1} (i ∈ [V]). Then we have ⟨c_i, v̄_j⟩ ≥ 0 for all i ∈ [C], j ∈ [V]. Now we can define the slack matrix of the polytope P_n as Z ∈ R^{C×V}_+ with Z_{i,j} = ⟨c_i, v̄_j⟩. Notice that the rank of Z is O(n²). However, since P_n has extension complexity 2^{Ω(n)} [39] and the extension complexity of a polytope equals the nonnegative rank of its slack matrix [15], we know the nonnegative rank of Z is at least 2^{Ω(n)}. Now we can construct our sequential decision-making process.
Suppose that for steps 1 ≤ h ≤ H − 1 the process behaves according to a POMDP with state space S', action space A, observation space O, initial state distribution μ_1, emission matrices O_h, and transition kernels T_h. At step h = H, however, the one-step system dynamics D_{H−1} is given by associating each pair (o_{H−1}, a_{H−1}) with a constraint c_i and each future test t ∈ O (which is a one-step observation now) with a vertex v̄_j, such that for any history τ_{H−1} that ends with (o_{H−1}, a_{H−1}) we have P(t|τ_{H−1}) = ⟨c_i, v̄_j⟩ / Σ_{k=1}^{V} ⟨c_i, v̄_k⟩. Now fix a history τ_{H−2} of length H − 2 and consider the matrix D̃_{H−1} ∈ R^{|O|×(|O||A|)} whose rows are indexed by the tests t ∈ O and whose columns are indexed by the histories (τ_{H−2}, o, a) for all o ∈ O, a ∈ A, with (D̃_{H−1})_{t,(τ_{H−2},o,a)} = P(t|τ_{H−2}, o, a). Since the nonnegative rank is preserved under positive diagonal rescaling [8], we know the nonnegative rank of D̃_{H−1} is at least 2^{Ω(n)}. Then for the above sequential process, if we formulate it into a POMDP with state space S, we have P(t|τ_{H−2}, o, a) = Σ_{s_H∈S} P(t|s_H) P(s_H|τ_{H−2}, o, a).
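The diagonal-rescaling step above can be sanity-checked numerically: rescaling the columns of a nonnegative matrix by positive constants changes neither its ordinary rank nor, by the cited result [8], its nonnegative rank. A small numpy sketch checking the ordinary-rank claim (the matrix contents are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(7)
# A nonnegative "slack-like" matrix Z (contents arbitrary, strictly positive).
Z = rng.random((8, 6)) + 0.01

# Column normalization is right-multiplication by a positive diagonal matrix:
# D = Z @ diag(1 / column sums). This rescaling preserves the rank.
D = Z / Z.sum(axis=0, keepdims=True)
rank_preserved = np.linalg.matrix_rank(D) == np.linalg.matrix_rank(Z)
print(rank_preserved)
```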

Notice that for a row-stochastic matrix D̃_{H−1}, the nonnegative rank is equal to the smallest number of factors we can use to write D̃_{H−1} = RS where both R and S are row-stochastic [8]. This implies that |S| must be at least the nonnegative rank of D̃_{H−1}; therefore we have |S| ≥ 2^{Ω(n)}. On the other hand, if we formulate the above process into a PSR, then at step h = H, since the rank of D_{H−1} is equal to the rank of Z, we know the rank of D_{H−1} is at most O(n²), which implies that there exists a core test set U_H whose size is at most O(n²). When 1 ≤ h ≤ H − 1, notice that for any test t = {o_{h:H}, a_{h:H−1}} and history τ_{h−1} we have P(t|τ_{h−1}) = Σ_{s_h∈S'} P(s_h|τ_{h−1}) (P(t_{h:H−1}|s_h) P(o_H|o_{H−1}, a_{H−1})), where t_{h:H−1} = {o_{h:H−1}, a_{h:H−2}}. Notice that P(t_{h:H−1}|s_h) P(o_H|o_{H−1}, a_{H−1}) depends only on t, and P(s_h|τ_{h−1}) depends only on τ_{h−1}. This implies that there exists a core test set U_h whose size is at most |S'| for all 1 ≤ h ≤ H − 1. Therefore, the core test set size of the PSR is at most max{O(n²), |S'|}. This shows that PSRs can express this sequential decision-making process exponentially more efficiently than POMDPs.

Figure 2: For any POMDP, the system dynamics matrix D_h can always be factorized through the latent states. This factorization implies that the rank of D_h is no larger than the number of latent states, which implies that a POMDP is a linear PSR with rank at most the number of latent states. Note that here D_{h,1} and D_{h,2} both contain nonnegative entries. In contrast, from Figure 1, the low-rank factorization of D_h in a PSR can have negative entries (i.e., m and v can have negative entries).

E EXAMPLES OF PSRS

In this section we present the proofs of Lemma 2 and Lemma 3, and then formulate m-step weakly-revealing POMDPs, m-step decodable POMDPs, and low-rank POMDPs into PSRs and analyze their core test sets and minimum core test set sizes.

E.1 PROOFS OF LEMMA 2 AND LEMMA 3

We first prove Lemma 2. Consider the one-step system dynamics D_h shown in Figure 2, whose rows are indexed by all possible future tests t_{h+1} and whose columns are indexed by all histories τ_h at step h. Each entry of D_h is the success probability of the test, i.e., (D_h)_{t_{h+1},τ_h} = P(t_{h+1}|τ_h). Since P(t_{h+1}|τ_h) = Σ_{s_{h+1}∈S} P(t_{h+1}|s_{h+1}) P(s_{h+1}|τ_h) (where we define P(s_{h+1}|τ_h) = 0 for unreachable τ_h), we can decompose D_h into the product of D_{h,1} and D_{h,2} as in Figure 2, where (D_{h,1})_{t_{h+1},s_{h+1}} = P(t_{h+1}|s_{h+1}) and (D_{h,2})_{s_{h+1},τ_h} = P(s_{h+1}|τ_h). This implies that the rank of D_h is at most |S|, which proves that it is a linear PSR with rank no larger than |S|.
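The factorization argument in the proof of Lemma 2 can be illustrated numerically: any D_h that factors through a belief over |S| latent states has rank at most |S|. A minimal numpy sketch (all dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
S = 4                       # number of latent states
n_tests, n_hist = 50, 60    # future tests t_{h+1} and histories tau_h

# D_{h,1}[t, s] = P(t | s_{h+1} = s): each column is a distribution over tests.
D1 = rng.random((n_tests, S))
D1 /= D1.sum(axis=0, keepdims=True)
# D_{h,2}[s, tau] = P(s_{h+1} = s | tau_h): each column is a belief state.
D2 = rng.random((S, n_hist))
D2 /= D2.sum(axis=0, keepdims=True)

# The system-dynamics matrix factors through the latent state, so its rank
# is at most |S| even though it is n_tests x n_hist.
D_h = D1 @ D2
rank_ok = np.linalg.matrix_rank(D_h) <= S
print(rank_ok)
```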

Now we prove Lemma 3. Consider any h ∈ [H] and let q_{τ_{h−1}} = [P(o|τ_{h−1})]_{o∈O}. Then the belief state of the POMDP, s_{τ_{h−1}} = [P(s_h|τ_{h−1})]_{s_h∈S}, can be expressed as s_{τ_{h−1}} = O†_h q_{τ_{h−1}}. Here we use the fact that O†_h O_h is the |S| × |S| identity matrix, which is verified by the assumption. Then for any test t = (o_{h:h+W}, a_{h:h+W−1}), we know P(t|τ_{h−1}) = m̄_{t,h}^⊤ s_{τ_{h−1}}, where m̄_{t,h}^⊤ = O_{h+W}(o_{h+W}|•)^⊤ ∏_{l=h}^{h+W−1} T_{l,a_l} diag(O_l(o_l|•)). Here O_h(o|•) ∈ R^{|S|} is the vector whose s-th entry is O_h(o|s), and T_{l,a_l} is the |S| × |S| matrix with entries (T_{l,a_l})_{s',s} = T_l(s'|s, a_l). Therefore we have P(t|τ_{h−1}) = ⟨m_{t,h}, q_{τ_{h−1}}⟩, where m_{t,h} = (m̄_{t,h}^⊤ O†_h)^⊤. Thus we have shown that the probability of any test t is a linear combination of the probabilities of the tests o ∈ O (with linear-combination weights m_{t,h} that depend only on the test and are independent of the history). This indicates that O is a core test set for 1-step weakly-revealing POMDPs.
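The key step of the proof, recovering the belief state from one-step observation probabilities via the pseudo-inverse O†_h, can be checked numerically when O_h has full column rank. A small numpy sketch (dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
S, O_size = 3, 6
O_h = rng.random((O_size, S))
O_h /= O_h.sum(axis=0, keepdims=True)   # full column rank almost surely

belief = rng.random(S)
belief /= belief.sum()                   # s_{tau_{h-1}} = [P(s_h | tau_{h-1})]_s
q = O_h @ belief                         # q_{tau_{h-1}} = [P(o | tau_{h-1})]_o

# Full column rank gives pinv(O_h) @ O_h = I, so the belief is recovered
# from one-step observation probabilities alone.
recovered = np.linalg.pinv(O_h) @ q
recovery_ok = np.allclose(recovered, belief)
print(recovery_ok)
```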

E.2 m-STEP WEAKLY-REVEALING POMDPS

Recall that the definition of the m-step emission matrix O_{h,m} ∈ R^{|A|^{m−1}|O|^m×|S|} for any h ∈ [H − m + 1] is as follows: (O_{h,m})_{(a,o),s} := P(o_{h:h+m−1} = o | s_h = s, a_{h:h+m−2} = a), for all (a, o) ∈ A^{m−1} × O^m, s ∈ S. The m-step weakly-revealing condition [37] means that the rank of O_{h,m} is |S| for all h ∈ [H − m + 1]. From Lemma 2, we know that d_PSR ≤ |S|. In addition, the following lemma shows that a general core test set for m-step weakly-revealing POMDPs is the set of all m-step futures:

Lemma 5. When O_{h,m} is full rank for all h ∈ [H − m + 1], the POMDP is a PSR with core test set U_h = O × (A × O)^{m−1} for all h ∈ [H − m + 1].

Proof. The proof is similar to the 1-step weakly-revealing case. Consider any h ∈ [H − m + 1] and let q_{τ_{h−1}} = [P(u|τ_{h−1})]_{u∈O×(A×O)^{m−1}}. Then the belief state s_{τ_{h−1}} = [P(s_h|τ_{h−1})]_{s_h∈S} can be expressed as s_{τ_{h−1}} = O†_{h,m} q_{τ_{h−1}}. Then for any test t = (o_{h:h+W}, a_{h:h+W−1}), we know P(t|τ_{h−1}) = m̄_{t,h}^⊤ s_{τ_{h−1}}, where m̄_{t,h}^⊤ = O_{h+W}(o_{h+W}|•)^⊤ ∏_{l=h}^{h+W−1} T_{l,a_l} diag(O_l(o_l|•)). Recall that O_h(o|•) ∈ R^{|S|} is the vector whose s-th entry is O_h(o|s), and T_{l,a_l} is the |S| × |S| matrix with entries (T_{l,a_l})_{s',s} = T_l(s'|s, a_l).

Therefore we have P(t|τ_{h−1}) = ⟨m_{t,h}, q_{τ_{h−1}}⟩, where m_{t,h} = (m̄_{t,h}^⊤ O†_{h,m})^⊤. This indicates that U_h = O × (A × O)^{m−1} is a core test set for all h ∈ [H − m + 1]. Notice that here we only exhibit the core test sets of m-step weakly-revealing POMDPs up to step H − m + 1. However, this is sufficient to characterize the whole POMDP. From Lemma 4 we know that for any trajectory τ_H, P^π(τ_H) is one of the entries of ∏_{l=1}^{H−m} M_{o_l,a_l,l} q_0 × π(τ_H). Therefore, with parameters {M_{o,a,h}, q_0}_{o∈O, a∈A, h∈[H−m]} (which depend only on U_h for h ∈ [H − m + 1]) we can recover the POMDP easily.

E.3 LATENT MDPS

Next we consider the latent MDP (LMDP) model in [34]. Suppose there are M MDPs, and each MDP m is characterized by (S, A, T_{m,h}, R_{m,h}, H, μ_m), where S is the common state space, A is the common action space, T_{m,h} is the transition probability at step h of MDP m, R_{m,h} : S × A × {0, 1} → [0, 1] is a probability measure for rewards at step h of MDP m that maps a state-action pair and a binary reward to a probability, H is the horizon, and μ_m is the initial state distribution of MDP m. At the start of every episode, one MDP m ∈ [M] is randomly chosen with some probability w_m. [34, Theorem 3.1] shows that with no further assumptions, learning an instance of the above LMDP requires at least Ω((|S||A|)^M) episodes in the worst case. A number of assumptions have been considered to circumvent this lower bound, and one of them is called Sufficient Tests. More specifically, for each step h ∈ [H − l + 1], consider all possible length-l action-reward-state sequences a_h, r_h, s_{h+1}, …, a_{h+l−1}, r_{h+l−1}, s_{h+l}, and denote the set of all such sequences by T_h. Then suppose that the matrix of success probabilities of T_h under the different MDPs given any s_h ∈ S has rank M:

Assumption 5 (Sufficient Tests, [34]). Let L_{s_h} = [[P_1(t|s_h)]_{t∈T_h}, …, [P_M(t|s_h)]_{t∈T_h}]. Then σ_M(L_{s_h}) ≥ α for all s_h ∈ S with some α > 0.

The following lemma indicates that LMDPs under Assumption 5 can be formulated as (l + 1)-step weakly-revealing POMDPs, and thus as PSRs with core test sets consisting of all (l + 1)-step futures:

Lemma 6. Under Assumption 5, the LMDP can be formulated as an (l + 1)-step weakly-revealing POMDP.

Proof. First notice that the LMDP can be formulated as a POMDP with state space S̄ = S × {0, 1} × [M] and observation space O = S × {0, 1}. At each step h, the latent state s̄_h ∈ S̄ is (s_h, r_{h−1}, I), where s_h is the current observed state, r_{h−1} is the reward of the last step, and I is the index of the underlying MDP. The observation o_h, on the other hand, is (s_h, r_{h−1}).
Then for any h ∈ [H − l + 1], any latent state s̄_h = (s_h, r_{h−1}, I), and any (l + 1)-step test t = (o^t_h, a^t_h, …, o^t_{h+l}), we have P(t|s̄_h) = 1(o^t_h = (s_h, r_{h−1})) · P_I(t'|s_h), where t' = (a^t_h, o^t_{h+1}, …, o^t_{h+l}). Therefore, the (l + 1)-step emission matrix can be written as the block-diagonal matrix O_{h,l+1} = diag(L_{s^1_h}, L_{s^1_h}, L_{s^2_h}, L_{s^2_h}, …, L_{s^{|S|}_h}, L_{s^{|S|}_h}), with one block L_{s_h} for each pair (s_h, r_{h−1}) ∈ S × {0, 1}. Since the rank of L_{s_h} is M for any s_h ∈ S, the rank of O_{h,l+1} is 2M|S| for all h ∈ [H − l + 1]. This implies that the POMDP satisfies the (l + 1)-step weakly-revealing condition.
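The rank computation at the end of the proof can be verified numerically: a block-diagonal matrix whose blocks each have rank M has rank M times the number of blocks. A minimal numpy sketch (all sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
M, n_tests, n_blocks = 2, 7, 4   # M latent MDPs, |T_h| tests, 2|S| blocks

# One block L_{s_h} per (state, reward-bit) pair; each has rank M (a.s.).
blocks = [rng.random((n_tests, M)) for _ in range(n_blocks)]

# Assemble the block-diagonal (l+1)-step emission matrix.
O_emit = np.zeros((n_blocks * n_tests, n_blocks * M))
for i, L in enumerate(blocks):
    O_emit[i * n_tests:(i + 1) * n_tests, i * M:(i + 1) * M] = L

# Ranks of diagonal blocks add up.
rank_ok = np.linalg.matrix_rank(O_emit) == M * n_blocks
print(rank_ok)
```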

E.4 m-STEP DECODABLE POMDPS

Recall that the definition of m-step decodable POMDPs [13] is that there exist unknown decoders {φ_dec,h}_{m≤h≤H} such that for every reachable trajectory τ_H, we have s_h = φ_dec,h(z_h) for all m ≤ h ≤ H, where z_h = ((o, a)_{h−m+1:h−1}, o_h). From Lemma 2, we know that d_PSR ≤ |S|. Further, the following lemma shows that a general core test set for m-step decodable POMDPs is the set of all m-step futures:

Lemma 7. An m-step decodable POMDP is a PSR with core test set U_h = O × (A × O)^{m−1} for all h ∈ [H − m + 1].

Proof. Consider any h ∈ [H − m + 1] and let q_{τ_{h−1}} = [P(u|τ_{h−1})]_{u∈O×(A×O)^{m−1}}. Then for any test t = (o_{h:h+W}, a_{h:h+W−1}) with W ≤ m − 1, we have for any length-(m − 1 − W) action sequence a, P(t|τ_{h−1}) = Σ_{u∈U_{t,a}} P(u|τ_{h−1}), where U_{t,a} denotes the set of all length-m tests whose action sequence is (a_{h:h+W−1}, a) and whose first W + 1 observations are o_{h:h+W}. This implies P(t|τ_{h−1}) = m_{t,h}^⊤ q_{τ_{h−1}}, where m_{t,h} sets the entries corresponding to the tests in U_{t,a} to 1 and the others to 0. When W > m − 1, denote t_{h:h+m−1} = (o_{h:h+m−1}, a_{h:h+m−1}) and t_{h+m} = (o_{h+m:h+W}, a_{h+m:h+W−1}). Then we have P(t|τ_{h−1}) = P(t_{h:h+m−1}|τ_{h−1}) P(o_{h+m:h+W}|(τ_{h−1}, t_{h:h+m−1}); do(a_{h+m:h+W−1})) = P(t_{h:h+m−1}|τ_{h−1}) P(o_{h+m:h+W}|φ_dec,h+m−1(z_{h+m−1}), a_{h+m−1}; do(a_{h+m:h+W−1})). Notice that P(o_{h+m:h+W}|φ_dec,h+m−1(z_{h+m−1}), a_{h+m−1}; do(a_{h+m:h+W−1})) depends only on t; therefore P(t|τ_{h−1}) = m_{t,h}^⊤ q_{τ_{h−1}}, where m_{t,h} sets the entry corresponding to (o_{h:h+m−1}, a_{h:h+m−2}) to P(o_{h+m:h+W}|φ_dec,h+m−1(z_{h+m−1}), a_{h+m−1}; do(a_{h+m:h+W−1})) and the others to 0. This concludes the proof.

Similar to the discussion for m-step weakly-revealing POMDPs, it is sufficient to exhibit the core test sets of m-step decodable POMDPs up to step H − m + 1.
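The W ≤ m − 1 case of the proof of Lemma 7 can be illustrated numerically: the probability of a short test is an indicator-weighted sum of core-test probabilities, and the answer does not depend on the padding action a. A small numpy sketch for m = 2 (all distributions are random illustrations):

```python
import numpy as np

rng = np.random.default_rng(6)
O_size, A = 2, 3
p_o = rng.random(O_size)
p_o /= p_o.sum()                             # P(o_h | tau_{h-1})
p_next = rng.random((O_size, A, O_size))
p_next /= p_next.sum(axis=2, keepdims=True)  # P(o_{h+1} | o_h, do(a_h))

# Prediction vector over the core tests u = (o_h, a_h, o_{h+1}):
P_u = p_o[:, None, None] * p_next            # P(u | tau_{h-1})
q = P_u.reshape(-1)

# Short test t = (o_h = 1): for ANY padding action a, the 0/1 vector m_t that
# sums the core tests extending t recovers P(t | tau_{h-1}) = p_o[1].
consistent = True
for a in range(A):
    m_t = np.zeros((O_size, A, O_size))
    m_t[1, a, :] = 1.0
    consistent &= bool(np.isclose(m_t.reshape(-1) @ q, p_o[1]))
print(consistent)
```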

E.5 LOW-RANK POMDPS

Next we consider low-rank POMDPs. Recall that for low-rank POMDPs, the transition kernel T_h has the following low-rank form for all h ∈ [H]: T_h(s'|s, a) = (ψ_h(s'))^⊤ φ_h(s, a), where ψ_h : S → R^{d_trans} and φ_h : S × A → R^{d_trans}. The next lemma shows that for low-rank POMDPs, the minimum core test set size is at most d_trans, which can be potentially much smaller than |S|:

Lemma 8. For any low-rank POMDP, the minimum core test set size is at most d_trans.

Proof. First notice that for any test t and history τ_h, P(t|τ_h) = ⟨[P(t|s_{h+1})]_{s_{h+1}∈S}, s_{τ_h}⟩. Besides, from the low-rank structure (4) we have, for any s_{h+1} ∈ S, P(s_{h+1}|τ_h) = ψ_h(s_{h+1})^⊤ Σ_{s_h∈S} φ_h(s_h, a_h) P(s_h|τ_h).

This implies that

P(t|τ_h) = Σ_{s_{h+1}∈S} P(t|s_{h+1}) ψ_h(s_{h+1})^⊤ · (Σ_{s_h∈S} φ_h(s_h, a_h) P(s_h|τ_h)),

which implies that the rank of the one-step system dynamics D_h is at most d_trans for all h ∈ [H]. Therefore we have d_PSR ≤ d_trans.

F PROOF OF LEMMA 4

We first prove (4). Notice that we have

P^π(τ_h) q_{τ_h} = [P(u|τ_h) P^π(τ_h)]_{u∈U_{h+1}} = (P(u|τ_h))_{u∈U_{h+1}} P(o_h|τ_{h−1}) π(a_h|τ_{h−1}, o_h) P^π(τ_{h−1}) = (P(o_h, o(u)|τ_{h−1}; do(a_h, a(u))))_{u∈U_{h+1}} · π(a_h|τ_{h−1}, o_h) P^π(τ_{h−1}) = M_{o_h,a_h,h} (q_{τ_{h−1}} P^π(τ_{h−1})) π(a_h|τ_{h−1}, o_h) = ⋯ = ∏_{l=1}^{h} M_{o_l,a_l,l} q_0 × π(τ_h) = b_{τ_h} × π(τ_h),

where the third step comes from the definition (1). In particular, for any trajectory τ_H, we have P^π(τ_H) = π(a_H|τ_{H−1}, o_H)(m_{o_H,H}^⊤ q_{τ_{H−1}}) P^π(τ_{H−1}) = m_{o_H,H}^⊤ ∏_{h=1}^{H−1} M_{o_h,a_h,h} · q_0 · π(τ_H).

Tabular POMDPs. For tabular POMDPs, where {T_{h;f}, O_{h;f}, μ_{1;f}}_{h∈[H]} is modeled directly, it can be observed that log V_F(ε) = O(H|O||S|²|A| log(1/ε)). Therefore we have log N_F(ε) ≤ poly(|O|, |A|, |S|, H, log(1/ε)).

Low-rank POMDPs. For low-rank POMDPs, when utilizing the function class introduced in Section 5, we can obtain log V_F(ε) ≤ log Y_Φ(ε/(3d_trans)) + log Y_Ψ(ε/(3d_trans)) + log Y_O(ε/3) + log Y_μ(ε), which implies that log N_F(ε) ≤ log Y_Φ(ε_LR/d_trans) + log Y_Ψ(ε_LR/d_trans) + log Y_O(ε_LR) + log Y_μ(ε_LR), where ε_LR := O(ε/(|O|^{H+2}|A|^H)).

Linear POMDPs. For linear POMDPs, when utilizing the function class introduced in Section 5, it can be calculated that log V_F(ε) = O(H Σ_{i=1}^{5} d_i log(Σ_{i=1}^{5} d_i/ε)), which implies that log N_F(ε) ≤ O(H² Σ_{i=1}^{5} d_i log(|O||A|/ε)).
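Returning to Lemma 8 above, the bound rank(D_h) ≤ d_trans can be checked numerically: every column of D_h lies in the column span of a matrix with d_trans columns. A minimal numpy sketch (all sizes and names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, d = 20, 3, 4        # |S|, |A|, d_trans with d << |S|
n_tests, n_hist = 30, 25

psi = rng.random((S, d))             # psi_h(s') in R^d
phi = rng.random((S, A, d))          # phi_h(s, a) in R^d
Pt_s = rng.random((n_tests, S))      # P(t | s_{h+1} = s)

# Each history contributes a column P(t | tau_h) = Pt_s @ psi @ w(tau_h),
# where w(tau_h) = sum_s phi(s, a_h) P(s | tau_h) lives in R^d.
D = np.zeros((n_tests, n_hist))
for j in range(n_hist):
    b = rng.random(S)
    b /= b.sum()                      # P(s_h | tau_h) for history j
    a = int(rng.integers(A))          # last action of history j
    w = (phi[:, a, :] * b[:, None]).sum(axis=0)
    D[:, j] = Pt_s @ (psi @ w)

rank_ok = np.linalg.matrix_rank(D) <= d
print(rank_ok)
```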

G.3 PROOF OF LEMMA 9

First let us prove that P^π_f(•) is Lipschitz continuous with respect to {M_{o,a,h;f}, q_{0;f}} for any policy π, as shown in the following lemma:

Lemma 11. For any f, f' ∈ F and 0 < ε_1 ≤ |U_A|, suppose f and f' satisfy max_{o∈O, a∈A, h∈[H−1], u∈U_{h+1}} ‖m_{(o,a,u),h;f} − m_{(o,a,u),h;f'}‖_∞ ≤ ε_op and ‖q_{0;f} − q_{0;f'}‖_∞ ≤ ε_op, where ε_op = αε_1/(4H|U_A|²|U||O|). Then for any policy π, we have Σ_{τ_H} |P^π_f(τ_H) − P^π_{f'}(τ_H)| ≤ ε_1.

Now consider the minimum ε_op-covering net of F, denoted by F'. By the definition of a minimum covering net, for any f ∈ F there exists f' ∈ F' such that max_{o∈O, a∈A, h∈[H−1], u∈U_{h+1}} ‖m_{(o,a,u),h;f} − m_{(o,a,u),h;f'}‖_∞ ≤ ε_op and ‖q_{0;f} − q_{0;f'}‖_∞ ≤ ε_op. Using Lemma 11, we know for any policy π and trajectory τ_H, P^π_{f'}(τ_H) − ε_1 ≤ P^π_f(τ_H) ≤ P^π_{f'}(τ_H) + ε_1. Therefore, defining g^{f'}_1(π, •) = P^π_{f'}(•) − ε_1 and g^{f'}_2(π, •) = P^π_{f'}(•) + ε_1, the set {[g^{f'}_1, g^{f'}_2] : f' ∈ F'} is a 2ε_1(|O||A|)^H-bracket of F, where we use the fact that there are at most (|O||A|)^H trajectories. Setting 2ε_1(|O||A|)^H = ε, we have N_F(ε) ≤ Z_F(αε/(8|O|^{H+1}|A|^H H|U_A|²|U|)).
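The covering-to-bracketing step above can be illustrated on a toy 1-Lipschitz function family: an eps-cover of the parameters yields sup-norm brackets of width 2·eps. A small numpy sketch (the family sin(x + θ) is purely illustrative, not from the paper):

```python
import numpy as np

eps = 0.1
thetas = np.arange(0.0, 1.0 + 1e-9, 2 * eps)  # eps-cover of the parameter set
x = np.linspace(0, 1, 50)

def f(theta):
    # 1-Lipschitz in theta (sup norm): |sin(x+t) - sin(x+t')| <= |t - t'|.
    return np.sin(x + theta)

# Every member of the family sits inside the bracket built from its nearest
# cover point, so the cover induces a bracketing of width 2 * eps.
ok = True
for theta in np.linspace(0, 1, 101):
    j = int(np.argmin(np.abs(thetas - theta)))
    lo, hi = f(thetas[j]) - eps, f(thetas[j]) + eps
    ok &= bool(np.all(lo <= f(theta)) and np.all(f(theta) <= hi))
print(ok)
```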

G.4 PROOF OF LEMMA 11

We use Lemma 16 to prove this lemma via induction. First notice that we have for any o ∈ O, a ∈ A, h ∈ [H-1], u ∈ U_{h+1}, ‖m_{(o,a,u),h;f} - m_{(o,a,u),h;f̄}‖_∞ ≤ ε_op, ‖q_{0;f} - q_{0;f̄}‖_∞ ≤ ε_op. In the following discussion, we use q_0, m_{(o,a,u),h}, M_{o,a,h}, b_{τ_h} to denote q_{0;f}, m_{(o,a,u),h;f}, M_{o,a,h;f}, b_{τ_h;f} and q̄_0, m̄_{(o,a,u),h}, M̄_{o,a,h}, b̄_{τ_h} to denote q_{0;f̄}, m_{(o,a,u),h;f̄}, M_{o,a,h;f̄}, b_{τ_h;f̄} to simplify writing. Next we use induction to prove the lemma. For the base case, we have b_{τ_0} = q_0, b̄_{τ_0} = q̄_0. Therefore from (8) we have ‖b_{τ_0} - b̄_{τ_0}‖_1 ≤ |U|ε_op ≤ ε_1. Now suppose for any h' ≤ h where h ∈ [H-2]^+ and policy π, we have Σ_{τ_{h'}} ‖b_{τ_{h'}} - b̄_{τ_{h'}}‖_1 × π(τ_{h'}) ≤ ε_1. Notice that here f̄ might not satisfy Assumption 3, but from the proof of Lemma 14 we can see that Lemma 14 still holds since f ∈ F. Therefore we have for any policy π, Σ_{τ_{h+1}} ‖b_{τ_{h+1}} - b̄_{τ_{h+1}}‖_1 × π(τ_{h+1}) ≤ (|U_A|/α)( Σ_{l=1}^{h+1} Σ_{τ_l} ‖[M_{o_l,a_l,l} - M̄_{o_l,a_l,l}] b̄_{τ_{l-1}}‖_1 × π(τ_l) + ‖q_0 - q̄_0‖_1 ). From (7), we know for any l ∈ [h+1], Σ_{τ_l} ‖[M_{o_l,a_l,l} - M̄_{o_l,a_l,l}] b̄_{τ_{l-1}}‖_1 × π(τ_l) ≤ ε_op|U| Σ_{τ_l} ‖b̄_{τ_{l-1}}‖_1 × π(τ_l) = ε_op|U||O| Σ_{τ_{l-1}} ‖b̄_{τ_{l-1}}‖_1 × π(τ_{l-1}) ≤ ε_op|U||O| Σ_{τ_{l-1}} ( ‖b_{τ_{l-1}}‖_1 × π(τ_{l-1}) + ‖b_{τ_{l-1}} - b̄_{τ_{l-1}}‖_1 × π(τ_{l-1}) ) ≤ ε_op|U||O| ( |U_A| + Σ_{τ_{l-1}} ‖b_{τ_{l-1}} - b̄_{τ_{l-1}}‖_1 × π(τ_{l-1}) ) ≤ ε_op|U||O|(ε_1 + |U_A|). Here the first step comes from the Cauchy-Schwarz inequality and (7). The fourth step comes from the fact that (b_{τ_{l-1}} π(τ_{l-1}))_u = P_f(u|τ_{l-1}) P^π_f(τ_{l-1}) and thus Σ_{τ_{l-1}} ‖b_{τ_{l-1}}‖_1 × π(τ_{l-1}) = Σ_{τ_{l-1}} ‖q_{τ_{l-1};f}‖_1 · P^π_f(τ_{l-1}) ≤ |U_A| Σ_{τ_{l-1}} P^π_f(τ_{l-1}) = |U_A|. The last step comes from the induction hypothesis. Substituting (8) and (10) into (9), we have Σ_{τ_{h+1}} ‖b_{τ_{h+1}} - b̄_{τ_{h+1}}‖_1 × π(τ_{h+1}) ≤ ε_1. Therefore, we have for all h ∈ [H-1] and policy π, Σ_{τ_h} ‖b_{τ_h} - b̄_{τ_h}‖_1 × π(τ_h) ≤ ε_1. Notice that from Lemma 4 and Assumption 2 (where we let m̄_{o_H,H} = m_{o_H,H} = e_{o_H,H}), we have for any policy π, Σ_{τ_H} |P^π_f(τ_H) - P^π_{f̄}(τ_H)| ≤ Σ_{τ_{H-1}} ‖b_{τ_{H-1}} - b̄_{τ_{H-1}}‖_1 × π(τ_{H-1}).
Combining (11) and (12), we have for any policy π, Σ_{τ_H} |P^π_f(τ_H) - P^π_{f̄}(τ_H)| ≤ ε_1. This concludes our proof.
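The Lipschitz property proved above, that a small entrywise perturbation of the observable operators moves the whole trajectory distribution by a proportionally small amount in total variation, can be illustrated on a toy uncontrolled model. This is a hedged numerical sketch (a two-state HMM stands in for a PSR; all sizes and the constant in the bound are illustrative, not the constants of Lemma 11):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
nS, nO, H = 2, 2, 3  # toy sizes

# Observable operators A_o[s', s] = T[s', s] * O[o, s'] play the role of M_{o,a,h}.
T = rng.random((nS, nS)); T /= T.sum(axis=0)      # column-stochastic transition
Obs = rng.random((nO, nS)); Obs /= Obs.sum(axis=0)  # column-stochastic emission
q0 = rng.random(nS); q0 /= q0.sum()                 # initial belief, role of q_0
A = [np.diag(Obs[o]) @ T for o in range(nO)]

def traj_prob(ops, q):
    # P(o_{1:H}) = 1^T A_{o_H} ... A_{o_1} q, as in Lemma 4's product form.
    return {seq: float(np.ones(nS) @ np.linalg.multi_dot(
        [ops[o] for o in reversed(seq)]) @ q)
            for seq in product(range(nO), repeat=H)}

eps = 1e-4
A_pert = [M + eps * rng.random((nS, nS)) for M in A]  # entrywise <= eps perturbation
p, p_pert = traj_prob(A, q0), traj_prob(A_pert, q0)
tv = sum(abs(p[s] - p_pert[s]) for s in p)

assert abs(sum(p.values()) - 1.0) < 1e-9  # unperturbed model is a valid distribution
assert tv <= 10 * H * nS * eps            # total variation is linear in eps (loose constant)
print(tv)
```

The point mirrored here is the one driving the ε-bracket bound: an ε_op-cover of the operator parameters induces an O(poly · ε_op)-bracket of the trajectory distributions.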

H REDUNDANCY OF m (o,a,u),h;f

In Section 4 we mention that there is redundancy in the choice of m_{(o,a,u),h;f} given the model P^π_f, and we can replace any m_{(o,a,u),h;f} with its projection onto the space spanned by {q_{τ_{h-1};f}}_{τ_{h-1}}. The following lemma characterizes this formally: Lemma 12. Suppose Assumption 2 holds. Given any parameters {M_{o,a,h;f}, q_{0;f}}, suppose for another set of parameters {M̄_{o,a,h;f}, q_{0;f}} we have for all o ∈ O, a ∈ A, h ∈ [H-1], u ∈ U_{h+1}, m̄_{(o,a,u),h;f} = Proj_{Col(K_{h-1;f})}(m_{(o,a,u),h;f}). Then for any trajectory τ_H and policy π, we have P̄^π_f(τ_H) = P^π_f(τ_H). This means that {m̄_{(o,a,u),h;f}} is also a valid set of predictive parameters for the model P_f. Proof. We first show that b̄_{τ_h;f} = b_{τ_h;f} for any h ∈ [H-1]^+ and trajectory τ_h. We prove this via induction. For the base case where h = 0, b̄_{τ_0;f} = b_{τ_0;f} = q_{0;f}. Next for any h ∈ [H-2]^+, we suppose b̄_{τ_{h'};f} = b_{τ_{h'};f} for any h' ∈ [h]^+ and trajectory τ_{h'}. Then for any trajectory τ_{h+1}, let π_{τ_h} denote the policy that always takes the action sequence in τ_h. From (4) in Lemma 4, we have b_{τ_{h+1};f} = M_{o_{h+1},a_{h+1},h+1;f} b_{τ_h;f} = M_{o_{h+1},a_{h+1},h+1;f} b_{τ_h;f} π_{τ_h}(τ_h) = M_{o_{h+1},a_{h+1},h+1;f} q_{τ_h;f} P^{π_{τ_h}}(τ_h) = (m_{(o_{h+1},a_{h+1},u),h+1;f}^⊤ q_{τ_h;f})_{u∈U_{h+2}} P^{π_{τ_h}}(τ_h). Similarly, since b̄_{τ_h;f} = b_{τ_h;f}, we have b̄_{τ_{h+1};f} = (m̄_{(o_{h+1},a_{h+1},u),h+1;f}^⊤ q_{τ_h;f})_{u∈U_{h+2}} P^{π_{τ_h}}(τ_h). From (2), we know q_{τ_h;f} belongs to the column space of K_{h;f}. This implies that for any u ∈ U_{h+2}, m̄_{(o_{h+1},a_{h+1},u),h+1;f}^⊤ q_{τ_h;f} = m_{(o_{h+1},a_{h+1},u),h+1;f}^⊤ q_{τ_h;f}. Combining (13), (14) and (15), we have b̄_{τ_{h+1};f} = b_{τ_{h+1};f}. Therefore, for any h ∈ [H-1]^+ and trajectory τ_h, we have b̄_{τ_h;f} = b_{τ_h;f}. This suggests that for any policy π and trajectory τ_{H-1}, we have (P̄_f(u|τ_{H-1}) P̄^π_f(τ_{H-1}))_{u∈U_H} = (P_f(u|τ_{H-1}) P^π_f(τ_{H-1}))_{u∈U_H}. Therefore with Assumption 2, for any policy π and trajectory τ_H, we have P̄^π_f(τ_H) = P^π_f(τ_H).
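The linear-algebra fact behind Lemma 12 is that an orthogonal projection onto Col(K) does not change inner products with vectors that already lie in Col(K). A minimal numerical sketch (all sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n_tests, d_core = 6, 3  # hypothetical |U_h| and rank of the core matrix

K = rng.random((n_tests, d_core))     # stands in for the core matrix K_{h-1;f}
P = K @ np.linalg.pinv(K)             # orthogonal projector onto Col(K)

m = rng.random(n_tests)               # stands in for a row m_{(o,a,u),h;f}
m_proj = P @ m                        # its projection, as in Lemma 12

q = K @ rng.random(d_core)            # any predictive state lies in Col(K)
assert np.isclose(m @ q, m_proj @ q)  # predictions on core-spanned states are unchanged
print(m @ q - m_proj @ q)
```

Since P is symmetric and Pq = q for q ∈ Col(K), we get m_proj · q = m · (Pq) = m · q, which is exactly the step used to show b̄ = b in the induction.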

I PROOF OF THEOREM 1

In this section we present a proof sketch for Theorem 1. Note that to prove Theorem 1, we only need to show that CRANE can achieve sublinear total suboptimality, which is stated in the following theorem: Theorem 2. Under Assumptions 1, 2 and 3, there exists an absolute constant c such that for any δ ∈ (0, 1], T ∈ N, if we choose β = c log(N_F(ε_b)TH|U_A|/δ) in CRANE where ε_b = 1/(TH|U_A|), then with probability at least 1 - δ, we have: Σ_{k=1}^T (V^* - V^{π^k}) ≤ O(d_PSR^2 H^{7/2} |U_A|^4 |A|^2 T^{1/2} α^{-3} · log(THN_F(ε_b)|O||A|/δ)). The √T bound on the regret in Theorem 2 implies that the uniform mixture of the output policies π̄ = Unif({π^k}_{k=1}^T) is an ε-optimal policy when T = O(1/ε^2), leading to Theorem 1 directly. Therefore we only need to prove Theorem 2 now. Note that we can decompose the total suboptimality into the following terms: Regret(T) = Σ_{k=1}^T (V^* - V^{π^k}) = Σ_{k=1}^T (V^* - V^{π^k}_{f^k}) + Σ_{k=1}^T (V^{π^k}_{f^k} - V^{π^k}). Our proof bounds these two terms separately and mainly consists of four steps: 1. Prove V^{π^k}_{f^k} is an optimistic estimate of V^* for all k ∈ [T], which implies that term (1) ≤ 0. 2. Decompose term (2) into the estimation error of the parameters M_{o,a,h} via the system dynamics (3). 3. Bound the cumulative estimation error using the property of MLE. Proof. See Appendix J.1. Then since V^{π^k}_{f^k} = max_{f∈B_k,π} V^π_f, we know for all k ∈ [T], V^{π^k}_{f^k} ≥ V^{π^*}_{f^*} = V^*. Thus, Lemma 13 implies that V^{π^k}_{f^k} is an optimistic estimate of V^* for all k, and therefore term (1) in (16) is non-positive.

I.2 STEP 2: DECOMPOSE THE PERFORMANCE DIFFERENCE

Next we aim to handle term (2) in (16) and show the estimation error Σ_{k=1}^T (V^{π^k}_{f^k} - V^{π^k}) is small. First we need to decompose the performance difference V^{π^k}_{f^k} - V^{π^k} into the estimation error of the parameters M_{o,a,h} in order to apply the property of MLE later. Notice that we have V^{π^k}_{f^k} - V^{π^k} ≤ H Σ_{τ_H} |P^{π^k}_{f^k}(τ_H) - P^{π^k}_{f^*}(τ_H)| = H Σ_{τ_H} |(m^k_{o_H,H})^⊤ ( Π_{h=1}^{H-1} M^k_{o_h,a_h,h} ) q^k_0 - m_{o_H,H}^⊤ ( Π_{h=1}^{H-1} M_{o_h,a_h,h} ) q_0| × π^k(τ_H) = H Σ_{τ_{H-1}} ‖( Π_{h=1}^{H-1} M^k_{o_h,a_h,h} ) q^k_0 - ( Π_{h=1}^{H-1} M_{o_h,a_h,h} ) q_0‖_1 × π^k(τ_{H-1}), where we use m^k_{o_H,H}, M^k_{o_h,a_h,h}, q^k_0 to denote m_{o_H,H;f^k}, M_{o_h,a_h,h;f^k}, q_{0;f^k} and m_{o_H,H}, M_{o_h,a_h,h}, q_0 to denote m_{o_H,H;f^*}, M_{o_h,a_h,h;f^*}, q_{0;f^*}. Σ_{τ_h} ‖( Π_{l=1}^h M^k_{o_l,a_l,l} ) q^k_0 - ( Π_{l=1}^h M_{o_l,a_l,l} ) q_0‖_1 × π(τ_h) ≤ (|U_A|/α)( Σ_{l=1}^h Σ_{τ_l} ‖[M^k_{o_l,a_l,l} - M_{o_l,a_l,l}] b_{τ_{l-1}}‖_1 × π(τ_l) + ‖q^k_0 - q_0‖_1 ), where b_{τ_l} = ( Π_{j=1}^l M_{o_j,a_j,j} ) q_0. Remark 3. From the proof of Lemma 14, we can see that Lemma 14 only utilizes the properties of {M^k_{o_l,a_l,l}, q^k_0}_{l∈[h]}. Therefore Lemma 14 still holds even if the system dynamics induced by {M_{o_l,a_l,l}, q_0}_{l∈[h]} is invalid. We will use this fact in the analysis of the ε-bracket number. Therefore, substituting Lemma 14 into (17), we can bound the performance difference by the cumulative estimation error: Σ_{k=1}^T (V^{π^k}_{f^k} - V^{π^k}) ≤ (|U_A|H/α) Σ_{k=1}^T ( Σ_{h=1}^{H-1} Σ_{τ_h} ‖[M^k_{o_h,a_h,h} - M_{o_h,a_h,h}] b_{τ_{h-1}}‖_1 × π^k(τ_h) + ‖q^k_0 - q_0‖_1 ).

I.3 STEP 3: BOUND THE ESTIMATION ERROR

Now we need to bound the estimation error in (19). First we introduce the following guarantee of MLE from the literature, which connects the log-likelihood ratio log(P^π_{f^*}(τ_H)/P^π_f(τ_H)) and the total variation Σ_{τ_H} |P^π_f(τ_H) - P^π_{f^*}(τ_H)|: Lemma 15 ([37, Proposition 14]).
There exists a universal constant c_0 such that for any δ ∈ (0, 1], for all k ∈ [T] and f ∈ F, we have with probability at least 1 - δ/2 that Σ_{i=1}^k Σ_{h∈[H-1]^+, u_{a,h+1}∈U_{A,h+1}} ( Σ_{τ_H} |P^{π^{i,u_{a,h+1},h}}_f(τ_H) - P^{π^{i,u_{a,h+1},h}}_{f^*}(τ_H)| )^2 ≤ c_0 ( Σ_{i=1}^k Σ_{h∈[H-1]^+, u_{a,h+1}∈U_{A,h+1}} log( P^{π^{i,u_{a,h+1},h}}_{f^*}(τ^{i,u_{a,h+1},h}_H) / P^{π^{i,u_{a,h+1},h}}_f(τ^{i,u_{a,h+1},h}_H) ) + log(N_F(ε_b)TH|U_A|/δ) ). Combining Lemma 15 and the fact that both f^* and f^k belong to B_k, we have with probability at least 1 - δ that for all k ∈ [T], Σ_{i=1}^{k-1} Σ_{h∈[H-1]^+, u_{a,h+1}∈U_{A,h+1}} ( Σ_{τ_H} |P^{π^{i,u_{a,h+1},h}}_{f^k}(τ_H) - P^{π^{i,u_{a,h+1},h}}_{f^*}(τ_H)| )^2 ≤ O(β). The following discussion is conditioned on the event in (20) being true. Then by the Cauchy-Schwarz inequality we have for all k ∈ [T], Σ_{i=1}^{k-1} Σ_{h∈[H-1]^+, u_{a,h+1}∈U_{A,h+1}} Σ_{τ_H} |P^{π^{i,u_{a,h+1},h}}_{f^k}(τ_H) - P^{π^{i,u_{a,h+1},h}}_{f^*}(τ_H)| ≤ O(√(kH|U_A|β)). Suppose the length of the longest action sequence in U_{A,h} is l_a. Then since the environment will only generate dummy observations o_dummy after a_H, we have for any policy π and f ∈ F, Σ_{τ_{H+l_a+1}} |P^π_f(τ_{H+l_a+1}) - P^π_{f^*}(τ_{H+l_a+1})| = Σ_{τ_{H+l_a+1}} |P^π_f(τ_H)1(o_{H+1:H+l_a+1} = o_dummy)π(a_{H+1:H+l_a+1}|τ_H) - P^π_{f^*}(τ_H)1(o_{H+1:H+l_a+1} = o_dummy)π(a_{H+1:H+l_a+1}|τ_H)| = Σ_{τ_H} |P^π_f(τ_H) - P^π_{f^*}(τ_H)|.
Therefore, we can marginalize the distributions P^π_f(τ_H) and P^π_{f^*}(τ_H) in (21) and obtain for all k ∈ [T], i ∈ [k-1], h ∈ [H-1]^+, Σ_{u_{a,h+1}∈U_{A,h+1}} Σ_{τ_H} |P^{π^{i,u_{a,h+1},h}}_{f^k}(τ_H) - P^{π^{i,u_{a,h+1},h}}_{f^*}(τ_H)| ≥ Σ_{u_{a,h+1}∈U_{A,h+1}} Σ_{τ_h, o∈O(u_{a,h+1})} |P^{π^{i,u_{a,h+1},h}}_{f^k}(τ_h, u_{a,h+1}, o) - P^{π^{i,u_{a,h+1},h}}_{f^*}(τ_h, u_{a,h+1}, o)| = Σ_{u_{a,h+1}∈U_{A,h+1}} Σ_{τ_h, o∈O(u_{a,h+1})} |P^{π^{i,h}}_{f^k}(τ_h)P_{f^k}(o|τ_h; do(u_{a,h+1})) - P^{π^{i,h}}_{f^*}(τ_h)P_{f^*}(o|τ_h; do(u_{a,h+1}))| × π^{i,u_{a,h+1},h}(u_{a,h+1}|τ_h) = Σ_{τ_h} Σ_{u_{a,h+1}∈U_{A,h+1}, o∈O(u_{a,h+1})} |P^{π^{i,h}}_{f^k}(τ_h)P_{f^k}(o|τ_h; do(u_{a,h+1})) - P^{π^{i,h}}_{f^*}(τ_h)P_{f^*}(o|τ_h; do(u_{a,h+1}))| = Σ_{τ_h} Σ_{u∈U_{h+1}} |P^{π^{i,h}}_{f^k}(τ_h)P_{f^k}(u|τ_h) - P^{π^{i,h}}_{f^*}(τ_h)P_{f^*}(u|τ_h)|. Here in the first step O(u_{a,h+1}) denotes the set of observation sequences that occur together with u_{a,h+1} in U_{h+1}, and P^π_f(τ_h, u_{a,h+1}, o) denotes the joint probability of observing the trajectory (τ_h, o, u_{a,h+1}). In the third and fourth steps we utilize the fact that π^{k,u_{a,h+1},h} = π^k_{1:h-1} ∘ Unif(A) ∘ u_{a,h+1} and we define π^{i,h} := π^i_{1:h-1} ∘ Unif(A). Then based on Eq. (4) and Eq. (21), we have for all k ∈ [T], h ∈ [H-1]^+, Σ_{i=1}^{k-1} Σ_{τ_h} π^{i,h}(τ_h) · ‖b^k_{τ_h} - b_{τ_h}‖_1 ≤ O(√(kH|U_A|β)). Thus via importance weighting, we have for all k ∈ [T], h ∈ [H-1]^+, Σ_{i=1}^{k-1} Σ_{τ_h} π^i(τ_h) · ‖b^k_{τ_h} - b_{τ_h}‖_1 ≤ O(|A|√(kH|U_A|β)), Σ_{i=1}^{k-1} Σ_{τ_h} π^i(τ_{h-1}) · ‖b^k_{τ_h} - b_{τ_h}‖_1 ≤ O(|A|√(kH|U_A|β)). In particular, when h = 0 we have ‖q^k_0 - q_0‖_1 ≤ O(√(H|U_A|β/k)). Now for all k ∈ [T], h ∈ [H-1], we can bound the estimation error as follows: Σ_{i=1}^{k-1} Σ_{τ_h} ‖[M^k_{o_h,a_h,h} - M_{o_h,a_h,h}]b_{τ_{h-1}}‖_1 × π^i(τ_{h-1}) ≤ Σ_{i=1}^{k-1} Σ_{τ_h} ‖M^k_{o_h,a_h,h}b^k_{τ_{h-1}} - M_{o_h,a_h,h}b_{τ_{h-1}}‖_1 × π^i(τ_{h-1}) + Σ_{i=1}^{k-1} Σ_{τ_h} ‖M^k_{o_h,a_h,h}[b^k_{τ_{h-1}} - b_{τ_{h-1}}]‖_1 × π^i(τ_{h-1}).
For the first term in (25), we have Σ_{i=1}^{k-1} Σ_{τ_h} ‖M^k_{o_h,a_h,h}b^k_{τ_{h-1}} - M_{o_h,a_h,h}b_{τ_{h-1}}‖_1 × π^i(τ_{h-1}) = Σ_{i=1}^{k-1} Σ_{τ_h} π^i(τ_{h-1}) · ‖b^k_{τ_h} - b_{τ_h}‖_1 ≤ O(|A|√(kH|U_A|β)), where the second step is due to (23). To bound the second term, we need to bound ‖M^k_{o,a,h}‖_{1,1} first, which is given in the following lemma: Lemma 16. For any 1 ≤ j_1 ≤ j_2 ≤ H-1, trajectory τ_{j_1-1}, policy π, f ∈ F and x ∈ R^{|U_{j_1}|}, we have Σ_{τ_{j_1:j_2}} ‖( Π_{j=j_1}^{j_2} M_{o_j,a_j,j;f} ) x‖_1 × π(τ_{j_1:j_2}|τ_{j_1-1}) ≤ (|U_A|/α)‖x‖_1. The proof of the above lemma uses the regularity condition in Assumption 3 and Lemma 12. Naively, the product Π_{j=j_1}^{j_2} M_{o_j,a_j,j;f} may suggest that the norm grows exponentially. However, since M_{o,a,h;f}'s row span belongs to the column span of K_{h-1;f} (which is derived from Lemma 12) and K^†_{h-1;f} exists, we have: ( Π_{j=j_1}^{j_2} M_{o_j,a_j,j;f} ) x = ( Π_{j=j_1}^{j_2} M_{o_j,a_j,j;f} ) K_{j_1-1;f} K^†_{j_1-1;f} x. Thus, we can bound ‖( Π_{j=j_1}^{j_2} M_{o_j,a_j,j;f} ) K_{j_1-1;f} e_l‖_1 by using the fact that K_{j_1-1;f} e_l is a predictive state q_{τ^l_{j_1-1;f};f} corresponding to one of the minimum core histories τ^l_{j_1-1;f}, and ( Π_{j=j_1}^{j_2} M_{o_j,a_j,j;f} ) q_{τ^l_{j_1-1;f};f} × π(τ_{j_1:j_2}|τ_{j_1-1}) = [P(u|τ^l_{j_1-1;f}, τ_{j_1:j_2}) P^{π_{τ_{j_1-1}}}(τ_{j_1:j_2}|τ^l_{j_1-1;f})]_{u∈U_{j_2+1}}, where π_{τ_{j_1-1}} denotes the policy π(·|τ_{j_1-1}). Note that the proof of the above lemma differs from the one in POMDPs, since here we leverage the concepts of minimum core histories and the core matrix, which are unique to PSRs. The details are deferred to Appendix J.3. Therefore, using Lemma 16 with π = π^i_{1:h-1} ∘ Unif(A), we have Σ_{i=1}^{k-1} Σ_{τ_h} ‖M^k_{o_h,a_h,h}[b^k_{τ_{h-1}} - b_{τ_{h-1}}]‖_1 × π^i(τ_{h-1}) ≤ (|A||U_A|/α) Σ_{i=1}^{k-1} Σ_{τ_{h-1}} π^i(τ_{h-1}) · ‖b^k_{τ_{h-1}} - b_{τ_{h-1}}‖_1 ≤ O(|A|^2 √(kH|U_A|^3 β)/α), where the last step comes from (22). Combining (26) and (27), we have for all k ∈ [T], h ∈ [H-1], Σ_{i=1}^{k-1} Σ_{τ_h} ‖[M^k_{o_h,a_h,h} - M_{o_h,a_h,h}]b_{τ_{h-1}}‖_1 × π^i(τ_{h-1}) ≤ O(|A|^2 √(kH|U_A|^3 β)/α).
I.4 STEP 4: CONNECT STEP 2 AND STEP 3

Recall that in Step 2 we want to bound Σ_{k=1}^T ( Σ_{h=1}^{H-1} Σ_{τ_h} ‖[M^k_{o_h,a_h,h} - M_{o_h,a_h,h}]b_{τ_{h-1}}‖_1 × π^k(τ_h) + ‖q^k_0 - q_0‖_1 ). First, for the second term, we can bound it via (24): Σ_{k=1}^T ‖q^k_0 - q_0‖_1 ≤ O(√(HT|U_A|β)). Now we only need to bound the first term. Notice that in (28) we have bounded this cumulative estimation error weighted by π^i(τ_{h-1}) rather than π^k(τ_{h-1}). Here we introduce the following lemma from [37] to bridge these two summations with different weights: Lemma 17 ([37, Proposition 22]). Suppose {x_{k,i}}_{(k,i)∈[T]×[n_1]}, {w_{k,j}}_{(k,j)∈[T]×[n_2]} ⊂ R^d satisfy for all k ∈ [T]: • Σ_{t=1}^{k-1} Σ_{i=1}^{n_1} Σ_{j=1}^{n_2} |w^⊤_{k,j} x_{t,i}| ≤ γ_k, • Σ_{i=1}^{n_1} ‖x_{k,i}‖_2 ≤ R_x, • Σ_{j=1}^{n_2} ‖w_{k,j}‖_2 ≤ R_w. Then we have for all k ∈ [T]: Σ_{t=1}^k Σ_{i=1}^{n_1} Σ_{j=1}^{n_2} |w^⊤_{t,j} x_{t,i}| = O( d( R_w R_x + max_{t≤k} γ_t ) log^2(Tn_1) ). To apply Lemma 17, for any fixed h ∈ [H-1], we rewrite (28) in the following way: Σ_{t=1}^{k-1} Σ_{u=1}^{|U_{h+1}|} Σ_{o,a} Σ_{τ_{h-1}} |( (M^k_{o,a,h} - M_{o,a,h})K_{h-1} )_u K^†_{h-1} b_{τ_{h-1}}| × π^t(τ_{h-1}) ≤ O(|A|^2 √(kH|U_A|^3 β)/α), where X_u is the u-th row of the matrix X. Here we utilize the fact that b_{τ_{h-1}} × π^t(τ_{h-1}) = (P[u|τ_{h-1}]P^{π^t}[τ_{h-1}])_{u∈U_h} belongs to the column space of K_{h-1} due to the definition of core history. Then for any t ∈ [T], u ∈ U_{h+1}, o ∈ O, a ∈ A, we let w_{t,u,o,a} denote ( (M^t_{o,a,h} - M_{o,a,h})K_{h-1} )_u and x_{t,τ_{h-1}} denote K^†_{h-1} b_{τ_{h-1}} × π^t(τ_{h-1}); then (30) can be written as: for any k ∈ [T], Σ_{t=1}^{k-1} Σ_{u∈U_{h+1},o∈O,a∈A} Σ_{τ_{h-1}} |w^⊤_{k,u,o,a} x_{t,τ_{h-1}}| ≤ O(|A|^2 √(kH|U_A|^3 β)/α). Now we only need to bound Σ_{τ_{h-1}} ‖x_{k,τ_{h-1}}‖_2 and Σ_{u∈U_{h+1},o∈O,a∈A} ‖w_{k,u,o,a}‖_2. For Σ_{τ_{h-1}} ‖x_{k,τ_{h-1}}‖_2, we have Σ_{τ_{h-1}} ‖x_{k,τ_{h-1}}‖_2 = Σ_{τ_{h-1}} ‖K^†_{h-1} b_{τ_{h-1}}‖_2 × π^k(τ_{h-1}) = Σ_{τ_{h-1}} ‖K^†_{h-1} [P(u|τ_{h-1})P^{π^k}(τ_{h-1})]_{u∈U_h}‖_2 = Σ_{τ_{h-1}} ‖v_{τ_{h-1}}‖_2 P^{π^k}(τ_{h-1}) ≤ max_{τ_{h-1}} ‖v_{τ_{h-1}}‖_2, where the third step comes from the definition of the core matrix (2).

Notice that we have

K_{h-1} v_{τ_{h-1}} = [P(u|τ_{h-1})]_{u∈U_h} for any τ_{h-1} and ‖K^†_{h-1}‖_{1→1} ≤ 1/α, which implies ‖v_{τ_{h-1}}‖_2 ≤ ‖v_{τ_{h-1}}‖_1 ≤ ‖K^†_{h-1}[P(u|τ_{h-1})]_{u∈U_h}‖_1 ≤ (1/α)‖[P(u|τ_{h-1})]_{u∈U_h}‖_1 ≤ |U_A|/α. Therefore we have for all k ∈ [T], Σ_{τ_{h-1}} ‖x_{k,τ_{h-1}}‖_2 ≤ |U_A|/α. For Σ_{u∈U_{h+1},o∈O,a∈A} ‖w_{k,u,o,a}‖_2, we have Σ_{u∈U_{h+1},o∈O,a∈A} ‖w_{k,u,o,a}‖_2 ≤ Σ_{u∈U_{h+1},o∈O,a∈A} ‖w_{k,u,o,a}‖_1 = Σ_{o∈O,a∈A} Σ_{l=1}^{d_{PSR,h-1}} ‖(M^k_{o,a,h} - M_{o,a,h})K_{h-1}e_l‖_1 ≤ (2|A||U_A|/α) Σ_{l=1}^{d_{PSR,h-1}} ‖K_{h-1}e_l‖_1 ≤ 2|A||U_A|^2 d_PSR/α, where the third step utilizes Lemma 16 with the uniform policy and the last step utilizes the facts d_{PSR,h-1} ≤ d_PSR and ‖K_{h-1}e_l‖_1 = ‖q_{τ^l_{h-1}}‖_1 ≤ |U_A|. Invoking Lemma 17 with (32), (33), (31), we can obtain for all k ∈ [T], h ∈ [H-1], Σ_{i=1}^k Σ_{τ_h} ‖[M^k_{o_h,a_h,h} - M_{o_h,a_h,h}]b_{τ_{h-1}}‖_1 × π^k(τ_{h-1}) ≤ O(d_{PSR,h-1} |U_A|^3 |A|^2 d_PSR H^{3/2} k^{1/2} α^{-2} · log(THN_F(ε_b)|O||A|/δ)). Substituting (29), (34) into (19), we have Σ_{k=1}^T (V^{π^k}_{f^k} - V^{π^k}) ≤ O(d_PSR^2 H^{7/2} |U_A|^4 |A|^2 T^{1/2} α^{-3} · log(THN_F(ε_b)|O||A|/δ)). Combining the above result with Step 1, we have Σ_{k=1}^T (V^* - V^{π^k}) ≤ O(d_PSR^2 H^{7/2} |U_A|^4 |A|^2 T^{1/2} α^{-3} · log(THN_F(ε_b)|O||A|/δ)). This concludes our proof.
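The norm bound used above, ‖v_{τ_{h-1}}‖_1 ≤ ‖K†_{h-1}‖_{1→1} ‖[P(u|τ_{h-1})]_u‖_1, only relies on the definition of the induced 1→1 operator norm (the maximum column ℓ1-norm). A minimal numerical sketch with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
K = rng.random((5, 2))          # toy core matrix (|U_h| x d_PSR)
K_pinv = np.linalg.pinv(K)      # K^dagger

# Induced 1->1 norm equals the maximum column l1-norm.
op_norm = np.abs(K_pinv).sum(axis=0).max()

p = rng.random(5); p /= p.sum()  # stands in for [P(u|tau_{h-1})]_u, ||p||_1 = 1
v = K_pinv @ p
# ||K^dagger p||_1 <= ||K^dagger||_{1->1} * ||p||_1
assert np.sum(np.abs(v)) <= op_norm * np.sum(np.abs(p)) + 1e-12
print(np.sum(np.abs(v)), op_norm)
```

In the proof, Assumption 3 supplies ‖K†_{h-1}‖_{1→1} ≤ 1/α, and the core tests contribute ‖[P(u|τ_{h-1})]_u‖_1 ≤ |U_A|, giving the stated |U_A|/α bound.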

J PROOFS OF LEMMAS IN APPENDIX I J.1 PROOF OF LEMMA 13

To prove f^* ∈ B_k, we need to show that Σ_{(π,τ_H)∈D} log P^π_{f^*}(τ_H) is large. To simplify writing, we denote the (π, τ_H) pairs in D at the end of the T-th iteration by {(π^i, τ^i_H)}_{i=1}^{n_T}, which are indexed by their collection order. Notice that n_T ≤ TH|U_A|. To deal with a potentially infinite function class F, we first consider its minimum ε_b-bracket net G where ε_b = 1/(TH|U_A|) and the set of all upper-bound functions in G, i.e., G_u := {f̄ : ∃f such that [f, f̄] ∈ G}. Then we are able to bound the difference between Σ_{(π,τ_H)∈D} log P^π_{f^*}(τ_H) and Σ_{(π,τ_H)∈D} log P^π_f(τ_H) for any f ∈ F via the Cramér-Chernoff method as in [37]. Fix any f̄ ∈ G_u, t ∈ [n_T] and let F_t denote the filtration induced by {(π^i, τ^i)}_{i=1}^{t-1} ∪ {π^t}. We have: E[exp(Σ_{i=1}^t log(P^{π^i}_{f̄}(τ^i_H)/P^{π^i}_{f^*}(τ^i_H)))] = E[exp(Σ_{i=1}^{t-1} log(P^{π^i}_{f̄}(τ^i_H)/P^{π^i}_{f^*}(τ^i_H))) · E[exp(log(P^{π^t}_{f̄}(τ^t_H)/P^{π^t}_{f^*}(τ^t_H))) | F_t]] = E[exp(Σ_{i=1}^{t-1} log(P^{π^i}_{f̄}(τ^i_H)/P^{π^i}_{f^*}(τ^i_H))) · E[P^{π^t}_{f̄}(τ^t_H)/P^{π^t}_{f^*}(τ^t_H) | F_t]] = E[exp(Σ_{i=1}^{t-1} log(P^{π^i}_{f̄}(τ^i_H)/P^{π^i}_{f^*}(τ^i_H))) · Σ_{τ_H} P^{π^t}_{f̄}(τ_H)] ≤ E[exp(Σ_{i=1}^{t-1} log(P^{π^i}_{f̄}(τ^i_H)/P^{π^i}_{f^*}(τ^i_H)))] · (1 + 1/(TH|U_A|)), where the last step is due to the fact that G is the minimum ε_b-bracket net, which implies that there exists f ∈ F such that ‖P^π_{f̄}(·) - P^π_f(·)‖_1 ≤ ε_b for any policy π and thus ‖P^π_{f̄}(·)‖_1 ≤ 1 + ε_b. Repeating the above argument, we have E[exp(Σ_{i=1}^t log(P^{π^i}_{f̄}(τ^i_H)/P^{π^i}_{f^*}(τ^i_H)))] ≤ e. Then by Markov's inequality we have for any δ ∈ (0, 1], P( Σ_{i=1}^t log(P^{π^i}_{f̄}(τ^i_H)/P^{π^i}_{f^*}(τ^i_H)) > log(1/δ) ) ≤ E[exp(Σ_{i=1}^t log(P^{π^i}_{f̄}(τ^i_H)/P^{π^i}_{f^*}(τ^i_H)))] · exp[-log(1/δ)] ≤ eδ. Therefore by a union bound, we have for all f̄ ∈ G_u, t ∈ [n_T], P( Σ_{i=1}^t log(P^{π^i}_{f̄}(τ^i_H)/P^{π^i}_{f^*}(τ^i_H)) > c log(N_F(ε_b)TH|U_A|/δ) ) ≤ δ/2, where c is a universal constant.
Finally, due to the definition of the ε-bracket net, we know for all f ∈ F, there exists f̄ ∈ G_u such that P^π_f(τ_H) ≤ P^π_{f̄}(τ_H) for any trajectory τ_H and policy π. Therefore we have for all f ∈ F, t ∈ [n_T], P( Σ_{i=1}^t log(P^{π^i}_f(τ^i_H)/P^{π^i}_{f^*}(τ^i_H)) > c log(N_F(ε_b)TH|U_A|/δ) ) ≤ δ/2, which implies that f^* ∈ B_k for all k ∈ [T] with probability at least 1 - δ/2. This concludes our proof.

J.2 PROOF OF LEMMA 14

First, notice that we can decompose the left-hand side of (18) into the following sequence of terms via the triangle inequality: Σ_{τ_h} ‖( Π_{l=1}^h M^k_{o_l,a_l,l} ) q^k_0 - ( Π_{l=1}^h M_{o_l,a_l,l} ) q_0‖_1 × π(τ_h) ≤ Σ_{j=1}^h Σ_{τ_h} ‖( Π_{l=j+1}^h M^k_{o_l,a_l,l} )( M^k_{o_j,a_j,j} - M_{o_j,a_j,j} ) b_{τ_{j-1}}‖_1 × π(τ_h) + Σ_{τ_h} ‖( Π_{l=1}^h M^k_{o_l,a_l,l} )( q^k_0 - q_0 )‖_1 × π(τ_h). Then fix j ∈ [h] and consider the term Σ_{τ_h} ‖( Π_{l=j+1}^h M^k_{o_l,a_l,l} )( M^k_{o_j,a_j,j} - M_{o_j,a_j,j} ) b_{τ_{j-1}}‖_1 × π(τ_h) in (35). We can bound it as in (36), where the last step comes from Lemma 16. Similarly, applying Lemma 16 to the second part of (35), we have Σ_{τ_h} ‖( Π_{l=1}^h M^k_{o_l,a_l,l} )( q^k_0 - q_0 )‖_1 × π(τ_h) ≤ (|U_A|/α)‖q^k_0 - q_0‖_1. Substituting (36) and (37) into (35), we can obtain Σ_{τ_h} ‖( Π_{l=1}^h M^k_{o_l,a_l,l} ) q^k_0 - ( Π_{l=1}^h M_{o_l,a_l,l} ) q_0‖_1 × π(τ_h) ≤ (|U_A|/α)( Σ_{l=1}^h Σ_{τ_l} ‖[M^k_{o_l,a_l,l} - M_{o_l,a_l,l}]b_{τ_{l-1}}‖_1 × π(τ_l) + ‖q^k_0 - q_0‖_1 ). This concludes our proof.
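The key step of the Cramér-Chernoff argument in J.1 is that the conditional expectation of the likelihood ratio under the true model equals the total mass of the bracket upper bound, which exceeds 1 by at most ε_b. A minimal numerical sketch on a finite outcome space (all sizes and distributions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
p_star = rng.random(n); p_star /= p_star.sum()  # true trajectory distribution
eps_b = 0.01
# A bracket upper bound: dominates p_star and has total mass at most 1 + eps_b.
p_bar = p_star + eps_b * rng.random(n) / n

# E_{x~p*}[p_bar(x)/p_star(x)] = sum_x p_bar(x) <= 1 + eps_b.
expectation = np.sum(p_star * (p_bar / p_star))
assert np.isclose(expectation, p_bar.sum())
assert expectation <= 1 + eps_b
print(expectation)
```

Iterating this bound n_T ≤ TH|U_A| times with ε_b = 1/(TH|U_A|) gives (1 + ε_b)^{n_T} ≤ e, which is exactly why the martingale moment in the proof stays bounded by e.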

J.3 PROOF OF LEMMA 16

First, based on Lemma 12, we have chosen m_{(o,a,u),j_1;f} to belong to the column space of K_{j_1-1;f}, which implies that the product can be passed through K_{j_1-1;f} K^†_{j_1-1;f}. Note that since the l-th column of K_{j_1-1;f} is q_{τ^l_{j_1-1;f};f}, the predictive state of the l-th core history τ^l_{j_1-1;f} at step j_1-1 under the model induced by f, we have for any l ∈ [d_{PSR,j_1-1;f}] the bound below. Here in the second step π((o_{j_1:j_2}, a_{j_1:j_2})|τ_{j_1-1}) denotes Π_{j=j_1}^{j_2} π(a_j|τ_{j_1-1}, o_{j_1:j}, a_{j_1:j-1}) and the third step comes from Lemma 4.

Therefore we have

≤ |U_A| ‖K^†_{j_1-1;f} x‖_1 ≤ (|U_A|/α)‖x‖_1, where the third step comes from Assumption 3. This concludes our proof.

K NECESSITY OF 1/α IN THEOREM 1

In this section we show that the polynomial dependence on 1/α in Theorem 1 is inevitable in general. More specifically, we have the following theorem: Theorem 3. For any 0 < α < 1/(2√2) and H, |A| ∈ N^+, there exists a PSR with core test set U_h = O for h ∈ [H] and |S| = |O| = O(1) which satisfies Assumption 1, such that any algorithm requires at least Ω(min{1/(αH), |A|^{H-1}}) samples to learn a (1/2)-optimal policy with probability 1/6 or higher. Theorem 3 indicates that scaling with 1/α is unavoidable, or else the algorithm will require an exponential number of samples to learn a near-optimal policy. The proof is deferred to Appendix K.1.

K.1 PROOF OF THEOREM 3

We leverage the hard instance constructed in [37] to prove the lower bound, which is based on a combinatorial lock. More specifically, we define a POMDP as follows: • State space: There are two states, S = {s_g, s_b}. • Observation space and emission matrices: There are three observations, O = {o_g, o_b, o_dummy}. For h ∈ [H-1], we define the emission matrix (rows indexed by o_g, o_b, o_dummy and columns by s_g, s_b) as O_h = [[√2α, 0], [0, √2α], [1-√2α, 1-√2α]]. For h = H, we have O_H = [[1, 0], [0, 1], [0, 0]]. This means that with probability √2α we can observe the current state and with probability 1-√2α we only receive a dummy observation at step h ∈ [H-1]. At step H, though, we are able to observe the current state. • Action space and transition kernels: There are |A| actions and the initial state is fixed as s_g. For each step h ∈ [H-1], there exists a good action a_{g,h} ∈ A, chosen uniformly at random from A, such that if the agent is currently in s_h = s_g and takes a_{g,h}, it will stay in s_g, i.e., s_{h+1} = s_g. Otherwise, the agent will always go to s_{h+1} = s_b. • Reward: We define r_h(o) = 0 for all h ∈ [H-1] and o ∈ O. At step H, r_H(o_g) = 1 while r_H(o_b) = 0. This indicates that the agent receives reward 1 iff it takes a_{g,h} at every step along the way. Since this POMDP satisfies the weakly-revealing condition, we know O is its core test set. Next we show that this POMDP satisfies Assumption 1. First it can be observed that K_0 = q_0 = (√2α, 0, 1-√2α)^⊤ and we can verify that ‖K^†_0‖_{1→1} ≤ 1/α. Then for any h ∈ [H-1] and reachable history τ_h, if a_{1:h} = a_{g,1:h}, we have ‖K^†_h‖_{1→1} ≤ √2 ‖K^†_h‖_{2→2} ≤ √2/(√2α) = 1/α. This shows that the constructed POMDP satisfies Assumption 1. Now we only need to show that the constructed POMDP attains the lower bound in Theorem 3. This has been proved in [37] and we include the proof here for completeness. Suppose we can only interact with the POMDP for T ≤ 1/(2√2αH) episodes.
Then we know the probability that both s_g and s_b only emit o_dummy in the first H-1 steps for all T episodes is lower bounded by (1-√2α)^{1/(√2α)}, since 2 · (1/(2√2αH)) · (H-1) ≤ 1/(√2α). Now conditioned on the event that both s_g and s_b only emit o_dummy in the first H-1 steps for all T episodes, we can only randomly guess the optimal action sequence a_{g,1:H-1}. Then if T ≤ |A|^{H-1}/10, the probability that we fail to guess the optimal action sequence is C(|A|^{H-1}-1, T)/C(|A|^{H-1}, T) ≥ 0.9, where C(·,·) denotes the binomial coefficient. Therefore, with probability 0.9 × (1-√2α)^{1/(√2α)} ≥ 1/6, the agent only learns that the action sequences it chose in these T episodes are incorrect, which implies that the agent can only randomly guess from the remaining action sequences. Therefore, if T ≤ |A|^{H-1}/10, the policy that the agent outputs will be worse than (1/2)-optimal, which concludes our proof.
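The counting step of this argument, that T ≤ |A|^{H-1}/10 distinct guesses miss the hidden good action sequence with probability at least 0.9, can be checked directly. A minimal sketch with hypothetical |A| and H:

```python
from math import comb

A, H = 3, 6                   # hypothetical action-space size and horizon
n_seq = A ** (H - 1)          # number of candidate action sequences (the "lock")
T = n_seq // 10               # episode budget from the proof

# Probability that T distinct guesses all miss the hidden sequence:
# C(n_seq - 1, T) / C(n_seq, T) = 1 - T / n_seq >= 0.9.
p_miss = comb(n_seq - 1, T) / comb(n_seq, T)
assert p_miss >= 0.9
print(p_miss)  # 1 - 24/243, approximately 0.901
```

The ratio of binomial coefficients collapses to 1 - T/n_seq, so with a budget one tenth of the lock size the failure probability is exactly 0.9 or more, matching the proof.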



Corollary 2 only considers a finite function class F. With the above discussion, we can extend it to infinite function classes as follows: Corollary 4 (Sample complexity for m-step weakly-revealing low-rank POMDPs). Suppose low-rank POMDPs are m-step weakly-revealing, and we execute CRANE with β = c log(N_F(ε_b)TH|U_A|/δ) up to step H - m, where U_h and N_F(ε_b) are specified in Table

a), linear POMDPs are also low-rank POMDPs with dimension min{d 1 , d 2 } and thus for linear POMDPs we have d PSR ≤ d lin .

also discusses linear POMDPs and achieves a similar polynomial sample complexity. However, they only consider the undercomplete setting (i.e., |O| ≥ |S|) and assume {(ψ_h(s))_i}_{s∈S} is a distribution on S for any i ∈ [d_2]. In addition, they not only assume that the transition and emission are linear, but also impose a linear structure on the state distribution conditioned on future observations, as in Cai et al. [7, Assumption 2.2]. Therefore our model is more general and requires fewer assumptions.

A.5 m-STEP DECODABLE TABULAR/LOW-RANK/LINEAR POMDPS

Next, we supplement the discussion of m-step decodable POMDPs in Section 5.

4. Bound term (2) by connecting the results of the second and third steps.

I.1 STEP 1: PROVE OPTIMISM

First we can show that the constructed set B_k contains the true model parameter f^* with high probability: Lemma 13. With probability at least 1 - δ/2, we have for all k ∈ [T], f^* ∈ B_k.

Σ_{τ_h} ‖( Π_{l=j+1}^h M^k_{o_l,a_l,l} )( M^k_{o_j,a_j,j} - M_{o_j,a_j,j} ) b_{τ_{j-1}}‖_1 π(τ_h) = Σ_{τ_j} π(τ_j) Σ_{τ_{j+1:h}} ‖( Π_{l=j+1}^h M^k_{o_l,a_l,l} )( M^k_{o_j,a_j,j} - M_{o_j,a_j,j} ) b_{τ_{j-1}}‖_1 π(τ_{j+1:h}|τ_j) ≤ (|U_A|/α) Σ_{τ_j} ‖( M^k_{o_j,a_j,j} - M_{o_j,a_j,j} ) b_{τ_{j-1}}‖_1

Σ_{τ_{j_1:j_2}} ‖( Π_{j=j_1}^{j_2} M_{o_j,a_j,j;f} ) x‖_1 π(τ_{j_1:j_2}|τ_{j_1-1}) = Σ_{τ_{j_1:j_2}} ‖( Π_{j=j_1}^{j_2} M_{o_j,a_j,j;f} ) K_{j_1-1;f} K^†_{j_1-1;f} x‖_1 π(τ_{j_1:j_2}|τ_{j_1-1}).

Σ_{τ_{j_1:j_2}} ‖( Π_{j=j_1}^{j_2} M_{o_j,a_j,j;f} ) K_{j_1-1;f} e_l‖_1 π(τ_{j_1:j_2}|τ_{j_1-1}) = Σ_{o_{j_1:j_2}} Σ_{a_{j_1:j_2}} ‖( Π_{j=j_1}^{j_2} M_{o_j,a_j,j;f} ) K_{j_1-1;f} e_l‖_1 π((o_{j_1:j_2}, a_{j_1:j_2})|τ_{j_1-1}) = Σ_{o_{j_1:j_2}} Σ_{a_{j_1:j_2}} Σ_{u∈U_{j_2+1}} P_f(u|(τ^l_{j_1-1}, o_{j_1:j_2}, a_{j_1:j_2})) P_f(o_{j_1:j_2}|τ^l_{j_1-1;f}; do(a_{j_1:j_2-1})) π((o_{j_1:j_2}, a_{j_1:j_2})|τ_{j_1-1}) = Σ_{o_{j_1:j_2}} Σ_{a_{j_1:j_2}} ( Σ_{u∈U_{j_2+1}} P_f(u|(τ^l_{j_1-1}, o_{j_1:j_2}, a_{j_1:j_2})) ) P_f(o_{j_1:j_2}|τ^l_{j_1-1;f}; do(a_{j_1:j_2-1})) π((o_{j_1:j_2}, a_{j_1:j_2})|τ_{j_1-1}) ≤ |U_A| Σ_{o_{j_1:j_2}} Σ_{a_{j_1:j_2}} P_f(o_{j_1:j_2}|τ^l_{j_1-1;f}; do(a_{j_1:j_2-1})) π((o_{j_1:j_2}, a_{j_1:j_2})|τ_{j_1-1}) ≤ |U_A|.

Σ_{τ_{j_1:j_2}} ‖( Π_{j=j_1}^{j_2} M_{o_j,a_j,j;f} ) K_{j_1-1;f} K^†_{j_1-1;f} x‖_1 π(τ_{j_1:j_2}|τ_{j_1-1})

P(s_{h+1} = s_g|τ_h) = 1, P(s_{h+1} = s_b|τ_h) = 0, which implies that P(o_{h+1} = o_g|τ_h) = √2α, P(o_{h+1} = o_b|τ_h) = 0, P(o_{h+1} = o_dummy|τ_h) = 1-√2α. Otherwise, if there exists h' ∈ [h] such that a_{h'} ≠ a_{g,h'}, then we have P(s_{h+1} = s_b|τ_h) = 1, P(s_{h+1} = s_g|τ_h) = 0, which implies P(o_{h+1} = o_b|τ_h) = √2α, P(o_{h+1} = o_g|τ_h) = 0, P(o_{h+1} = o_dummy|τ_h) = 1-√2α. This suggests that K_h = O_{h+1} for h ∈ [H-1]. On the other hand, since σ_min(K_h) = σ_min(O_{h+1}) ≥ √2α, we have for h ∈ [H-1],

[Table 2] Core test sets, minimum core test size and bracket number for POMDP models: tabular POMDPs have log N_F(ε) ≤ poly(|O|, |A|, |S|, H, log(1/ε)); low-rank POMDPs have d_PSR ≤ d_trans (refer to Appendix A.2); linear POMDPs have d_PSR ≤ d_lin and log N_F(ε) ≤ poly(d_lin, H, log(|O||A|/ε)). Here all the POMDPs we consider are m-step weakly-revealing or m-step decodable. The exact function classes F we use are elaborated in the following discussion.

where ψ_h : S → R^{d_trans} and φ_h : S × A → R^{d_trans} are unknown feature vectors. Then, we call these POMDPs low-rank POMDPs. The low-rank structure leads to a smaller minimum core test set size than general POMDPs, since we can show d_PSR ≤ d_trans as in Appendix E. Weakly-revealing low-rank POMDPs are defined

EXAMPLES: MORE DETAILS AND MODELS

In this section, we supplement the details in Section 5 and illustrate the sample complexity of CRANE for learning tabular PSRs and several other POMDPs in comparison with existing algorithms. We consider two types of POMDPs: m-step weakly-revealing POMDPs and m-step decodable POMDPs. Assumptions like the weakly-revealing property and decodability allow us to identify core test sets. Core test sets and function classes. In this case we can choose U_h to be the set of all m-step futures (O × A)^{m-1} × O. For the function class F, we first let it model the parameters {T_{h;f}, O_{h;f}, µ_{1;f}}_{h∈[H]} directly, lift weakly-revealing POMDPs to the PSR formulation, and then pre-process it to satisfy Assumption 3. The corresponding ε-bracket number N_F(ε) is shown in Table 2. Besides, since now the core test set is (O × A)^{m-1} × O, we let m

Notations of PSRs. We also refer readers to Figure 1 for an illustration of the notations such as U_h, D_h, D̄_h, and K_h.

Condition 1]). For any h ∈ [H -l + 1], s h ∈ S and any )), where P m denotes the probability under MDP m. Let L s h

The second step is due to Lemma 4 and the last step is because, based on Assumption 2, we have set m^k_{o_H,H} = m_{o_H,H} = e_{o_H,H}. The following lemma bridges the term in (17) and the estimation error of M_{o,a,h}, whose proof is deferred to Appendix J.2: Lemma 14. For any k ∈ [T], h ∈ [H] and policy π, we have

G ε-BRACKET NUMBER OF F

In this section we introduce some basic properties of the ε-bracket number N_F(ε). We first consider PSRs and then take POMDPs as special examples.

G.1 PSRS

For PSRs, let us define the covering number for the parameters {M_{o,a,h;f}, q_{0;f}} as follows: Definition 7 (ε-covering number). The ε-covering number of F, denoted by Z_F(ε), is the minimum integer n such that there exists a function class F̄ with |F̄| = n and for any f ∈ F there exists f̄ ∈ F̄ such that max_{o∈O,a∈A,h∈[H-1],u∈U_{h+1}} ‖m_{(o,a,u),h;f} - m_{(o,a,u),h;f̄}‖_∞ ≤ ε and ‖q_{0;f} - q_{0;f̄}‖_∞ ≤ ε. Here F̄ does not need to be a valid PSR model class and m_{(o,a,u),h;f̄} does not need to belong to the column space of K_{h;f̄}. That said, we still use the same product formula to define P^π_{f̄}, although this might no longer be a valid distribution. Then the following lemma shows that the bracket number can be upper bounded by the covering number, whose proof is deferred to Appendix G.3. Lemma 9. Given F and any ε > 0, suppose Assumptions 1, 2 and 3 hold; then we have N_F(ε) ≤ Z_F(αε/(8|O|^{H+1}|A|^H H|U_A|^2|U|)). Since the log covering number log Z_F(ε) typically scales with log(1/ε), Lemma 9 shows that log N_F(ε) also scales polynomially with H in general. Tabular PSRs. Let us consider the tabular case for example, where we directly use {M_{o,a,h}, q_0}_{o∈O,a∈A,h∈[H-1]} as the parameters of F and assume the corresponding boundedness condition holds for all f ∈ F. Then we have the following lemma: Lemma 10. For any f, f̄ ∈ F satisfying a perturbation condition analogous to that of Lemma 11, for any policy π, the analogous total-variation bound holds. The proof is omitted here since it follows similar arguments to the proof of Lemma 11. Therefore, following the arguments in the proof of Lemma 9, we know N_F(ε) ≤ V_F(ε/(28|O|^{H+2}|A|^H)).
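The remark that log Z_F(ε) typically scales with log(1/ε) can be made concrete on the simplest case: an ε-cover of a bounded parameter cube in the sup norm. This is an illustrative sketch (a d-dimensional unit cube stands in for the flattened parameters {M_{o,a,h}, q_0}; the dimensions are hypothetical):

```python
from math import ceil, log

# Sup-norm epsilon-cover of [0,1]^d: a grid with spacing 2*eps per axis
# suffices, so log |cover| = d * log(ceil(1/(2*eps))).
def log_cover(d, eps):
    return d * log(ceil(1 / (2 * eps)))

d = 4  # hypothetical parameter dimension
for eps in (0.1, 0.01, 0.001):
    print(eps, log_cover(d, eps))
```

Each tenfold decrease in ε adds roughly d·log(10) to the log covering number, which is the linear-in-log(1/ε) growth that, combined with Lemma 9, makes log N_F(ε) polynomial in the parameter count and H.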

