PARTIALLY OBSERVABLE RL WITH B-STABILITY: UNIFIED STRUCTURAL CONDITION AND SHARP SAMPLE-EFFICIENT ALGORITHMS

Abstract

Partial observability, where agents can only observe partial information about the true underlying state of the system, is ubiquitous in real-world applications of Reinforcement Learning (RL). Theoretically, learning a near-optimal policy under partial observability is known to be hard in the worst case due to an exponential sample complexity lower bound. Recent work has identified several tractable subclasses that are learnable with polynomial samples, such as Partially Observable Markov Decision Processes (POMDPs) with certain revealing or decodability conditions. However, this line of research is still in its infancy, in that (1) unified structural conditions enabling sample-efficient learning are lacking; (2) existing sample complexities for known tractable subclasses are far from sharp; and (3) fewer sample-efficient algorithms are available than in fully observable RL. This paper advances all three aspects above for partially observable RL in the general setting of Predictive State Representations (PSRs). First, we propose a natural and unified structural condition for PSRs called B-stability. B-stable PSRs encompass the vast majority of known tractable subclasses, such as weakly revealing POMDPs, low-rank future-sufficient POMDPs, decodable POMDPs, and regular PSRs. Next, we show that any B-stable PSR can be learned with polynomial samples in all relevant problem parameters. When instantiated in the aforementioned subclasses, our sample complexities improve substantially over the current best ones. Finally, our results are achieved by three algorithms simultaneously: Optimistic Maximum Likelihood Estimation, Estimation-to-Decisions, and Model-Based Optimistic Posterior Sampling. The latter two algorithms are new for sample-efficient learning of POMDPs/PSRs.
We additionally design a variant of the Estimation-to-Decisions algorithm to perform sample-efficient all-policy model estimation for B-stable PSRs, which also yields guarantees for reward-free learning as an implication.

1. INTRODUCTION

Partially observable reinforcement learning (RL), where agents can only observe partial information about the true underlying state of the system, is ubiquitous in real-world applications of RL such as robotics (Akkaya et al., 2019), strategic games (Brown & Sandholm, 2018; Vinyals et al., 2019; Berner et al., 2019), economic simulation (Zheng et al., 2020), and so on. Partially observable RL defies the standard efficient approaches for learning and planning in the fully observable case (e.g. those based on dynamic programming) due to the non-Markovian nature of the observations (Jaakkola et al., 1994), and has been a hard challenge for RL research. Theoretically, it is well established that partially observable RL is statistically hard in the worst case: in the standard setting of Partially Observable Markov Decision Processes (POMDPs), learning a near-optimal policy has an exponential sample complexity lower bound in the horizon length (Mossel & Roch, 2005; Krishnamurthy et al., 2016), in stark contrast to fully observable MDPs, where polynomial sample complexity is possible (Kearns & Singh, 2002; Jaksch et al., 2010; Azar et al., 2017).

Table 1: Comparisons of sample complexities for learning an ε near-optimal policy in POMDPs and PSRs. Definitions of the problem parameters can be found in Section 3.2. The last three rows refer to the m-step versions of the problem classes (e.g. the third row considers m-step α_rev-revealing POMDPs). The current best results within the last four rows are due to Zhan et al. (2022); Liu et al. (2022a); Wang et al. (2022); Efroni et al. (2022), respectively.¹ All results are scaled to the setting with total reward in [0, 1].

Problem Class                       | Current Best                                                  | Ours
Λ_B-stable PSR                      | --                                                            | Õ(d_PSR A U_A H^2 log N_Θ · Λ_B^2 / ε^2)
α_psr-regular PSR                   | Õ(d_PSR^4 A^4 U_A^9 H^6 log(N_Θ O) / (α_psr^6 ε^2))           | Õ(d_PSR A U_A^2 H^2 log N_Θ / (α_psr^2 ε^2))
α_rev-revealing tabular POMDP       | Õ(S^4 A^{6m-4} H^6 log N_Θ / (α_rev^4 ε^2))                   | Õ(S^2 A^m H^2 log N_Θ / (α_rev^2 ε^2))
ν-future-suff. rank-d_trans POMDP   | Õ(d_trans^4 A^{5m+3l+1} H^2 (log N_Θ)^2 · ν^4 γ^2 / ε^2)      | Õ(d_trans A^{2m-1} H^2 log N_Θ · ν^2 / ε^2)
decodable rank-d_trans POMDP        | Õ(d_trans A^m H^2 log N_G / ε^2)                              | Õ(d_trans A^m H^2 log N_Θ / ε^2)

A later line of work identifies various additional structural conditions or alternative learning goals that enable sample-efficient learning, such as reactiveness (Jiang et al., 2017), revealing conditions (Jin et al., 2020a; Liu et al., 2022c; Cai et al., 2022; Wang et al., 2022), decodability (Du et al., 2019; Efroni et al., 2022), and learning memoryless or short-memory policies (Azizzadenesheli et al., 2018; Uehara et al., 2022b). Despite this progress, research on sample-efficient partially observable RL is still at an early stage, with several important questions remaining open. First, to a large extent, existing tractable structural conditions are mostly identified and analyzed in a case-by-case manner and lack a more unified understanding. This question has just started to be tackled in the very recent work of Zhan et al. (2022), who show that sample-efficient learning is possible in the more general setting of Predictive State Representations (PSRs) (Littman & Sutton, 2001), which include POMDPs as a special case, under a certain regularity condition. However, their regularity condition is defined in terms of additional quantities (such as "core matrices") not directly encoded in the definition of PSRs, which makes it unnatural in many known examples and unable to subsume important tractable problems such as decodable POMDPs.
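To make the PSR objects concrete, here is a minimal sketch (illustrative only, not from this paper: the transition, emission, and initial-distribution matrices are made up, and we specialize to a single action and a single-step, square-emission revealing condition) of the observable operators of a toy tabular POMDP. It checks that these operators, which act only on observable prediction vectors, reproduce the same observation-sequence probabilities as the latent belief-state model.

```python
# Toy illustration: observable operators for a 2-state, 2-observation,
# single-action POMDP whose emission matrix is invertible -- a special
# case of the revealing condition. All matrices below are made-up examples.

def matmul(A, B):
    """Multiply two 2x2 matrices (given as lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(2)) for i in range(2)]

def inv2(A):
    """Inverse of a 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    assert abs(det) > 1e-12, "emission matrix must be invertible (revealing)"
    return [[ A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det,  A[0][0] / det]]

# Emission Obs[o][s] = P(observation o | state s); columns sum to 1.
Obs = [[0.8, 0.3],
       [0.2, 0.7]]
# Transition T[s'][s] = P(s' | s); column-stochastic.
T = [[0.9, 0.2],
     [0.1, 0.8]]
b0 = [0.6, 0.4]  # initial state distribution

def diag_row(o):
    """diag(Obs[o, :]) as a 2x2 matrix."""
    return [[Obs[o][0], 0.0], [0.0, Obs[o][1]]]

# Observable operator B(o) = Obs @ T @ diag(Obs[o,:]) @ Obs^{-1}: it updates
# the *predictive* state q = Obs @ b without referencing the latent belief b.
B = {o: matmul(matmul(Obs, matmul(T, diag_row(o))), inv2(Obs)) for o in (0, 1)}
q0 = matvec(Obs, b0)

def prob_via_operators(obs_seq):
    q = q0
    for o in obs_seq:
        q = matvec(B[o], q)
    return sum(q)  # 1^T Obs^{-1}(...) reduces to 1^T since columns of Obs sum to 1

def prob_via_beliefs(obs_seq):
    b = b0
    for o in obs_seq:
        b = matvec(T, matvec(diag_row(o), b))
    return sum(b)

seqs = [(i, j) for i in (0, 1) for j in (0, 1)]
total = sum(prob_via_operators(s) for s in seqs)
print(total)  # probabilities of all length-2 observation sequences sum to 1
```

When the emission matrix is not left-invertible, no such operators exist on the observation space alone, which is one intuition for why unrevealing POMDPs can be statistically hard.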
Second, even in known sample-efficient problems such as revealing POMDPs (Jin et al., 2020c; Liu et al., 2022a), existing sample complexities involve large polynomial factors of the relevant problem parameters that are likely far from sharp. Third, relatively few principles are known for designing sample-efficient algorithms in POMDPs/PSRs, such as spectral or tensor-based approaches (Hsu et al., 2012; Azizzadenesheli et al., 2016; Jin et al., 2020c), maximum likelihood or density estimation (Liu et al., 2022a; Wang et al., 2022; Zhan et al., 2022), and learning short-memory policies (Efroni et al., 2022; Uehara et al., 2022b). This contrasts with fully observable RL, where the space of sample-efficient algorithms is much more diverse (Agarwal et al., 2019). It is an important question whether we can expand the space of algorithms for partially observable RL.

This paper advances all three aspects above for partially observable RL. We define B-stability, a natural and general structural condition for PSRs, and design sharp algorithms for learning any B-stable PSR sample-efficiently. Our contributions can be summarized as follows.

• We identify a new structural condition for PSRs termed B-stability, which simply requires the B-representation (or observable operators) of the PSR to be bounded in a suitable operator norm (Section 3.1). B-stable PSRs subsume most known tractable subclasses, such as revealing POMDPs, decodable POMDPs, low-rank future-sufficient POMDPs, and regular PSRs (Section 3.2).

• We show that B-stable PSRs can be learned sample-efficiently by three algorithms simultaneously, with sharp sample complexities (Section 4): Optimistic Maximum Likelihood Estimation (OMLE), Explorative Estimation-to-Decisions (EXPLORATIVE E2D), and Model-Based Optimistic Posterior Sampling (MOPS). To our best knowledge, this work is the first to show that the latter two algorithms are sample-efficient in partially observable RL.
• Our sample complexities improve substantially over the current best when instantiated in both regular PSRs (Section 4.1) and known tractable subclasses of POMDPs (Section 5). For example, for m-step α_rev-revealing POMDPs with S latent states, our algorithms find an ε near-optimal policy within Õ(S^2 A^m log N / (α_rev^2 ε^2)) episodes of play (with S^2 / α_rev^2 replaced by
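For reference, one common way to formalize the revealing condition appearing in such bounds (our paraphrase of the single-step case, in the style of Jin et al. (2020c); the exact m-step definition used in this work is the one given in Section 3.2) is a singular-value lower bound on the emission matrices:

```latex
% Single-step case: each emission matrix O_h in R^{O x S} is well-conditioned,
\sigma_S(\mathbb{O}_h) \;\ge\; \alpha_{\mathrm{rev}}
\qquad \Longleftrightarrow \qquad
\bigl\| \mathbb{O}_h^{\dagger} \bigr\|_{\mathrm{op}} \;\le\; 1/\alpha_{\mathrm{rev}},
```

so that the distribution over the S latent states can be stably recovered from observation probabilities; the m-step version imposes an analogous bound on the matrix of m-step future observation probabilities.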



¹ For ν-future-sufficient POMDPs, the sample complexity of Wang et al. (2022) depends on γ, which is an additional l-step past-sufficiency parameter that they require.




