PARTIALLY OBSERVABLE RL WITH B-STABILITY: UNIFIED STRUCTURAL CONDITION AND SHARP SAMPLE-EFFICIENT ALGORITHMS

Abstract

Partial observability, where agents can only observe partial information about the true underlying state of the system, is ubiquitous in real-world applications of Reinforcement Learning (RL). Theoretically, learning a near-optimal policy under partial observability is known to be hard in the worst case due to an exponential sample complexity lower bound. Recent work has identified several tractable subclasses that are learnable with a polynomial number of samples, such as Partially Observable Markov Decision Processes (POMDPs) with certain revealing or decodability conditions. However, this line of research is still in its infancy, where (1) unified structural conditions enabling sample-efficient learning are lacking; (2) existing sample complexities for known tractable subclasses are far from sharp; and (3) fewer sample-efficient algorithms are available than in fully observable RL. This paper advances all three aspects above for partially observable RL in the general setting of Predictive State Representations (PSRs). First, we propose a natural and unified structural condition for PSRs called B-stability. B-stable PSRs encompass the vast majority of known tractable subclasses, such as weakly revealing POMDPs, low-rank future-sufficient POMDPs, decodable POMDPs, and regular PSRs. Next, we show that any B-stable PSR can be learned with a number of samples polynomial in the relevant problem parameters. When instantiated in the aforementioned subclasses, our sample complexities improve substantially over the current best ones. Finally, our results are achieved by three algorithms simultaneously: Optimistic Maximum Likelihood Estimation, Estimation-to-Decisions, and Model-Based Optimistic Posterior Sampling. The latter two are new algorithms for sample-efficient learning of POMDPs/PSRs.
We additionally design a variant of the Estimation-to-Decisions algorithm that performs sample-efficient all-policy model estimation for B-stable PSRs, which in turn yields guarantees for reward-free learning.

1. INTRODUCTION

Partially Observable Reinforcement Learning (RL), where agents can only observe partial information about the true underlying state of the system, is ubiquitous in real-world applications of RL such as robotics (Akkaya et al., 2019), strategic games (Brown & Sandholm, 2018; Vinyals et al., 2019; Berner et al., 2019), economic simulation (Zheng et al., 2020), and so on. Partially observable RL defies the standard efficient approaches for learning and planning in the fully observable case (e.g., those based on dynamic programming) due to the non-Markovian nature of the observations (Jaakkola et al., 1994), and has been a hard challenge for RL research. Theoretically, it is well established that learning in partially observable RL is statistically hard in the worst case: in the standard setting of Partially Observable Markov Decision Processes (POMDPs), learning a near-optimal policy has an exponential sample complexity lower bound in the horizon length (Mossel & Roch, 2005; Krishnamurthy et al., 2016), in stark contrast to fully observable MDPs, where polynomial sample complexity is possible (Kearns & Singh, 2002; Jaksch et al.,
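To make the source of the difficulty concrete, the following minimal sketch simulates a tabular POMDP under the standard (states, actions, observations, transition, emission) definition: the hidden state evolves Markovianly, but the learner records only actions and observations, whose sequence is in general non-Markovian. All sizes and the random parameters here are illustrative, not taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_obs = 3, 2, 2

# Transition tensor: T[s, a] is a distribution over next hidden states.
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
# Emission matrix: O[s] is a distribution over observations given hidden state s.
O = rng.dirichlet(np.ones(n_obs), size=n_states)

def step(s, a):
    """Advance the hidden state and emit an observation; only the observation is revealed."""
    s_next = rng.choice(n_states, p=T[s, a])
    obs = rng.choice(n_obs, p=O[s_next])
    return s_next, obs

s = 0  # hidden initial state, never shown to the learner
trajectory = []
for t in range(5):
    a = int(rng.integers(n_actions))
    s, obs = step(s, a)
    trajectory.append((a, int(obs)))  # the learner's data: actions and observations only

print(trajectory)
```

Because the learner never sees `s`, the usual dynamic-programming recursions over states do not apply directly; any sufficient statistic must be built from the full action-observation history, which is what drives the worst-case exponential lower bounds discussed above.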

