REPRESENT TO CONTROL PARTIALLY OBSERVED SYSTEMS: REPRESENTATION LEARNING WITH PROVABLE SAMPLE EFFICIENCY

Abstract

Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing these challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Represent to Control (RTC), which learns the representation at two levels while optimizing the policy. (i) At each step, RTC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, RTC learns to represent the full history with a low-dimensional embedding, which assembles the per-step features. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, RTC attains an O(1/ϵ²) sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank), where ϵ is the optimality gap. To the best of our knowledge, RTC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.

1. INTRODUCTION

Deep reinforcement learning demonstrates significant empirical successes in Markov decision processes (MDPs) with large state spaces (Mnih et al., 2013; 2015; Silver et al., 2016; 2017). Such empirical successes are attributed to the integration of representation learning into reinforcement learning: mapping the state to a low-dimensional feature enables model/value learning and optimal control in a sample-efficient manner. Meanwhile, it is increasingly well understood in theory that the low-dimensional feature is the key to sample efficiency in the linear setting (Cai et al., 2020; Jin et al., 2020b; Ayoub et al., 2020; Agarwal et al., 2020; Modi et al., 2021; Uehara et al., 2021). In contrast, partially observed Markov decision processes (POMDPs) with large observation and state spaces remain significantly more challenging. Due to the lack of the Markov property, the low-dimensional feature of the observation at each step is insufficient for the prediction and control of the future (Sondik, 1971; Papadimitriou and Tsitsiklis, 1987; Coates et al., 2008; Azizzadenesheli et al., 2016; Guo et al., 2016). Instead, it is necessary to obtain a low-dimensional embedding of the history, which assembles the low-dimensional features across multiple steps (Hefny et al., 2015; Sun et al., 2016). In practice, learning such features and embeddings requires various heuristics, e.g., recurrent neural network architectures and auxiliary tasks (Hausknecht and Stone, 2015; Li et al., 2015; Mirowski et al., 2016; Girin et al., 2020). In theory, the best results are restricted to the tabular setting (Azizzadenesheli et al., 2016; Guo et al., 2016; Jin et al., 2020a; Liu et al., 2022), which does not involve representation learning. To this end, we identify a class of POMDPs with a low-rank structure on the state transition kernel (but not on the observation emission kernel), which allows prediction and control in a sample-efficient manner.
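To make the low-rank structure concrete, the following sketch builds a transition kernel T(s' | s, a) = ⟨φ(s, a), μ(s')⟩ whose intrinsic dimension is the rank d. This is an illustration only, not the paper's construction: the paper allows infinite observation and state spaces, whereas this example discretizes them, and all names and sizes (`phi`, `mu`, `n_states`, `d`) are hypothetical.

```python
import numpy as np

# Hypothetical sizes for illustration; d is the intrinsic dimension (the rank).
rng = np.random.default_rng(0)
n_states, n_actions, d = 50, 4, 3

# Feature phi(s, a) in R^d and factor mu(s') in R^d, constructed so that
# T(s' | s, a) = <phi(s, a), mu(s')> is a valid probability kernel:
# nonnegative entries that sum to 1 over s'.
phi = rng.random((n_states, n_actions, d))
phi /= phi.sum(axis=2, keepdims=True)   # each phi(s, a) lies in the simplex
mu = rng.random((d, n_states))
mu /= mu.sum(axis=1, keepdims=True)     # each component mu_k is a distribution

# The full kernel has n_states * n_actions * n_states entries,
# but only rank-d structure: T[s, a, :] = phi[s, a, :] @ mu.
T = np.einsum("sak,kt->sat", phi, mu)

assert np.all(T >= 0)
assert np.allclose(T.sum(axis=2), 1.0)  # valid transition probabilities
# The matricized kernel has rank at most d, far below n_states.
assert np.linalg.matrix_rank(T.reshape(n_states * n_actions, n_states)) <= d
```

Normalizing each φ(s, a) into the simplex and each μ_k into a distribution is one simple way to guarantee a valid kernel; it mirrors the simplex-feature construction common in the low-rank MDP literature, though the paper's exact conditions on φ and μ may differ.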
More specifically, the transition admits a low-rank factorization into two unknown features, whose dimension is the rank. On top of the low-rank transition, we define a Bellman operator, which performs a forward update for any finite-length trajectory. The Bellman operator allows us to further factorize the history across multiple steps to obtain its embedding, which assembles the per-step features. By integrating the two levels of representation learning, that is, (i) feature learning at each step and (ii) embedding learning across multiple steps, we propose a sample-efficient algorithm, namely Represent to Control (RTC), for POMDPs with infinite observation and state spaces. The key to RTC is balancing exploitation and exploration along the representation learning process. To this end, we construct a confidence set of embeddings upon identifying and estimating the Bellman operator, which further allows efficient exploration via optimistic planning. It is worth mentioning that such a unified framework allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). We analyze the sample efficiency of RTC under the future and past sufficiency assumptions. In particular, such assumptions ensure that the future and past observations are sufficient for identifying the belief state, which captures the information-theoretic difficulty of POMDPs. We prove that RTC attains an O(1/ϵ²) sample complexity that scales polynomially with the horizon and the dimension of the feature (that is, the rank of the transition), where ϵ is the optimality gap. The polynomial dependency on the horizon is attributed to embedding learning across multiple steps, while the polynomial dependency on the dimension is attributed to feature learning at each step, which is the key to bypassing the infinite sizes of the observation and state spaces.

Contributions. In summary, our contribution is threefold.
• We identify a class of POMDPs with a low-rank transition, which allows representation learning and reinforcement learning in a sample-efficient manner.
• We propose RTC, a principled approach that integrates embedding learning and control in low-rank POMDPs.
• We establish the sample efficiency of RTC in low-rank POMDPs with infinite observation and state spaces.

Related Work. Our work follows the previous studies of POMDPs. In general, solving a POMDP is intractable from both the computational and the statistical perspectives (Papadimitriou and Tsitsiklis, 1987; Vlassis et al., 2012; Azizzadenesheli et al., 2016; Guo et al., 2016; Jin et al., 2020a). Given such computational and statistical barriers, previous works attempt to identify tractable subclasses of POMDPs.

In particular, Azizzadenesheli et al. (2016); Guo et al. (2016); Jin et al. (2020a); Liu et al. (2022) consider tabular POMDPs with (left) invertible emission matrices. Efroni et al. (2022) consider POMDPs where the state is fully determined by the most recent observations of a fixed length. Cayci et al. (2022) analyze POMDPs where a finite internal state can approximately determine the state. In contrast, we analyze POMDPs with a low-rank transition and allow the state and observation spaces to be arbitrarily large. Meanwhile, our analysis hinges on the future and past sufficiency assumptions, which only require that the density of the state is identified by that of the future and

