POPGYM: BENCHMARKING PARTIALLY OBSERVABLE REINFORCEMENT LEARNING

Abstract

Real-world applications of Reinforcement Learning (RL) are often partially observable, thus requiring memory. Despite this, partial observability is still largely ignored by contemporary RL benchmarks and libraries. We introduce Partially Observable Process Gym (POPGym), a two-part library containing (1) a diverse collection of 15 partially observable environments, each with multiple difficulties, and (2) implementations of 13 memory model baselines, the most in a single RL library. Existing partially observable benchmarks tend to fixate on 3D visual navigation, which is computationally expensive and represents only one type of POMDP. In contrast, POPGym environments are diverse, produce smaller observations, use less memory, and often converge within two hours of training on a consumer-grade GPU. We implement our high-level memory API and memory baselines on top of the popular RLlib framework, providing plug-and-play compatibility with various training algorithms, exploration strategies, and distributed training paradigms. Using POPGym, we execute the largest comparison across RL memory models to date.

1. INTRODUCTION

Datasets like MNIST (Lecun et al., 1998) have driven advances in Machine Learning (ML) as much as new architectural designs (Levine et al., 2020). Comprehensive datasets are paramount in assessing the progress of learning algorithms and highlighting shortcomings of current methodologies. This is evident in the context of RL, where the absence of fast and comprehensive benchmarks resulted in a reproducibility crisis (Henderson et al., 2018). Large collections of diverse environments, like the Arcade Learning Environment, OpenAI Gym, ProcGen, and DMLab, provide a reliable measure of progress in deep RL. These fundamental benchmarks play a role in RL equivalent to that of MNIST in supervised learning (SL). The vast majority of today's RL benchmarks are designed around Markov Decision Processes (MDPs). In MDPs, agents observe a Markov state, which contains all necessary information to solve the task at hand. When the observations are Markov states, the Markov property is satisfied, and traditional RL approaches guarantee convergence to an optimal policy (Sutton & Barto, 2018, Chapter 3). But in many RL applications, observations are ambiguous, incomplete, or noisy, any of which turns the MDP into a partially observable MDP (POMDP) (Kaelbling et al., 1998), breaking the Markov property and all convergence guarantees. Furthermore, Ghosh et al. (2021) find that policies trained under the ideal MDP framework cannot generalize to real-world conditions when deployed, with epistemic uncertainty turning real-world MDPs into POMDPs. By introducing memory (referred to as sequence-to-sequence models in SL), we can summarize the observations (see footnote), therefore restoring policy convergence guarantees for POMDPs (Sutton & Barto, 2018, Chapter 17.3). Despite the importance of memory in RL, most of today's comprehensive benchmarks are fully observable or near-fully observable.
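To make the role of memory concrete, the sketch below folds a sequence of observations into a single fixed-size hidden state with a simple Elman-style recurrence. This is an illustration only; the weight shapes and dimensions are arbitrary, and the paper's actual baselines include richer memory models.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden_dim = 4, 8  # illustrative sizes, not taken from the paper

# Randomly initialized recurrent weights (no training shown here).
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.1, size=(hidden_dim, obs_dim))

def summarize(observations):
    """Fold an observation history into one fixed-size summary vector."""
    h = np.zeros(hidden_dim)
    for o in observations:
        h = np.tanh(W_h @ h + W_x @ o)
    return h

episode = [rng.normal(size=obs_dim) for _ in range(10)]
h = summarize(episode)  # a memory-based policy conditions on h, not the raw history
```

Because `h` has a fixed size regardless of episode length, the policy can treat it as a surrogate Markov state, which is what restores the convergence guarantees discussed above.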
Existing partially observable benchmarks are often navigation-based, representing only spatial POMDPs and ignoring applications like policymaking, disease diagnosis, teaching, and ecology (Cassandra, 1998). The state of memory-based models in RL libraries is even more dire: we are not aware of any RL libraries that implement more than three memory models.



Footnote: Strictly speaking, the agent's actions are also required to guarantee convergence. We consider the previous action as part of the current observation.
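The footnote's convention can be illustrated with a small wrapper that appends a one-hot encoding of the previous discrete action to every observation. The class and environment below are assumptions for illustration, not the POPGym API.

```python
import numpy as np

class ToyEnv:
    """Toy stand-in environment (illustrative only, not part of POPGym)."""
    def reset(self):
        return np.zeros(3, dtype=np.float32)

    def step(self, action):
        obs = np.full(3, float(action), dtype=np.float32)
        return obs, 0.0, False, {}

class PrevActionWrapper:
    """Hypothetical wrapper: folds the previous discrete action (as a
    one-hot vector) into each observation, per the footnote's convention."""
    def __init__(self, env, num_actions):
        self.env = env
        self.num_actions = num_actions

    def _augment(self, obs, action):
        onehot = np.zeros(self.num_actions, dtype=np.float32)
        if action is not None:
            onehot[action] = 1.0
        return np.concatenate([np.asarray(obs, dtype=np.float32).ravel(), onehot])

    def reset(self):
        # No previous action at episode start: the one-hot slot stays all zeros.
        return self._augment(self.env.reset(), None)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._augment(obs, action), reward, done, info

env = PrevActionWrapper(ToyEnv(), num_actions=2)
obs0 = env.reset()           # shape (5,): 3 observation dims + 2 action dims
obs1, _, _, _ = env.step(1)  # last two entries encode the action just taken
```

With this convention, the memory model sees both observations and actions, satisfying the stricter requirement for convergence without changing the agent's interface.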

Availability

POPGym is available at https://github.com/proroklab/popgym.

