POPGYM: BENCHMARKING PARTIALLY OBSERVABLE REINFORCEMENT LEARNING

Abstract

Real-world applications of Reinforcement Learning (RL) are often partially observable, thus requiring memory. Despite this, partial observability is still largely ignored by contemporary RL benchmarks and libraries. We introduce Partially Observable Process Gym (POPGym), a two-part library containing (1) a diverse collection of 15 partially observable environments, each with multiple difficulties, and (2) implementations of 13 memory model baselines, the most in a single RL library. Existing partially observable benchmarks tend to fixate on 3D visual navigation, which is computationally expensive and represents only one type of POMDP. In contrast, POPGym environments are diverse, produce smaller observations, use less memory, and often converge within two hours of training on a consumer-grade GPU. We implement our high-level memory API and memory baselines on top of the popular RLlib framework, providing plug-and-play compatibility with various training algorithms, exploration strategies, and distributed training paradigms. Using POPGym, we execute the largest comparison across RL memory models to date.

1. INTRODUCTION

Datasets like MNIST (LeCun et al., 1998) have driven advances in Machine Learning (ML) as much as new architectural designs (Levine et al., 2020). Comprehensive datasets are paramount in assessing the progress of learning algorithms and highlighting shortcomings of current methodologies. This is evident in the context of RL, where the absence of fast and comprehensive benchmarks resulted in a reproducibility crisis (Henderson et al., 2018). Large collections of diverse environments, like the Arcade Learning Environment, OpenAI Gym, ProcGen, and DMLab, provide a reliable measure of progress in deep RL. These fundamental benchmarks play a role in RL equivalent to that of MNIST in supervised learning (SL).

The vast majority of today's RL benchmarks are designed around Markov Decision Processes (MDPs). In MDPs, agents observe a Markov state, which contains all information necessary to solve the task at hand. When the observations are Markov states, the Markov property is satisfied, and traditional RL approaches guarantee convergence to an optimal policy (Sutton & Barto, 2018, Chapter 3). But in many RL applications, observations are ambiguous, incomplete, or noisy, any of which turns the MDP into a Partially Observable MDP (POMDP) (Kaelbling et al., 1998), breaking the Markov property and all convergence guarantees. Furthermore, Ghosh et al. (2021) find that policies trained under the ideal MDP framework fail to generalize to real-world conditions when deployed, with epistemic uncertainty turning real-world MDPs into POMDPs. By introducing memory (referred to as sequence-to-sequence models in SL), we can summarize the observations,¹ thereby restoring policy convergence guarantees for POMDPs (Sutton & Barto, 2018, Chapter 17.3).

Despite the importance of memory in RL, most of today's comprehensive benchmarks are fully observable or near-fully observable. Existing partially observable benchmarks are often navigation-based, representing only spatial POMDPs and ignoring applications like policymaking, disease diagnosis, teaching, and ecology (Cassandra, 1998). The state of memory-based models in RL libraries is even more dire: we are not aware of any RL libraries that implement more than three or four distinct memory baselines, and in nearly all cases these are limited to frame stacking and the LSTM. To date, no popular RL library provides a diverse selection of memory models. Of the few existing POMDP benchmarks, even fewer are comprehensive and diverse. As a consequence, there are no large-scale studies comparing memory models in RL. We fill these three gaps with POPGym.
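To make the distinction concrete, consider masking the velocity terms of CartPole's observation: a single observation no longer determines the Markov state, so a memoryless policy cannot act optimally. The sketch below is purely illustrative and is not part of POPGym's API; the wrapper name and index choices are our own.

```python
import gymnasium as gym
import numpy as np


class MaskVelocityWrapper(gym.ObservationWrapper):
    """Illustrative wrapper: hide CartPole's velocity terms, turning the
    MDP into a POMDP. Without memory, the agent cannot tell which way
    the pole is moving from a single observation."""

    def __init__(self, env):
        super().__init__(env)
        high = env.observation_space.high[[0, 2]]  # keep position and angle
        self.observation_space = gym.spaces.Box(-high, high, dtype=np.float32)

    def observation(self, obs):
        # Drop cart velocity (index 1) and pole angular velocity (index 3)
        return obs[[0, 2]].astype(np.float32)


env = MaskVelocityWrapper(gym.make("CartPole-v1"))
obs, info = env.reset(seed=0)  # obs now contains only position and angle
```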

1.1. CONTRIBUTIONS

POPGym is a collection of 15 partially observable gym environments (Figure 1) and 13 memory baselines. All environments come with at least three difficulty settings and randomly generate levels to prevent overfitting. The POPGym environments use low-dimensional observations, making them fast and memory efficient. Many of our baseline models converge in under two hours of training on a single consumer-grade GPU (Table 1, Figure 2). The POPGym memory baselines utilize a simple API built on top of the popular RLlib library (Liang et al., 2018), seamlessly integrating memory models with an assortment of RL algorithms, sampling and exploration strategies, logging frameworks, and distributed training methodologies (a brief usage sketch follows the list below). Utilizing POPGym and its memory baselines, we execute a large-scale evaluation, analyzing the capabilities of memory models on a wide range of tasks. To summarize, we contribute:

1. A comprehensive collection of diverse POMDP tasks.
2. The largest collection of memory baseline implementations in an RL library.
3. A large-scale, principled comparison across memory models.
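Since POPGym environments follow the standard Gym interface, interacting with one looks like any other Gym environment. The snippet below is a minimal sketch, not authoritative documentation: the specific environment ID string and the assumption that importing popgym registers the environments are ours and may differ from the library's actual registry.

```python
import gymnasium as gym
import popgym  # noqa: F401  -- we assume importing registers the environments

# Hypothetical ID; we assume difficulty is encoded in the environment name.
env = gym.make("popgym-RepeatPreviousEasy-v0")

obs, info = env.reset(seed=0)
terminated = truncated = False
episode_return = 0.0
while not (terminated or truncated):
    action = env.action_space.sample()  # random policy, for illustration only
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
print(f"episode return: {episode_return}")
```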

2. RELATED WORK

There are many existing RL benchmarks, which we categorize as fully (or near-fully) observable and partially observable. In near-fully observable environments, large portions of the Markov state are visible in an observation, though some information may be missing. We limit our literature review to comprehensive benchmarks (those that contain a wide set of tasks), as environment diversity is essential for the accurate evaluation of RL agents (Cobbe et al., 2020).

2.1. FULLY AND NEAR-FULLY OBSERVABLE BENCHMARKS

The Arcade Learning Environment (ALE) (Bellemare et al., 2013) wraps Atari 2600 ROMs in a Python interface. Most of the Atari games, such as Space Invaders or Missile Command, are fully observable (Cobbe et al., 2020). Some games, like Asteroids, require velocity observations, but models can recover full observability by stacking four consecutive observations (Mnih et al., 2015), an approach that does not scale to longer timespans. Even seemingly partially observable multi-room games like Montezuma's Revenge are made near-fully observable by displaying the player's score and inventory (Burda et al., 2022). OpenAI Gym (Brockman et al., 2016) came after ALE, implementing classic fully observable RL benchmarks like CartPole and MountainCar. The Gym API has since found use in many other environments, including our proposed benchmark.
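To illustrate why frame stacking recovers short-horizon information but scales poorly, the minimal sketch below (our own, not taken from ALE or POPGym) maintains a rolling buffer of the last k observations; the memory cost grows linearly in k, which becomes prohibitive for long timespans.

```python
from collections import deque

import numpy as np


class FrameStack:
    """Minimal frame-stacking sketch: expose the last k observations at once,
    so quantities like velocity can be inferred from finite differences."""

    def __init__(self, k: int):
        self.k = k
        self.frames = deque(maxlen=k)  # oldest frame is evicted automatically

    def reset(self, obs: np.ndarray) -> np.ndarray:
        # Pad the buffer with copies of the episode's first observation
        self.frames.extend([obs] * self.k)
        return np.stack(self.frames)

    def step(self, obs: np.ndarray) -> np.ndarray:
        self.frames.append(obs)
        return np.stack(self.frames)


# Usage: stack k=4 frames, as in Mnih et al. (2015)
stack = FrameStack(k=4)
stacked = stack.reset(np.zeros((84, 84)))  # shape (4, 84, 84)
stacked = stack.step(np.ones((84, 84)))    # newest frame replaces the oldest
```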



¹ Strictly speaking, the agent's actions are also required to guarantee convergence. We consider the previous action to be part of the current observation.



Figure 1: Renders from select POPGym environments.

Availability: POPGym is available at https://github.com/proroklab/popgym.

