MEMORY GYM: PARTIALLY OBSERVABLE CHALLENGES TO MEMORY-BASED AGENTS

Abstract

Memory Gym is a novel benchmark that challenges Deep Reinforcement Learning agents to memorize events across long sequences, be robust to noise, and generalize. It consists of the partially observable 2D discrete-control environments Mortar Mayhem, Mystery Path, and Searing Spotlights. These environments are believed to be unsolvable by memory-less agents because they feature strong dependencies on memory and frequent agent-memory interactions. Empirical results based on Proximal Policy Optimization (PPO) with a Gated Recurrent Unit (GRU) underline the strong memory dependency of the contributed environments. The hardness of these environments can be smoothly scaled, and a range of difficulty levels (some of them as yet unsolved) emerges for Mortar Mayhem and Mystery Path. Surprisingly, Searing Spotlights poses a tremendous challenge to GRU-PPO, which remains an open puzzle. Even though the randomly moving spotlights reveal parts of the environment's ground truth, environmental ablations hint that the spotlights severely perturb agents that use recurrent model architectures as their memory.

1. INTRODUCTION

Memory is a vital mechanism of intelligent living beings for making favorable sequential decisions under imperfect information and uncertainty. One's immediate sensory perception may not suffice if information from past events cannot be recalled, and skills such as reasoning, imagination, planning, and learning may become unattainable. When developing autonomously learning decision-making agents, the agent's memory mechanism is required to maintain a representation of former observations to ground its next decision. Adding memory mechanisms such as a recurrent neural network (Werbos, 1990) or a transformer (Vaswani et al., 2017) has led to successfully learned policies in both virtual and real-world tasks. For instance, Deep Reinforcement Learning (DRL) methods master complex video games such as StarCraft II (Vinyals et al., 2019) and DotA 2 (Berner et al., 2019). Examples of successes in real-world problems are dexterous in-hand manipulation (Andrychowicz et al., 2020) and controlling tokamak plasmas (Degrave et al., 2022). In addition to leveraging memory, these tasks require vast amounts of computational resources and additional methods (e.g. domain randomization, incorporating domain knowledge, etc.), which makes them unsuitable for solely benchmarking an agent's ability to interact with its memory meaningfully.

We propose Memory Gym as a novel, open-source benchmark consisting of three unique environments: Mortar Mayhem, Mystery Path, and Searing Spotlights. These environments challenge memory-based agents to memorize events across long sequences, generalize, be robust to noise, and be sample efficient. By fulfilling the desiderata that we define in this work, we believe that Memory Gym has the potential to complement existing benchmarks and thereby accelerate the development of DRL agents that leverage memory. All three environments feature visual observations, discrete action spaces, and are notably not solvable without memory.
This allows users to verify early on whether their developed memory mechanism is working. To fit the problem of sequential decision-making, agents have to frequently leverage their memory to solve the tasks posed by Memory Gym; several related environments ask the agent only to memorize initial cues, which requires merely infrequent agent-memory interactions. Our environments are smoothly configurable. This is useful for adapting an environment's difficulty to the available resources, while easier difficulties can serve as a proof of concept; competent methods can prove themselves on new challenges or thoroughly identify their limits. All environments are procedurally generated to evaluate the agent's ability to generalize to unseen levels (or seeds), and due to the aforementioned configurable difficulties, trained agents can also be evaluated on out-of-distribution levels. Memory Gym's only significant dependencies are gym (Brockman et al., 2016) and PyGame¹, which allows it to be easily set up and executed on Linux, macOS, and Windows. All environments simulate multiple thousands of agent-environment interactions per second. This paper proceeds as follows. We first define the criteria of memory benchmarks and portray the landscape of related benchmarks. Next, Mortar Mayhem, Mystery Path, and Searing Spotlights are detailed. Afterward, we show that memory is crucial in our environments by conducting empirical experiments using a recurrent implementation (GRU-PPO) of Proximal Policy Optimization (PPO) (Schulman et al., 2017) and HELM (Paischer et al., 2022). When scaling the hardness of Mortar Mayhem and Mystery Path, a full range of difficulty levels emerges. Searing Spotlights remains unsolved because the recurrent agent is vulnerable to perturbations of the environment's core mechanic: the randomly wandering spotlights.
This observation is also apparent when training on a single Procgen BossFight (Cobbe et al., 2020) level under the same spotlight perturbations as in Searing Spotlights. Finally, this work concludes and enumerates future work.
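For readers unfamiliar with the recurrent baseline used throughout this paper, the following is a minimal, self-contained sketch of a single GRU cell update in plain Python. It illustrates only the standard GRU gating equations; the weight initialization, dimensions, and function names are illustrative and are not taken from the actual GRU-PPO implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply a matrix (list of rows) with a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x, h, p):
    """One GRU update: h' = (1 - z) * h + z * h_tilde."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]  # update gate
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]  # reset gate
    rh = [ri * hi for ri, hi in zip(r, h)]                                        # gated history
    h_tilde = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    return [(1 - zi) * hi + zi * hti for zi, hi, hti in zip(z, h, h_tilde)]

def random_params(input_size, hidden_size, rng):
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"Wz": mat(hidden_size, input_size), "Uz": mat(hidden_size, hidden_size),
            "Wr": mat(hidden_size, input_size), "Ur": mat(hidden_size, hidden_size),
            "Wh": mat(hidden_size, input_size), "Uh": mat(hidden_size, hidden_size)}

rng = random.Random(0)
params = random_params(4, 8, rng)
h = [0.0] * 8
for t in range(10):  # unroll over a short observation sequence
    x = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    h = gru_step(x, h, params)
# h is now a fixed-size summary of the whole sequence
```

In an actor-critic agent such as GRU-PPO, the hidden state h would be fed into the policy and value heads, so that decisions can depend on past observations rather than on the current observation alone.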

2. COMPARISON OF RELATED MEMORY BENCHMARKS

2.1 DESIDERATA OF MEMORY BENCHMARKS

Before detailing related benchmarks, we define the aforementioned desiderata that we believe are essential to memory benchmarks and to benchmarks in general.

Accessibility refers to how easily an environment can be set up and executed. Benchmarks shall be publicly available, adequately documented, and open source, while running on the commonly used operating systems Linux, macOS, and Windows. Linux in particular is important because many high-performance computing (HPC) facilities employ this operating system. As HPC facilities might not support virtualization, benchmarks should not be deployed solely as a docker image or similar. Finally, benchmarks shall run headless; otherwise, they potentially rely on dependencies such as xvfb or EGL, which HPC facilities may likewise not support. If relevant benchmark details, such as environment dynamics, are missing, it can be desirable to support humanly playable environments so that these can be explored in advance.

Fast simulation speeds, which achieve thousands of steps per second (i.e. FPS), make training runs more wall-time efficient, enabling larger experiments with more repetitions or faster development iterations. The benchmark's speed also depends on the time it takes to reset the environment to set up a new episode for the agent to play. To maximize FPS on a single machine, benchmarks shall be able to run multiple instances of their environments concurrently.

High diversity characterizes environments that offer a large distribution of procedurally generated levels to reasonably challenge an agent's ability to generalize (Cobbe et al., 2020). It is also desirable to implement smoothly configurable environments so that the agent can be evaluated on out-of-distribution levels.

Scalable difficulty is a property that shall make environments controllable such that their current hardness can be increased or decreased.
Easier environments have benefits: a proof-of-concept state is reached sooner while developing novel methods, and research groups can fit the difficulty to their available resources. Moreover, increasing the difficulty lets already competent methods prove themselves in new challenges to demonstrate their abilities or limits.

Strong dependency on memory refers to tasks that are only solvable if the agent can recall past information (i.e. by successfully leveraging its memory). Section 2.4 describes partially observable environments that can be solved to some extent without memory. While memory-based agents might solve those tasks more efficiently, such tasks do not guarantee that the agent's memory is working. To ensure that the utilized memory mechanism works and does not suffer from bugs, benchmarks that specifically target the agent's memory cannot omit this criterion.
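To make the "strong dependency on memory" criterion concrete, consider the following toy cue-recall task (a deliberately simplified illustration, not one of the Memory Gym environments; all class and function names are invented for this sketch): a cue is observed only at the first step, the observation is blank afterwards, and reward is given at the end only if the final action matches the cue. A memoryless policy, which maps the current observation alone to an action, cannot exceed chance, whereas carrying even a one-slot memory solves the task exactly.

```python
import random

class CueRecallEnv:
    """Minimal episodic cue-recall task: remember the cue shown at t=0."""
    def __init__(self, num_cues=4, length=8, seed=None):
        self.num_cues, self.length = num_cues, length
        self.rng = random.Random(seed)

    def reset(self):
        self.cue = self.rng.randrange(self.num_cues)
        self.t = 0
        return self.cue            # the cue is visible only at the first step

    def step(self, action):
        self.t += 1
        obs = -1                   # blank observation after the first step
        done = self.t == self.length
        reward = 1.0 if done and action == self.cue else 0.0
        return obs, reward, done

def run(env, policy, episodes=1000):
    total = 0.0
    for _ in range(episodes):
        obs, done, state = env.reset(), False, None
        while not done:
            action, state = policy(obs, state)
            obs, reward, done = env.step(action)
            total += reward
    return total / episodes

def memoryless(obs, state):
    # Sees only the current (blank) observation -> chance level at best.
    return random.randrange(4), None

def with_memory(obs, state):
    state = obs if obs >= 0 else state   # store the cue once, recall it later
    return state, state

# The memory policy earns full reward; the memoryless one stays near chance (~0.25).
scores = (run(CueRecallEnv(seed=42), with_memory),
          run(CueRecallEnv(seed=42), memoryless))
```

Memory Gym's environments extend this principle by demanding frequent agent-memory interactions throughout an episode, rather than a single recall of an initial cue.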



¹ https://www.pygame.org

² Code availability: https://github.com/MarcoMeter/drl-memory-gym/

