EVALUATING LONG-TERM MEMORY IN 3D MAZES

Abstract

Intelligent agents need to remember salient information to reason in partially observed environments. For example, agents with a first-person view should remember the positions of relevant objects even if they go out of view. Similarly, to effectively navigate through rooms, agents need to remember the floor plan of how rooms are connected. However, most benchmark tasks in reinforcement learning do not test long-term memory in agents, slowing down progress in this important research direction. In this paper, we introduce the Memory Maze, a 3D domain of randomized mazes specifically designed for evaluating long-term memory in agents. Unlike existing benchmarks, Memory Maze measures long-term memory separately from confounding agent abilities and requires the agent to localize itself by integrating information over time. With Memory Maze, we propose an online reinforcement learning benchmark, a diverse offline dataset, and an offline probing evaluation. Recording a human player establishes a strong baseline and verifies the need to build up and retain memories, which is reflected in their gradually increasing rewards within each episode. We find that current algorithms benefit from training with truncated backpropagation through time and succeed on small mazes, but fall short of human performance on the large mazes, leaving room for future algorithmic designs to be evaluated on the Memory Maze.

1. INTRODUCTION

Deep reinforcement learning (RL) has made tremendous progress in recent years, outperforming humans on Atari games (Mnih et al., 2015; Badia et al., 2020), board games (Silver et al., 2016; Schrittwieser et al., 2019), and driving advances in robot learning (Akkaya et al., 2019; Wu et al., 2022). Much of this progress has been driven by the availability of challenging benchmarks that are easy to use and allow for standardized comparison (Bellemare et al., 2013; Tassa et al., 2018; Cobbe et al., 2020). What is more, the RL algorithms developed on these benchmarks are often general enough to solve completely unrelated challenges, such as finetuning large language models from human preferences (Ziegler et al., 2019), optimizing video compression parameters (Mandhane et al., 2022), or controlling the plasma of nuclear fusion reactors with promising results (Degrave et al., 2022).

Figure 1: The first 150 time steps of an episode in the Memory Maze 9x9 environment (top row: agent inputs; bottom row: underlying trajectory at t = 0, 30, 60, 90, 120, 150). The bottom row shows the top-down view of a randomly generated maze with 3 colored objects. The agent only observes the first-person view (top row), which includes a prompt for the next object to find as a border of the corresponding color. The agent receives +1 reward when it reaches the object of the prompted color. During the episode, the agent has to visit the same objects multiple times, testing its ability to memorize their positions, the way the rooms are connected, and its own location.

Despite the progress in RL, many current algorithms are still limited to environments that are mostly fully observed and struggle in partially observed scenarios where the agent needs to integrate and retain information over many time steps (Wayne et al., 2018). Yet the ability to remember over long time horizons is a central aspect of human intelligence and a major limitation on the applicability of current algorithms. While many existing benchmarks are partially observable to some extent, memory is rarely the limiting factor of agent performance (Oh et al., 2015; Cobbe et al., 2020; Beattie et al., 2016; Hafner, 2021). Instead, these benchmarks evaluate a wide range of skills at once, making it challenging to measure improvements in an agent's ability to remember. Ideally, we would like a memory benchmark to fulfill the following requirements: (1) isolate the challenge of long-term memory from confounding challenges such as exploration and credit assignment, so that performance improvements can be attributed to better memory.
(2) challenge an average human player while remaining solvable for them, giving an estimate of how far current algorithms are from human memory abilities. (3) require remembering multiple pieces of information rather than a single bit or position, such as whether to go left or right at the end of a long corridor. (4) be open source and easy to use.

We introduce the Memory Maze, a benchmark platform for evaluating long-term memory in RL agents and sequence models. The Memory Maze features randomized 3D mazes in which the agent is tasked with repeatedly navigating to one of multiple objects. To find the objects quickly, the agent has to remember their locations, the wall layout of the maze, and its own location. The contributions of this paper are summarized as follows:

• Environment: We introduce the Memory Maze environment, which is specifically designed to measure memory in isolation from other challenges and overcomes the limitations of existing benchmarks. We open source the environment and make it easy to install and use.

• Human Performance: We record the performance of a human player and find that the benchmark is challenging but solvable for them. This offers an estimate of how far current algorithms are from the memory ability of a human.

• Memory Challenge: We confirm that memory is indeed the leading challenge in this benchmark by observing that the rewards of the human player increase within each episode, and by finding strong improvements when training agents with truncated backpropagation through time.

• Offline Dataset: We collect a diverse offline dataset that includes semantic information, such as the top-down view, object positions, and the wall layout. This enables offline RL as well as evaluating representations by probing for both task-specific and task-agnostic information.
• Baseline Scores: We benchmark a strong model-free agent and a strong model-based agent on the four sizes of the Memory Maze and find that they make progress on the smaller mazes but lag far behind human performance on the larger mazes, showing that the benchmark is of appropriate difficulty.
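Truncated backpropagation through time, mentioned above as a key ingredient for agent performance, splits a long episode into fixed-length windows: gradients flow only within each window, while the recurrent state is carried across window boundaries so that memories can persist far beyond the truncation length. The following is a minimal illustrative sketch, not the paper's training setup: it trains a scalar linear RNN with hand-derived gradients on a toy discounted-sum task, where the carried-in state `h0` of each window is treated as a constant (the truncation point).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task that requires memory: the target at step t is the discounted
# sum of past inputs, targ_t = x_t + 0.5 * targ_{t-1}.
T = 512
x = rng.normal(size=T)
targ = np.empty(T)
acc = 0.0
for t in range(T):
    acc = x[t] + 0.5 * acc
    targ[t] = acc

# Scalar linear RNN: h_t = a*h_{t-1} + b*x_t, prediction y_t = c*h_t.
a, b, c = 0.3, 0.5, 0.5
lr, K = 1e-3, 32  # K is the truncation window length

def run_window(h0, xs, ts, a, b, c):
    """Forward and backward over one window; gradients stop at h0."""
    hs, h, loss = [], h0, 0.0
    for xt, tt in zip(xs, ts):
        h = a * h + b * xt
        hs.append(h)
        loss += (c * h - tt) ** 2
    da = db = dc = dh = 0.0
    for t in reversed(range(len(xs))):
        dy = 2.0 * (c * hs[t] - ts[t])
        dc += dy * hs[t]
        dh += dy * c
        hprev = hs[t - 1] if t > 0 else h0  # h0 treated as constant: truncation
        da += dh * hprev
        db += dh * xs[t]
        dh = dh * a  # gradient flows backward within the window only
    return hs[-1], loss, da, db, dc

def epoch(a, b, c):
    h, total = 0.0, 0.0
    for s in range(0, T, K):
        # State h is carried across windows, but its gradient is cut.
        h, loss, da, db, dc = run_window(h, x[s:s+K], targ[s:s+K], a, b, c)
        a -= lr * da; b -= lr * db; c -= lr * dc
        total += loss
    return a, b, c, total

a, b, c, first = epoch(a, b, c)
for _ in range(200):
    a, b, c, last = epoch(a, b, c)
print(first, last)  # training loss drops as the RNN learns the recurrence
```

The same pattern applies to large recurrent agents: in an autodiff framework the line `hprev = ... else h0` corresponds to detaching the carried state from the graph at each window boundary.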

2. RELATED WORK

Several benchmarks for measuring memory abilities have been proposed. This section summarizes important examples and discusses the limitations that motivated the design of the Memory Maze.

DMLab (Beattie et al., 2016) features various tasks, some of which require memory among other challenges. Parisotto et al. (2020) identified a subset of 8 DMLab tasks relating to memory, but these tasks have largely been solved by R2D2 and IMPALA (see Figure 11 in Kapturowski et al. (2018)). Moreover, DMLab features a skyline in the background that makes it trivial for the agent to localize itself, so the agent does not need to remember its location in the maze. Wayne et al. (2018) used a larger battery of tasks, but only a subset of those was included in the public release of DMLab. Gregor et al. (2019) studied the memory abilities of agents by probing representations and compared a range of agent objectives and memory mechanisms, an approach that we build upon in this paper. However, their datasets and implementations were not released, making it difficult for the research community to build upon the work. A standardized probe benchmark is available for Atari (Anand et al., 2019), but those tasks require almost no memory.
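The probing approach discussed above fits a simple readout on frozen agent representations and measures how well task-relevant information can be decoded from them. The sketch below illustrates the idea with entirely synthetic data standing in for representations and object positions; the dimensions, noise level, and the least-squares probe are illustrative assumptions, not the paper's actual dataset or protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: pretend a 32-dim recurrent agent state linearly
# encodes a 2D object position plus noise. The probe is a linear readout
# fit on the frozen representations.
N, D = 2000, 32
pos = rng.uniform(-1, 1, size=(N, 2))              # latent "object positions"
mix = rng.normal(size=(2, D))                      # how positions enter the state
reps = pos @ mix + 0.1 * rng.normal(size=(N, D))   # frozen representations

train, test = slice(0, 1500), slice(1500, N)
X = np.hstack([reps, np.ones((N, 1))])             # add a bias column

# Least-squares linear probe: predict positions from representations.
W, *_ = np.linalg.lstsq(X[train], pos[train], rcond=None)
pred = X[test] @ W

# Held-out R^2 per coordinate; high R^2 means the information is
# linearly decodable from the representation.
ss_res = ((pos[test] - pred) ** 2).sum(axis=0)
ss_tot = ((pos[test] - pos[test].mean(axis=0)) ** 2).sum(axis=0)
r2 = 1 - ss_res / ss_tot
print(r2)
```

Keeping the probe linear and the representations frozen is what lets decoding accuracy be attributed to the representation itself rather than to the capacity of the readout.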

Availability: https://github.com/jurgisp/

