DIAGNOSING AND EXPLOITING THE COMPUTATIONAL DEMANDS OF VIDEO GAMES FOR DEEP REINFORCEMENT LEARNING

Abstract

Humans learn by interacting with their environments and perceiving the outcomes of their actions. A landmark in artificial intelligence has been the development of deep reinforcement learning (dRL) algorithms capable of doing the same in video games, on par with or better than humans. However, it remains unclear whether the successes of dRL models reflect advances in visual representation learning, the effectiveness of reinforcement learning algorithms at discovering better policies, or both. To address this question, we introduce the Learning Challenge Diagnosticator (LCD), a tool that separately measures the perceptual and reinforcement learning demands of a task. We use LCD to discover a novel taxonomy of challenges in the Procgen benchmark, and demonstrate that its predictions are highly reliable and can guide algorithmic development. More broadly, LCD reveals multiple failure cases that can occur when optimizing dRL algorithms over entire video game benchmarks like Procgen, and provides a pathway towards more efficient progress.

1. INTRODUCTION

Gibson famously argued that "The function of vision is not to solve the inverse problem and reconstruct a veridical description of the physical world. [... It] is to keep perceivers in contact with behaviorally relevant properties of the world they inhabit" (reviewed in Warren, 2021). The field of deep reinforcement learning (dRL) has followed Gibson's tenet since the seminal introduction of deep Q-networks (DQN) (Mnih et al., 2015). DQNs rely on reward feedback to train their policies and perceptual systems simultaneously, learning to play games tabula rasa. This end-to-end approach of training on individual environments and tasks has supported steady progress in the field of dRL, and newer reinforcement learning algorithms have yielded agents that achieve human or super-human performance in a variety of challenges, from Chess to Go and from Atari games to Starcraft (Mnih et al., 2015; Silver et al., 2017; 2018; Vinyals et al., 2019). But Gibson also argued that the ecological niche of animals allows them to exploit task-agnostic mechanisms to simplify the perceptual or behavioral demands of important tasks, such as how humans rely on optic flow for navigation (Warren, 2021). In the decades since Gibson's writings, it has been found that humans can efficiently find or learn bespoke perceptual features that aid performance on a single task (Li et al., 2004; Scott et al., 2007; Roelfsema et al., 2010; Emberson, 2017), or they can exploit previously learned generalist representations and task abstractions that are useful across multiple tasks and environments (Wiesel & Hubel, 1963; Watanabe et al., 2001; Emberson, 2017; Lehnert et al., 2020; O'Reilly, 2001).
While there have been attempts at building similarly flexible dRL agents through meta-reinforcement learning (Frans et al., 2018; Xu et al., 2018; 2020; Houthooft et al., 2018; Gupta et al., 2018; Chelu et al., 2020; Pong et al., 2021), these approaches ignore the complexities of perceptual learning and carry large computational burdens that limit them to simplistic scenarios. There is a pressing need for approaches to training dRL agents that can meet the computational demands of a wide variety of environments and tasks. One way to build generalist agents is to first reliably diagnose where the computational challenges of a given environment and task lie and adjust the agent to those demands. Is the perceptual challenge onerous? Is the reward signal for credit assignment especially sparse? Even partial answers to these questions are instructive for improving an agent, for instance, by pre-determining the extent to which
it relies on feedback from the world to tune its policy and perception versus drawing from previously learned representations and task abstractions. The introduction of diverse video game challenges for dRL, such as the Procgen Benchmark (Cobbe et al., 2020), can serve as a starting point for this investigation. For example, take the game "Plunder" from Procgen (Figure 1a). Plunder has simple gameplay rules but poses a visual challenge: an agent is asked to shoot all objects that look like a provided cue. The difficulty here lies in assessing whether each object in the environment is the same as or different from the cue, a visual routine that is difficult for neural networks to learn (Vaishnav et al., 2022; Kim et al., 2018). For this reason, an agent that can draw from prior experience in learning relevant perceptual routines may perform better than one with a perceptual system tuned for this specific task from scratch. In contrast, the objects and environments of a game like "Leaper"

