EVALUATING AGENTS WITHOUT REWARDS

Abstract

Reinforcement learning has enabled agents to solve challenging control tasks from raw image inputs. However, manually crafting reward functions can be time-consuming, expensive, and prone to human error. Competing objectives have been proposed for agents to learn without external supervision, such as input entropy, information gain, and empowerment. Estimating these objectives can be challenging, and it remains unclear how well they reflect task rewards or human behavior. We study these objectives across seven agents, three Atari games, and Minecraft Treechop. Retrospectively computing the objectives from the agent's lifetime of experience simplifies accurate estimation. We find that all three objectives correlate more strongly with a human behavior similarity metric than with task reward. Moreover, input entropy and information gain both correlate more strongly with human similarity than task reward does.

Table: Correlation coefficients between each metric and task reward or human similarity. The three task-agnostic metrics correlate more strongly with human similarity than with task reward. This suggests that typical RL tasks may not be a sufficient proxy for the intelligent behavior seen in humans playing the same games.

1. INTRODUCTION

Deep reinforcement learning (RL) has enabled agents to solve complex tasks directly from high-dimensional image inputs, such as locomotion (Heess et al., 2017), robotic manipulation (Akkaya et al., 2019), and game playing (Mnih et al., 2015; Silver et al., 2017). However, many of these successes are built upon rich supervision in the form of manually defined reward functions. Unfortunately, designing informative reward functions is often expensive, time-consuming, and prone to human error (Krakovna et al., 2020). Furthermore, these difficulties increase with the complexity of the task of interest. In contrast to many RL agents, natural agents generally learn without externally provided tasks, through intrinsic objectives. For example, children explore the world by crawling around and playing with objects they find. Inspired by this, the field of intrinsic motivation (Schmidhuber, 1991; Oudeyer et al., 2007) seeks mathematical objectives for RL agents that do not depend on a specific task and can be applied to any unknown environment. We study three common types of intrinsic motivation:

• Input entropy encourages encountering rare sensory inputs, as measured by a learned density model (Schmidhuber, 1990; Bellemare et al., 2016b; Pathak et al., 2017; Burda et al., 2018b).
• Information gain, or infogain for short, rewards the agent for discovering the rules of its environment (Lindley, 1956; Houthooft et al., 2016; Shyam et al., 2018; Sekar et al., 2020).
• Empowerment measures the influence the agent has over its sensory inputs or environment (Klyubin et al., 2005; Mohamed and Rezende, 2015; Karl et al., 2017).
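For orientation, these three objective families are often formalized along the following lines. This is only a sketch, and notation and exact estimators vary across the cited methods; here \(s_t\) and \(a_t\) denote the sensory input and action at time \(t\), \(p_\theta\) a learned density model, and \(w\) the parameters of a learned dynamics model:

```latex
% Input entropy: reward rare inputs under a learned density model.
r^{\mathrm{ent}}_t = -\log p_\theta(s_t)

% Information gain: reduction in uncertainty about the model
% parameters w after observing the next input (one common form).
r^{\mathrm{info}}_t = \mathrm{I}(w;\, s_{t+1} \mid s_t, a_t)

% Empowerment: mutual information between the action and the
% resulting input, maximized over action distributions \pi.
r^{\mathrm{emp}}_t = \max_{\pi} \; \mathrm{I}(a_t;\, s_{t+1} \mid s_t)
```

The per-time-step rewards above are what exploration agents typically optimize online; this paper instead estimates the corresponding quantities retrospectively from stored experience.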
Despite the empirical success of intrinsic motivation for facilitating exploration (Bellemare et al., 2016b; Burda et al., 2018b), it remains unclear which family of intrinsic objectives is best for a given scenario, for example when task rewards are sparse or unavailable, or when the goal is to behave similarly to human players. Moreover, it is not clear whether different intrinsic objectives offer similar benefits in practice or are orthogonal and should be combined. To spur progress toward a better understanding of intrinsic objectives, we empirically compare the three objective families in terms of their correlation with human behavior and with the task rewards of three Atari games and Minecraft Treechop.

The goal of this paper is to gain understanding rather than to propose a new intrinsic objective or exploration agent. We therefore do not need to estimate the intrinsic objectives while the agents are learning, which often requires complicated approximations. Instead, we train several well-known RL agents on three Atari games and Minecraft and store their lifetime datasets of experience, resulting in a total of 2.1 billion time steps and about 9 terabytes of agent experience. From the dataset of each agent, we compute human similarity, input entropy, information gain, and empowerment using simple estimators with clearly stated assumptions. We then analyze the correlations between these metrics to understand how they relate to one another and how well they reflect task reward and human similarity. The key findings of this paper are summarized as follows:

• Input entropy and information gain both correlate more strongly with human similarity than task reward does. This implies that, to measure how similar an agent's behavior is to human behavior, these objectives are better approximations than task reward.
• Simple implementations of input entropy, information gain, and empowerment correlate well with human similarity. This suggests that they can serve as task-agnostic evaluation metrics when human data and task rewards are unavailable.
• As a consequence of these two findings, the task-agnostic metrics measure a different component of agent behavior than the task rewards of the reinforcement learning environments considered in our study.
• Input entropy and information gain correlate strongly with each other, but less strongly with empowerment. This suggests that combining empowerment with either input entropy or information gain could be beneficial when designing exploration methods.

This paper is structured as follows. Section 2 describes the games and agents used for the study. Section 3 details the experimental setup and the estimators used to implement the metrics. Section 4 discusses quantitative and qualitative results. Section 5 summarizes key takeaways and recommendations.
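To make the correlation analysis concrete, the sketch below computes Pearson correlation coefficients between a task-agnostic metric and either task reward or human similarity, given one score per agent. All agent names and numbers here are hypothetical placeholders for illustration, not the paper's data:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical per-agent scores, one value per agent lifetime dataset.
agents = ["noop", "random", "ppo", "icm", "icm+ext", "rnd", "rnd+ext"]
input_entropy = [0.10, 0.40, 0.50, 0.90, 0.80, 0.95, 0.85]
task_reward   = [0.00, 0.10, 1.00, 0.30, 0.80, 0.40, 0.90]
human_sim     = [0.05, 0.30, 0.60, 0.85, 0.80, 0.90, 0.85]

# Correlate the task-agnostic metric with reward and with human similarity.
r_reward = pearson(input_entropy, task_reward)
r_human = pearson(input_entropy, human_sim)
print(f"corr(entropy, reward)    = {r_reward:.2f}")
print(f"corr(entropy, human sim) = {r_human:.2f}")
```

With these placeholder scores, the entropy metric tracks human similarity far more closely than it tracks reward, which is the shape of result the findings above describe.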

2. BACKGROUND

To validate the effectiveness of our metrics for task-agnostic evaluation across a wide spectrum of agent behavior, we retrospectively computed our metrics on the lifetime experience of well-known RL agents. We therefore first collected datasets of varied agent behavior on which to compute and evaluate our metrics.

Environments. We evaluated our agents in three Atari environments provided by the Arcade Learning Environment (Bellemare et al., 2013), namely Breakout, Seaquest, and Montezuma's Revenge, and additionally in the Minecraft Treechop environment provided by MineRL (Guss et al., 2019). Breakout and Seaquest are relatively simple reactive environments, while Montezuma's Revenge is a challenging platformer requiring long-term planning. Treechop is a 3D environment in which the agent receives reward for breaking and collecting wood blocks but has considerable freedom to explore the world. We chose these four environments because they span a range of complexity, freedom, and difficulty, as detailed in Appendix E.

Agents. The seven agent configurations represented in our dataset include three RL algorithms and two trivial agents for comparison. We selected RL agents spanning the range from purely extrinsic task reward to purely intrinsic motivation. Additionally, we included random and no-op agents, two opposite extremes of trivial behavior. Our goal was to represent a wide range of behaviors: playing to achieve a high score, playing to explore the environment, and taking actions without regard to the environment. Specifically, we used the PPO agent (Schulman et al., 2017) trained to optimize task reward, as well as the RND (Burda et al., 2018b) and ICM (Pathak et al., 2017) exploration agents, which use PPO as their optimizer and combine an intrinsic reward signal with an extrinsic reward signal that can be enabled for task-specific behavior or disabled for task-agnostic behavior. We evaluate RND and ICM in both configurations. Each agent is summarized in Appendix F.
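As an illustration of what retrospective computation over a stored lifetime dataset looks like, the sketch below estimates input entropy with a simple count-based empirical density over coarsely discretized observations. The function name, the discretization scheme, and the toy data are illustrative assumptions, not the paper's actual estimator, which is detailed in Section 3:

```python
import numpy as np

def lifetime_input_entropy(observations, bins=16):
    """Average negative log-probability (in nats) of each observation
    under an empirical density fit to the whole lifetime dataset."""
    counts, codes = {}, []
    for obs in observations:
        # Coarse discretization stands in for a learned density model.
        code = tuple((np.asarray(obs) * bins).astype(int).ravel())
        counts[code] = counts.get(code, 0) + 1
        codes.append(code)
    n = len(codes)
    return -sum(np.log(counts[c] / n) for c in codes) / n

rng = np.random.default_rng(0)
diverse = rng.random((1000, 4))    # exploring agent: varied inputs
repetitive = np.zeros((1000, 4))   # no-op agent: one repeated input
print(lifetime_input_entropy(diverse), lifetime_input_entropy(repetitive))
```

Because the metric is computed after the fact over the full dataset, there is no need for the online approximations that exploration agents use during training; the density can be fit once to all of the agent's experience.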

