DEEP LEARNING OF INTRINSICALLY MOTIVATED OPTIONS IN THE ARCADE LEARNING ENVIRONMENT

Anonymous

Abstract

In Reinforcement Learning, Intrinsic Motivation generates directed behaviors through a wide range of reward-generating methods. Depending on the task and environment, these rewards can be useful and may complement each other, but they can also break down entirely, as seen with the noisy-TV problem for curiosity. We therefore argue that scalability and robustness, among others, are key desirable properties of a method for incorporating intrinsic rewards, properties that a simple weighted sum of rewards lacks. In a tabular setting, Explore Options let the agent call an intrinsically motivated policy in order to learn from its trajectories. We introduce Deep Explore Options, which revise Explore Options within the Deep Reinforcement Learning paradigm to tackle complex visual problems. Deep Explore Options can naturally learn from several unrelated intrinsic rewards, ignore harmful intrinsic rewards, learn to balance exploration, and isolate exploitative and exploratory behaviors for independent use. We test Deep Explore Options on hard- and easy-exploration games of the Atari suite, following a benchmarking study to ensure fairness. Our empirical results show that they achieve results similar to those of weighted-sum baselines while retaining their key properties.

1. INTRODUCTION

In Reinforcement Learning (RL), an agent is sequentially given states and must perform actions in order to maximize the extrinsic rewards r^e it obtains. The agent is therefore deeply tied to the reward signal, and tends to fail when that signal is sparse or noisy. When the environment is very complex or high-dimensional, it is desirable for the agent to explore in a directed way (Thrun, 1992), i.e. to explicitly look for new knowledge and experiences. One of the most common ways to generate such task-independent, directed behaviors is through intrinsic motivation (IM), i.e. an alternative reward signal r^i that spurs curiosity and entices exploratory behavior (Oudeyer & Kaplan, 2009; Schmidhuber, 2010). Biologically, IM refers to the natural tendency of organisms to explore. One of the most common benchmarks for Deep RL agents has been the Arcade Learning Environment (ALE, Bellemare et al. (2013)), consisting of Atari video games. In order to solve the most challenging, so-called hard-exploration games of the domain (Bellemare et al. (2016)), state-of-the-art Deep RL methods have integrated IM into complex learning mechanisms and finally managed to surpass human-level play in all 57 games (NGU and Agent57, Badia et al. (2020b;a)). However, these methods still fundamentally rely on a Weighted Sum (WS) of rewards r_t = r^e_t + βr^i_t, for a single, very well-chosen and complex IM reward. So while the range of behaviors that we can extract with IM keeps expanding, we still ultimately rely on a single signal to help exploitation. Instead, we might want to benefit from several different and complementary intrinsic signals (Matusch et al., 2020). In this paper, we refer to this challenge as IM Incorporation (IM-Inc). We identify several key desirable features of IM-Inc methods, including scalability, robustness and generality. In a tabular setting, Explore Options (EO, Bagot et al. (2020)) have been proposed as an alternative to a weighted sum of rewards.
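To make the weighted-sum baseline concrete, the following is a minimal sketch (not any particular published implementation) of how the combined signal r_t = r^e_t + βr^i_t is formed; the function name and the sample values are illustrative assumptions.

```python
# Minimal sketch of the Weighted Sum (WS) incorporation baseline:
# the agent optimizes a single scalar combining both reward streams.
def weighted_sum_reward(r_e, r_i, beta=0.5):
    """Return r_t = r^e_t + beta * r^i_t."""
    return r_e + beta * r_i

# With a fixed beta, a large (possibly harmful) intrinsic signal
# dominates the combined reward regardless of the extrinsic one:
print(weighted_sum_reward(1.0, 10.0, beta=0.3))  # 4.0
```

This illustrates the robustness concern: whenever r^i is noisy or adversarial (the noisy-TV case), the single summed signal passes that noise straight into the objective the agent maximizes.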
The agent is divided into the Exploiter, trained exclusively with the extrinsic reward, and the Explorers, trained exclusively with the intrinsic rewards. The Exploiter can call any Explorer through options (Sutton et al., 1999), i.e. additional actions, to explore for a fixed amount of time. However, because of their tabular nature, Explore Options are inadequate for function approximation, which is a crucial element in allowing the agent to generalize the option over states and effectively learn to balance exploration. We revise Explore Options into Deep Explore
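The control flow described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the explorer names, the number of primitive actions, and the option duration are all assumed for the example.

```python
# Sketch of the Explore Options action space: the Exploiter's action
# space is augmented with one option per Explorer. Choosing an option
# hands control to that intrinsically motivated policy for a fixed
# number of steps.
N_PRIMITIVE = 4                       # primitive environment actions (assumed)
EXPLORERS = ["count", "curiosity"]    # hypothetical intrinsic policies
OPTION_LEN = 10                       # fixed option duration (assumed)

def dispatch(exploiter_action):
    """Map an Exploiter choice to (controlling policy, steps to run)."""
    if exploiter_action < N_PRIMITIVE:
        # Primitive action: the Exploiter acts itself for one step.
        return ("exploit", 1)
    # Option: the selected Explorer controls the agent for OPTION_LEN steps.
    explorer = EXPLORERS[exploiter_action - N_PRIMITIVE]
    return (explorer, OPTION_LEN)

print(dispatch(2))  # ('exploit', 1)
print(dispatch(5))  # ('curiosity', 10)
```

Because the Exploiter is trained only on extrinsic reward, it can learn when calling each Explorer is worthwhile, and simply stop selecting an option whose intrinsic signal turns out to be harmful.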

