DEEP LEARNING OF INTRINSICALLY MOTIVATED OPTIONS IN THE ARCADE LEARNING ENVIRONMENT

Anonymous

Abstract

In Reinforcement Learning, intrinsic motivation generates directed behaviors through a wide range of reward-generating methods. Depending on the task and environment, these rewards can be useful and can complement each other, but they can also break down entirely, as seen with the noisy-TV problem for curiosity. We therefore argue that scalability and robustness, among others, are key desirable properties of a method for incorporating intrinsic rewards, properties that a simple weighted sum of rewards lacks. In a tabular setting, Explore Options let the agent call an intrinsically motivated policy in order to learn from its trajectories. We introduce Deep Explore Options, which revise Explore Options within the Deep Reinforcement Learning paradigm to tackle complex visual problems. Deep Explore Options can naturally learn from several unrelated intrinsic rewards, ignore harmful intrinsic rewards, learn to balance exploration, and isolate exploitative and exploratory behaviors for independent use. We test Deep Explore Options on hard- and easy-exploration games of the Atari suite, following a benchmarking study to ensure fairness. Our empirical results show that they achieve results similar to weighted-sum baselines, while maintaining their key properties.

1. INTRODUCTION

In Reinforcement Learning (RL), an agent is sequentially given states and must perform actions in order to maximize the extrinsic rewards r^e it obtains. The agent is therefore deeply tied to the reward signal, and tends to fail when that signal is sparse or noisy. When the environment is very complex or high-dimensional, it is desirable for the agent to explore in a directed way (Thrun, 1992), i.e. to explicitly look for new knowledge and experiences. One of the most common ways to generate such task-independent, directed behaviors is through intrinsic motivation (IM), i.e. an alternative reward signal r^i designed to spur curiosity and entice exploratory behavior (Oudeyer & Kaplan, 2009; Schmidhuber, 2010). Biologically, IM refers to the natural tendency of organisms to explore.

One of the most common benchmarks for Deep RL agents has been the Arcade Learning Environment (ALE, Bellemare et al. (2013)), consisting of Atari video games. In order to solve the most challenging, so-called hard-exploration games of the domain (Bellemare et al. (2016)), state-of-the-art Deep RL methods have integrated IM into complex learning mechanisms, and finally managed to surpass human-level play in all 57 games (NGU and Agent57, Badia et al. (2020b;a)). However, these methods still fundamentally rely on a weighted sum (WS) of rewards r_t = r^e_t + βr^i_t, for a very well-chosen and complex IM reward. So while the types of behaviors that we can extract with IM keep expanding, we still ultimately rely on a single signal to help exploitation. Instead, we might want to benefit from several different and complementary intrinsic signals (Matusch et al., 2020). In this paper, we refer to this challenge as IM Incorporation (IM-Inc). We identify several key desirable features of IM-Inc methods, including scalability, robustness and generality. In a tabular setting, Explore Options (EO, Bagot et al. (2020)) have been proposed as an alternative to a weighted sum of rewards.
The agent is divided into the Exploiter, trained exclusively with the extrinsic reward, and the Explorers, trained exclusively with the intrinsic rewards. The Exploiter can call any Explorer through options (Sutton et al., 1999), i.e. additional actions, to explore for a fixed amount of time. However, because of their tabular nature, Explore Options are inadequate for function approximation; yet function approximation is crucial to allow the agent to generalize the option over states and effectively learn to balance exploration. We revise Explore Options into Deep Explore Options (DeepEOs), a new method for combining intrinsic and extrinsic reward signals in Deep RL. We introduce several key changes to the usual IM approaches, and showcase their performance in the ALE. To provide fair and controlled comparisons, we match the algorithm and hyperparameters used in a benchmarking study of IM (the benchmark, Taiga et al. (2019)). The contribution of our work is fourfold:

• We revise Explore Options within Deep RL to propose Deep Explore Options as a strong alternative to a weighted sum when using intrinsic rewards, extending the benchmark.
• We empirically show that, unlike a weighted sum of rewards, DeepEOs can learn from several IM rewards at once, and ignore harmful signals.
• We empirically show that DeepEOs can extract the exploiting behavior, while learning meaningful and potentially transferable exploratory behaviors.
• We provide a study of methods for combining intrinsic and extrinsic rewards. We propose several key desirable properties of such methods, and place several existing works within this framework, including DeepEOs.

In Section 2, we provide the background needed to build Deep Explore Options. Next, we introduce our method and discuss its elements in Section 3. In Section 4, we provide experiments in MiniGrid to build intuition, then in Atari following the benchmark. We review existing work in the field in Section 5.
Finally, in Section 6, we provide a study of methods to combine intrinsic and extrinsic rewards.
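For reference, the weighted-sum baseline discussed above, r_t = r^e_t + βr^i_t, can be sketched in a few lines. This is a minimal illustration; the function and parameter names are ours, not from the benchmark's implementation:

```python
def weighted_sum_reward(r_extrinsic, r_intrinsic, beta=0.1):
    """Combine extrinsic and intrinsic rewards as r_t = r^e_t + beta * r^i_t.

    A single scalar signal is produced, so the agent cannot tell which
    part of the reward came from the task and which from the exploration
    bonus -- the limitation that motivates the alternatives studied here.
    """
    return r_extrinsic + beta * r_intrinsic

# Example: sparse extrinsic reward plus a curiosity-style bonus.
r = weighted_sum_reward(r_extrinsic=1.0, r_intrinsic=0.5, beta=0.1)
```

Note that β is a fixed hyperparameter here: the mixing ratio cannot adapt per state, and a harmful intrinsic signal contaminates every reward the agent sees.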

2. BACKGROUND

2.1. REINFORCEMENT LEARNING, OPTIONS, MOTIVATION

We use the standard RL setting (Sutton & Barto (2018)), modelling the environment as a Markov Decision Process (S, A, R, p, γ), where S is the set of states, A is the set of actions, R ⊂ ℝ is the set of rewards, p : S × A × S × R → [0, 1] is the dynamics function, and γ is the discount factor. The goal of RL is to maximize the expected sum of discounted rewards from any starting state. Options (Sutton et al. (1999)) are temporally extended actions. An option is defined as a triple (I, π, β), where I ⊂ S is the option's initiation set, i.e. the states in which the option can initiate; π is the option's policy; and β : S → [0, 1] is the option's termination condition. We assume one or several intrinsic reward functions f_ir(s, a, s') = r^i that generate a reward we are interested in learning from. These can motivate directed exploration behaviors, but also any other behavior in the environment. An overview of the literature populating the field can be found in Section 5.
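The option triple (I, π, β) can be made concrete with a minimal sketch. The class and field names below are ours, chosen for illustration only:

```python
from dataclasses import dataclass
from typing import Any, Callable

State = Any
Action = Any

@dataclass
class Option:
    """An option (I, pi, beta) in the sense of Sutton et al. (1999)."""
    initiation_set: Callable[[State], bool]  # I: can the option start in s?
    policy: Callable[[State], Action]        # pi: action taken while active
    termination: Callable[[State], float]    # beta: probability of stopping in s

# A toy option that can start anywhere, always picks action 0,
# and terminates with probability 1 in every state.
always = Option(
    initiation_set=lambda s: True,
    policy=lambda s: 0,
    termination=lambda s: 1.0,
)
```

Representing I, π and β as plain callables keeps the definition agnostic to whether they are tabular lookups or neural networks, which matters for the function-approximation discussion below.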

2.2. EXPLORE OPTIONS

Explore Options (EO, Bagot et al. (2020)) were introduced as an alternative to a weighted sum of rewards. The method consists in decoupling the agent into an Explorer, trained with the intrinsic reward r^i, and an Exploiter, trained with the extrinsic reward r^e. Switching from Exploiter to Explorer is done through the Explore Option, which the Exploiter can use at any time to let the Explorer act for a fixed number of steps c_switch. Within the options framework, the j-th Explore Option o_j is therefore defined as ⟨S, π_j, β_{c_switch}⟩, where the initiation set S is the entire state space, the option policy π_j is the Explorer policy trained with intrinsic reward r^{i,j}, and the termination function β_{c_switch} deterministically interrupts the option call after c_switch steps. Explore Options have so far only been introduced in a tabular setting. However, by design they only make full sense in a function approximation setting: learning the option requires generalization in order to call it in states where little is known, which is precisely where directed exploration is needed. Function approximation also allows for parameter sharing, and potentially generalization across tasks.
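One possible reading of this control flow can be sketched as follows. This is a simplified illustration under our own naming conventions (`exploiter_policy`, `explorer_policies`, and the assumed `env.step` returning a `(state, done)` pair are ours; `c_switch` follows the paper's symbol):

```python
def act_with_explore_options(state, env, exploiter_policy, explorer_policies,
                             n_primitive_actions, c_switch=10):
    """One Exploiter decision: either a primitive action or an Explore
    Option, which hands control to Explorer j for up to c_switch steps.

    exploiter_policy(state) returns an index in
    [0, n_primitive_actions + len(explorer_policies)); indices beyond
    the primitive actions select the corresponding Explore Option.
    """
    choice = exploiter_policy(state)
    if choice < n_primitive_actions:
        return [env.step(choice)]  # ordinary primitive action
    # Explore Option j: the termination function beta_{c_switch}
    # deterministically interrupts the call after c_switch steps.
    j = choice - n_primitive_actions
    explorer = explorer_policies[j]
    transitions = []
    for _ in range(c_switch):
        state, done = env.step(explorer(state))
        transitions.append((state, done))
        if done:
            break
    return transitions
```

The returned transitions can then be routed to the appropriate learners: the Exploiter learns from the extrinsic rewards along the whole trajectory, while each Explorer learns from its own intrinsic reward.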

3. DEEP EXPLORE OPTIONS

Fig. 1 gives an overview of the general Deep Explore Options framework, which we describe in more detail in the following subsections.

