DIVERSE EXPLORATION VIA INFOMAX OPTIONS

Abstract

In this paper, we study the problem of autonomously discovering temporally abstracted actions, or options, for exploration in reinforcement learning. For learning diverse options suitable for exploration, we introduce the infomax termination objective defined as the mutual information between options and their corresponding state transitions. We derive a scalable optimization scheme for maximizing this objective via the termination condition of options, yielding the InfoMax Option Critic (IMOC) algorithm. Through illustrative experiments, we empirically show that IMOC learns diverse options and utilizes them for exploration. Moreover, we show that IMOC scales well to continuous control tasks.

1. INTRODUCTION

Abstracting a course of action as a higher-level action, or an option (Sutton et al., 1999), is a key ability for reinforcement learning (RL) agents in several aspects, including exploration. In RL problems, an agent learns to approximate an optimal policy only from experience, given no prior knowledge. This leads to the necessity of exploration: an agent needs to explore poorly known states to collect environmental information, sometimes sacrificing immediate rewards. For statistical efficiency, it is important to explore the state space in a deep and directed manner, rather than by taking uniformly random actions (Osband et al., 2019). Options can represent such directed behaviors by capturing long state jumps from their starting regions to their terminating regions. It has been shown that well-defined options can facilitate exploration by exploiting an environmental structure (Barto et al., 2013) or, more generally, by reducing the number of decision steps (Fruit and Lazaric, 2017). A key requirement for such explorative options is diversity. If all options have the same terminating region, they will never encourage exploration. Instead, options should lead to a variety of regions to encourage exploration.

However, automatically discovering diverse options in a scalable, online manner is challenging due to two difficulties: generalization and data limitation. Generalization with function approximation (Sutton, 1995) is important for scaling up RL methods to large or continuous domains. However, many existing option discovery methods for exploration are graph-based (e.g., Machado et al. (2017)) and incompatible with function approximation, with the exception of the method by Jinnai et al. (2020). Discovering options online, in parallel with policies, requires us to work with limited data sampled from the environment and to train the model that evaluates diversity in a data-efficient manner.
To address these difficulties, we introduce the infomax termination objective, defined as the mutual information (MI) between options and their corresponding state transitions. This formulation reflects a simple inductive bias: to encourage exploration, options should terminate in a variety of regions for each starting region. Thanks to its information-theoretic formulation, this objective is compatible with function approximation and scales up to continuous domains. A key technical contribution of this paper is the optimization scheme for maximizing this objective. Specifically, we employ a simple classification model over options as a critic for termination conditions, which makes our method data-efficient and tractable in many domains. The paper is organized as follows. After introducing background and notation, we present the infomax termination objective and derive a practical optimization scheme using the termination gradient theorem (Harutyunyan et al., 2019). We then implement the infomax objective on the option-critic architecture (OC) (Bacon et al., 2017) with algorithmic modifications, yielding the InfoMax Option Critic (IMOC) algorithm. Empirically, we show that (i) IMOC improves exploration in structured environments, (ii) IMOC improves exploration in lifelong learning, and (iii) IMOC is scalable to continuous control tasks.
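As an illustrative sketch (the notation here is assumed for exposition and is not taken from this excerpt), writing $O$ for an option and $(S, S')$ for the state transition it induces from its starting state to its terminating state, the infomax termination objective can be expressed as maximizing the mutual information between the option and its transition. A learned classifier over options then gives a tractable handle on the otherwise intractable conditional entropy via the standard variational lower bound:

```latex
% Sketch of the infomax termination objective (notation assumed):
%   O       : option
%   (S, S') : state transition from the option's start to its termination
%   q_phi   : a learned classifier predicting the option from the transition
\begin{align}
  I(O; S, S') &= H(O) - H(O \mid S, S') \\
              &\geq H(O) + \mathbb{E}_{(S, S'), O}
                 \left[ \log q_\phi(O \mid S, S') \right],
\end{align}
% The inequality is the standard variational lower bound on MI: any
% classifier q_phi lower-bounds -H(O | S, S'), with equality when q_phi
% matches the true posterior over options given the transition.
```

This matches the role described in the text of a simple classification model over options acting as a critic for termination conditions: improving the classifier tightens the bound, and the termination conditions are then adjusted to increase the bound itself.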

