DIVERSE EXPLORATION VIA INFOMAX OPTIONS

Abstract

In this paper, we study the problem of autonomously discovering temporally abstracted actions, or options, for exploration in reinforcement learning. For learning diverse options suitable for exploration, we introduce the infomax termination objective defined as the mutual information between options and their corresponding state transitions. We derive a scalable optimization scheme for maximizing this objective via the termination condition of options, yielding the InfoMax Option Critic (IMOC) algorithm. Through illustrative experiments, we empirically show that IMOC learns diverse options and utilizes them for exploration. Moreover, we show that IMOC scales well to continuous control tasks.

1. INTRODUCTION

Abstracting a course of action as a higher-level action, or an option (Sutton et al., 1999), is a key ability for reinforcement learning (RL) agents in several respects, including exploration. In RL problems, an agent learns to approximate an optimal policy only from experience, given no prior knowledge. This leads to the necessity of exploration: the agent needs to explore poorly known states to collect environmental information, sometimes sacrificing immediate rewards. For statistical efficiency, it is important to explore the state space in a deep and directed manner, rather than taking uniformly random actions (Osband et al., 2019). Options can represent such directed behaviors by capturing long state jumps from their starting regions to their terminating regions. It has been shown that well-defined options can facilitate exploration by exploiting environmental structure (Barto et al., 2013) or, more generally, by reducing the number of decision steps (Fruit and Lazaric, 2017).

A key requirement for such explorative options is diversity. If all options have the same terminating region, they will never encourage exploration; instead, options should lead to a variety of regions. However, automatically discovering diverse options in a scalable, online manner is challenging due to two difficulties: generalization and data limitation. Generalization with function approximation (Sutton, 1995) is important for scaling up RL methods to large or continuous domains. However, many existing option discovery methods for exploration are graph-based (e.g., Machado et al. (2017)) and incompatible with function approximation, with the exception of Jinnai et al. (2020). Moreover, discovering options online, in parallel with policies, requires working with limited data sampled from the environment and training the model that evaluates diversity in a data-efficient manner.
To address these difficulties, we introduce the infomax termination objective, defined as the mutual information (MI) between options and their corresponding state transitions. This formulation reflects a simple inductive bias: to encourage exploration, options should terminate in a variety of regions per starting region. Thanks to its information-theoretic formulation, this objective is compatible with function approximation and scales to continuous domains. A key technical contribution of this paper is the optimization scheme for maximizing this objective. Specifically, we employ a simple classification model over options as a critic for termination conditions, which makes our method data-efficient and tractable in many domains.

The paper is organized as follows. After introducing background and notation, we present the infomax termination objective and derive a practical optimization scheme using the termination gradient theorem (Harutyunyan et al., 2019). We then implement the infomax objective on the option-critic architecture (OC) (Bacon et al., 2017) with algorithmic modifications, yielding the InfoMax Option Critic (IMOC) algorithm. Empirically, we show that (i) IMOC improves exploration in structured environments, (ii) IMOC improves exploration in lifelong learning, (iii) IMOC scales to MuJoCo continuous control tasks, and (iv) the options learned by IMOC are diverse and meaningful. We then relate our method to other option-learning methods and to the empowerment concept (Klyubin et al., 2005), and finally give concluding remarks.
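As a concrete illustration of the objective, the MI between options and transitions admits the standard variational lower bound $I(O; X_s, X_f) \ge H(O) + \mathbb{E}[\log q(o \mid x_s, x_f)]$, where $q$ is a classifier over options. The sketch below is entirely hypothetical: it uses synthetic transition data and a hand-coded stand-in classifier (in IMOC the critic would be learned), and only shows how such an empirical bound can be computed for four well-separated options.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 options, each producing state jumps (x_f - x_s)
# in a distinct direction plus Gaussian noise.
n_options, n_samples = 4, 2000
directions = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
options = rng.integers(0, n_options, size=n_samples)
jumps = directions[options] + 0.3 * rng.normal(size=(n_samples, 2))

def classifier_probs(jumps):
    """Stand-in critic q(o | x_s, x_f): softmax over similarity to each
    option's prototype direction (the Bayes posterior for this toy data;
    in IMOC this classifier would be a learned model)."""
    logits = jumps @ directions.T / 0.3**2
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Variational lower bound on the infomax objective:
#   I(O; X_s, X_f) >= H(O) + E[log q(o | x_s, x_f)]
q = classifier_probs(jumps)
h_o = np.log(n_options)  # entropy of the uniform option distribution
bound = h_o + np.log(q[np.arange(n_samples), options]).mean()
print(f"MI lower bound: {bound:.3f} nats (max {h_o:.3f})")
```

When the options are diverse, as here, the classifier identifies them almost perfectly and the bound approaches its maximum $H(O) = \log 4$; if all options produced the same transitions, the bound would collapse toward zero.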

2. BACKGROUND AND NOTATION

We assume the standard RL setting in a Markov decision process (MDP), following Sutton and Barto (2018). An MDP $M$ consists of a tuple $(\mathcal{X}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{X}$ is the set of states, $\mathcal{A}$ is the set of actions, $p : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to [0, 1]$ is the state transition function, $r : \mathcal{X} \times \mathcal{A} \to [r_{\min}, r_{\max}]$ is the reward function, and $0 \le \gamma \le 1$ is the discount factor. A policy is a probability distribution over actions conditioned on a state $x$: $\pi : \mathcal{X} \times \mathcal{A} \to [0, 1]$. For simplicity, we consider the episodic setting where each episode ends when a terminal state $x_T$ is reached. In this setting, the goal of an RL agent is to approximate a policy that maximizes the expected discounted cumulative reward per episode:

$$J_{\mathrm{RL}}(\pi) = \mathbb{E}_{\pi, x_0}\left[\sum_{t=0}^{T-1} \gamma^t R_t\right],$$

where $R_t = r(x_t, a_t)$ is the reward received at time $t$ and $x_0$ is the initial state of the episode. Relatedly, we define the action-value function $Q^\pi(x_t, a_t) \stackrel{\mathrm{def}}{=} \mathbb{E}_{x_t, a_t, \pi}\left[\sum_{t'=t}^{T-1} \gamma^{t'-t} R_{t'}\right]$ and the state-value function $V^\pi(x_t) \stackrel{\mathrm{def}}{=} \sum_a \pi(a|x_t) Q^\pi(x_t, a)$. Assuming that $\pi$ is differentiable with respect to its parameters $\theta_\pi$, a simple way to maximize this objective is the policy gradient method (Williams, 1992), which estimates the gradient by

$$\nabla_{\theta_\pi} J_{\mathrm{RL}}(\pi) = \mathbb{E}_{\pi, x_t}\left[\nabla_{\theta_\pi} \log \pi(a_t|x_t) \hat{A}(x_t, a_t)\right],$$

where $\hat{A}(x_t, a_t)$ is an estimate of the advantage function $A^\pi(x_t, a_t) \stackrel{\mathrm{def}}{=} Q^\pi(x_t, a_t) - V^\pi(x_t)$. A common choice of $\hat{A}(x_t, a_t)$ is the $N$-step TD error $\sum_{i=0}^{N-1} \gamma^i R_{t+i} + \gamma^N V(x_{t+N}) - V(x_t)$, where $N$ is a fixed rollout length (Mnih et al., 2016).
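As a small worked example of the advantage estimate above, the sketch below (with hypothetical helper and variable names) computes the $N$-step TD error for every step of a length-$N$ rollout via the usual backward recursion, bootstrapping from the value of the final state.

```python
import numpy as np

def n_step_advantages(rewards, values, bootstrap, gamma=0.99):
    """N-step TD advantages for a length-N rollout. For step t:
        A_t = sum_{i=0}^{N-t-1} gamma^i R_{t+i} + gamma^{N-t} V(x_N) - V(x_t),
    where values[t] approximates V(x_t) and bootstrap approximates V(x_N).
    """
    n = len(rewards)
    returns = np.empty(n)
    ret = bootstrap
    for t in reversed(range(n)):       # backward recursion over the rollout
        ret = rewards[t] + gamma * ret
        returns[t] = ret
    return returns - np.asarray(values)

# Toy rollout of length N = 3, with a zero-value terminal bootstrap.
adv = n_step_advantages([1.0, 0.0, 1.0], values=[0.5, 0.5, 0.5], bootstrap=0.0)
```

Note that later steps of the rollout bootstrap over a shorter horizon, which matches the batched actor-critic setup of Mnih et al. (2016).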

2.1. OPTIONS FRAMEWORK

Options (Sutton et al., 1999) provide a framework for representing temporally abstracted actions in RL. An option $o \in \mathcal{O}$ consists of a tuple $(\mathcal{I}_o, \beta_o, \pi_o)$, where $\mathcal{I}_o \subseteq \mathcal{X}$ is the initiation set, $\beta_o : \mathcal{X} \to [0, 1]$ is a termination function with $\beta_o(x)$ denoting the probability that option $o$ terminates in state $x$, and $\pi_o$ is the intra-option policy. Following related studies (Bacon et al., 2017; Harutyunyan et al., 2019), we assume that $\mathcal{I}_o = \mathcal{X}$ and learn only $\beta_o$ and $\pi_o$. Letting $x_s$ denote an option-starting state and $x_f$ denote an option-terminating state, we can write the option transition function as

$$P^o(x_f|x_s) = \beta_o(x_f)\,\mathbb{I}_{x_f = x_s} + (1 - \beta_o(x_s)) \sum_x p^{\pi_o}(x|x_s)\, P^o(x_f|x),$$

where $\mathbb{I}$ is the indicator function and $p^{\pi_o}$ is the policy-induced transition function $p^{\pi_o}(x'|x) \stackrel{\mathrm{def}}{=} \sum_{a \in \mathcal{A}} \pi_o(a|x)\, p(x'|x, a)$. Following Harutyunyan et al. (2019), we assume that all options eventually terminate, so that $P^o$ is a valid probability distribution over $x_f$. To present option-learning methods, we define two option-value functions, $Q_{\mathcal{O}}$ and $U_{\mathcal{O}}$. The option-value function $Q_{\mathcal{O}}$ denotes the value of selecting an option $o$ at a state $x_t$, defined by $Q_{\mathcal{O}}(x_t, o) \stackrel{\mathrm{def}}{=} \mathbb{E}_{\pi, \beta, \mu}\left[\sum_{t'=t}^{T-1} \gamma^{t'-t} R_{t'}\right]$.
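The option transition function $P^o(x_f \mid x_s)$ above can equivalently be sampled by rolling the option forward and flipping a $\beta_o(x)$-weighted coin at each state. A minimal sketch, using a hypothetical 1-D chain environment with hand-picked $\beta_o$ and $\pi_o$ (none of these names come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_option(x, beta, pi, step, max_steps=100):
    """Execute one option from state x: sample actions from the intra-option
    policy pi until the termination function beta fires (each visited state
    terminates the option with probability beta(x)).
    Returns the option-terminating state x_f and the trajectory length."""
    for t in range(max_steps):
        if rng.random() < beta(x):       # Bernoulli(beta_o(x)) termination
            return x, t
        a = pi(x)                        # a ~ pi_o(. | x)
        x = step(x, a)                   # x' ~ p(. | x, a), deterministic here
    return x, max_steps                  # safety cap: options must terminate

# Hypothetical 1-D chain: the option walks right and is likely to
# terminate once x >= 5.
x_f, length = run_option(
    x=0,
    beta=lambda x: 0.9 if x >= 5 else 0.01,
    pi=lambda x: +1,
    step=lambda x, a: x + a,
)
```

The assumption that all options eventually terminate is mirrored here by the `max_steps` cap, which keeps the sampled $P^o$ a proper distribution even for a badly behaved $\beta_o$.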

2.2. OPTION CRITIC ARCHITECTURE

OC (Bacon et al., 2017) provides an end-to-end algorithm for learning $\pi_o$ and $\beta_o$ in parallel. To optimize $\pi_o$, OC uses the intra-option policy gradient method, the option-conditional version of the policy gradient method. Analogously to $Q^\pi$ and $V^\pi$, we let $V_{\mathcal{O}}$ denote the marginalized option-value function $V_{\mathcal{O}}(x) \stackrel{\mathrm{def}}{=} \sum_o \mu(o|x)\, Q_{\mathcal{O}}(x, o)$, where $\mu : \mathcal{X} \times \mathcal{O} \to [0, 1]$ is the policy over options. The function $U_{\mathcal{O}}(x, o) \stackrel{\mathrm{def}}{=} (1 - \beta_o(x))\, Q_{\mathcal{O}}(x, o) + \beta_o(x)\, V_{\mathcal{O}}(x)$ is called the option-value function upon arrival (Sutton et al., 1999) and denotes the value of reaching a state $x$ with option $o$ before having selected a new option.
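In tabular form, both of these option-value quantities reduce to a couple of array operations. A minimal numpy sketch with made-up values for $Q_{\mathcal{O}}$, $\mu$, and $\beta$ (two states, three options):

```python
import numpy as np

# Hypothetical tabular example: rows index states x, columns index options o.
q_o  = np.array([[1.0, 0.5, 0.2],    # Q_O(x, o)
                 [0.1, 0.8, 0.3]])
mu   = np.array([[0.5, 0.3, 0.2],    # mu(o | x), each row sums to 1
                 [0.2, 0.6, 0.2]])
beta = np.array([[0.9, 0.1, 0.5],    # beta_o(x), termination probabilities
                 [0.2, 0.7, 0.4]])

# Marginalized option value: V_O(x) = sum_o mu(o|x) Q_O(x, o)
v_o = (mu * q_o).sum(axis=1)

# Option value upon arrival:
#   U_O(x, o) = (1 - beta_o(x)) Q_O(x, o) + beta_o(x) V_O(x)
# i.e., continue with o (prob 1 - beta) or terminate and reselect (prob beta).
u_o = (1.0 - beta) * q_o + beta * v_o[:, None]
```

The two limiting cases make the definition concrete: with $\beta_o(x) = 0$ the option never terminates at $x$ and $U_{\mathcal{O}} = Q_{\mathcal{O}}$, while with $\beta_o(x) = 1$ it always terminates and $U_{\mathcal{O}} = V_{\mathcal{O}}$.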

