DIVERSE EXPLORATION VIA INFOMAX OPTIONS

Abstract

In this paper, we study the problem of autonomously discovering temporally abstracted actions, or options, for exploration in reinforcement learning. For learning diverse options suitable for exploration, we introduce the infomax termination objective defined as the mutual information between options and their corresponding state transitions. We derive a scalable optimization scheme for maximizing this objective via the termination condition of options, yielding the InfoMax Option Critic (IMOC) algorithm. Through illustrative experiments, we empirically show that IMOC learns diverse options and utilizes them for exploration. Moreover, we show that IMOC scales well to continuous control tasks.

1. INTRODUCTION

Abstracting a course of action as a higher-level action, or an option (Sutton et al., 1999), is a key ability for reinforcement learning (RL) agents in several aspects, including exploration. In RL problems, an agent learns to approximate an optimal policy only from experience, given no prior knowledge. This leads to the necessity of exploration: an agent needs to explore poorly known states to collect environmental information, sometimes sacrificing immediate rewards. For statistical efficiency, it is important to explore the state space in a deep and directed manner, rather than taking uniformly random actions (Osband et al., 2019). Options can represent such directed behaviors by capturing long state jumps from their starting regions to their terminating regions. It has been shown that well-defined options can facilitate exploration by exploiting an environmental structure (Barto et al., 2013) or, more generally, by reducing the number of decision steps (Fruit and Lazaric, 2017). A key requirement for such explorative options is diversity. If all options have the same terminating region, they never encourage exploration; instead, options should lead to a variety of regions. However, automatically discovering diverse options in a scalable, online manner is challenging due to two difficulties: generalization and data limitation. Generalization with function approximation (Sutton, 1995) is important for scaling up RL methods to large or continuous domains. However, many existing option discovery methods for exploration are graph-based (e.g., Machado et al. (2017)) and incompatible with function approximation, with the notable exception of Jinnai et al. (2020). Moreover, discovering options online, in parallel with policies, requires working with limited data sampled from the environment and training the model that evaluates diversity in a data-efficient manner.
To address these difficulties, we introduce the infomax termination objective, defined as the mutual information (MI) between options and their corresponding state transitions. This formulation reflects a simple inductive bias: to encourage exploration, options should terminate in a variety of regions per starting region. Thanks to the information-theoretic formulation, this objective is compatible with function approximation and scales up to continuous domains. A key technical contribution of this paper is the optimization scheme for maximizing this objective. Specifically, we employ a simple classification model over options as a critic for termination conditions, which makes our method data-efficient and tractable in many domains. The paper is organized as follows. After introducing background and notation, we present the infomax termination objective and derive a practical optimization scheme using the termination gradient theorem (Harutyunyan et al., 2019). We then implement the infomax objective on the option-critic architecture (OC) (Bacon et al., 2017) with algorithmic modifications, yielding the InfoMax Option Critic (IMOC) algorithm. Empirically, we show that (i) IMOC improves exploration in structured environments, (ii) IMOC improves exploration in lifelong learning, (iii) IMOC scales to MuJoCo continuous control tasks, and (iv) the options learned by IMOC are diverse and meaningful. We then relate our method to other option-learning methods and to the empowerment concept (Klyubin et al., 2005), and finally give concluding remarks.

2. BACKGROUND AND NOTATION

We assume the standard RL setting in the Markov decision process (MDP), following Sutton and Barto (2018). An MDP $M$ consists of a tuple $(\mathcal{X}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{X}$ is the set of states, $\mathcal{A}$ is the set of actions, $p : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to [0, 1]$ is the state transition function, $r : \mathcal{X} \times \mathcal{A} \to [r_{\min}, r_{\max}]$ is the reward function, and $0 \le \gamma \le 1$ is the discount factor. A policy is a probability distribution over actions conditioned on a state $x$, $\pi : \mathcal{X} \times \mathcal{A} \to [0, 1]$. For simplicity, we consider the episodic setting, where each episode ends when a terminal state $x_T$ is reached. In this setting, the goal of an RL agent is to approximate a policy that maximizes the expected discounted cumulative reward per episode:
$$J_{\mathrm{RL}}(\pi) = \mathbb{E}_{\pi, x_0}\left[\sum_{t=0}^{T-1} \gamma^t R_t\right], \quad (1)$$
where $R_t = r(x_t, a_t)$ is the reward received at time $t$ and $x_0$ is the initial state of the episode. Relatedly, we define the action-value function $Q^{\pi}(x_t, a_t) \stackrel{\mathrm{def}}{=} \mathbb{E}_{x_t, a_t, \pi}\left[\sum_{t'=t}^{T-1} \gamma^{t'-t} R_{t'}\right]$ and the state-value function $V^{\pi}(x_t) \stackrel{\mathrm{def}}{=} \sum_{a} \pi(a|x_t) Q^{\pi}(x_t, a)$. Assuming that $\pi$ is differentiable with respect to its parameters $\theta_\pi$, a simple way to maximize the objective (1) is the policy gradient method (Williams, 1992), which estimates the gradient by:
$$\nabla_{\theta_\pi} J_{\mathrm{RL}}(\pi) = \mathbb{E}_{\pi, x_t}\left[\nabla_{\theta_\pi} \log \pi(a_t|x_t)\, \hat{A}(x_t, a_t)\right], \quad (2)$$
where $\hat{A}(x_t, a_t)$ is an estimate of the advantage function $A^{\pi}(x_t, a_t) \stackrel{\mathrm{def}}{=} Q^{\pi}(x_t, a_t) - V^{\pi}(x_t)$. A common choice of $\hat{A}(x_t, a_t)$ is the $N$-step TD error $\sum_{i=0}^{N-1} \gamma^i R_{t+i} + \gamma^N V(x_{t+N}) - V(x_t)$, where $N$ is a fixed rollout length (Mnih et al., 2016).
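As a concrete reference, the $N$-step advantage estimate above can be computed by backward accumulation of discounted rewards; a minimal sketch (the function name and signature are ours):

```python
def n_step_advantage(rewards, v_start, v_boot, gamma):
    """N-step TD advantage: sum_i gamma^i R_{t+i} + gamma^N V(x_{t+N}) - V(x_t).

    rewards : [R_t, ..., R_{t+N-1}] for a rollout of length N
    v_start : V(x_t), the baseline at the starting state
    v_boot  : V(x_{t+N}), the bootstrap value at the rollout end
    """
    ret = v_boot
    for r in reversed(rewards):  # fold rewards back-to-front
        ret = r + gamma * ret
    return ret - v_start
```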

2.1. OPTIONS FRAMEWORK

Options (Sutton et al., 1999) provide a framework for representing temporally abstracted actions in RL. An option $o \in O$ consists of a tuple $(I_o, \beta^o, \pi^o)$, where $I_o \subseteq \mathcal{X}$ is the initiation set, $\beta^o : \mathcal{X} \to [0, 1]$ is a termination function with $\beta^o(x)$ denoting the probability that option $o$ terminates in state $x$, and $\pi^o$ is the intra-option policy. Following related studies (Bacon et al., 2017; Harutyunyan et al., 2019), we assume that $I_o = \mathcal{X}$ and learn only $\beta^o$ and $\pi^o$. Letting $x_s$ denote an option-starting state and $x_f$ an option-terminating state, we can write the option transition function as:
$$P^o(x_f|x_s) = \beta^o(x_f)\, \mathbb{I}_{x_f = x_s} + \left(1 - \beta^o(x_s)\right) \sum_{x} p^{\pi^o}(x|x_s) P^o(x_f|x), \quad (3)$$
where $\mathbb{I}$ is the indicator function and $p^{\pi^o}$ is the policy-induced transition function $p^{\pi^o}(x'|x) \stackrel{\mathrm{def}}{=} \sum_{a \in \mathcal{A}} \pi^o(a|x) p(x'|x, a)$. We assume that all options eventually terminate, so that $P^o$ is a valid probability distribution over $x_f$, following Harutyunyan et al. (2019). To present option-learning methods, we define two option-value functions, $Q_O$ and $U_O$. $Q_O$ is the option-value function denoting the value of selecting an option $o$ at a state $x_t$, defined by $Q_O(x_t, o) \stackrel{\mathrm{def}}{=} \mathbb{E}_{\pi, \beta, \mu}\left[\sum_{t'=t}^{T-1} \gamma^{t'-t} R_{t'}\right]$. Analogously to $Q^{\pi}$ and $V^{\pi}$, we let $V_O$ denote the marginalized option-value function $V_O(x) \stackrel{\mathrm{def}}{=} \sum_o \mu(o|x) Q_O(x, o)$, where $\mu : \mathcal{X} \times O \to [0, 1]$ is the policy over options. The function $U_O(x, o) \stackrel{\mathrm{def}}{=} (1 - \beta^o(x)) Q_O(x, o) + \beta^o(x) V_O(x)$ is called the option-value function upon arrival (Sutton et al., 1999) and denotes the value of reaching a state $x$ with $o$ and not yet having selected the next option.
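Because the option transition function is defined recursively, it can be computed in small tabular settings by fixed-point iteration; a minimal sketch, assuming every option eventually terminates (the function name is ours):

```python
import numpy as np

def option_transition(P_pi, beta, iters=200):
    """Fixed-point iteration for the option transition P^o(x_f | x_s).

    P_pi[x, x'] : policy-induced transition probabilities p^{pi_o}(x'|x)
    beta[x]     : termination probability beta_o(x)
    """
    n = P_pi.shape[0]
    Po = np.eye(n)  # arbitrary initialization
    for _ in range(iters):
        # P^o(x_f|x_s) = beta(x_s) I[x_f = x_s]
        #              + (1 - beta(x_s)) sum_x P_pi(x|x_s) P^o(x_f|x)
        Po = np.diag(beta) + (1.0 - beta)[:, None] * (P_pi @ Po)
    return Po
```

With a terminating option, each row of the result is a valid distribution over terminating states.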

2.2. OPTION CRITIC ARCHITECTURE

OC (Bacon et al., 2017) provides an end-to-end algorithm for learning $\pi^o$ and $\beta^o$ in parallel. To optimize $\pi^o$, OC uses the intra-option policy gradient method, the option-conditional version of the gradient estimator (2):
$$\nabla_{\theta_{\pi^o}} J_{\mathrm{RL}}(\pi^o) = \mathbb{E}\left[\nabla_{\theta_{\pi^o}} \log \pi^o(a_t|x_t)\, \hat{A}^o(x_t, a_t)\right],$$
where $\hat{A}^o$ is an estimate of the option-conditional advantage $A^{\pi^o}$. For optimizing $\beta^o$, OC directly maximizes $Q_O$ using the estimated gradient:
$$\nabla_{\theta_{\beta^o}} Q_O(x, o) = \gamma\, \mathbb{E}\left[-\nabla_{\theta_{\beta^o}} \beta^o(x) \left(Q_O(x, o) - V_O(x)\right)\right]. \quad (4)$$
Intuitively, this decreases the termination probability $\beta^o(x)$ when holding $o$ is advantageous, i.e., when $Q_O(x, o) - V_O(x)$ is positive, and vice versa. Our method basically follows OC but has a different objective for learning $\beta^o$.

2.3. TERMINATION CRITIC

The recently proposed termination critic (TC) (Harutyunyan et al., 2019) optimizes $\beta^o$ by maximizing an information-theoretic objective called predictability:
$$J_{\mathrm{TC}}(P^o) = -H(X_f|o), \quad (5)$$
where $H$ denotes entropy and $X_f$ is the random variable denoting the option-terminating states. Maximizing $-H(X_f|o)$ makes the terminating region of an option smaller and more predictable. In other words, we can compress terminating regions by optimizing the objective (5). To differentiate this objective with respect to the termination parameters $\theta_{\beta^o}$, Harutyunyan et al. (2019) introduced the termination gradient theorem:

Theorem 1. Let $\beta^o$ be parameterized with a sigmoid function and $\tilde\beta^o$ denote the logit of $\beta^o$. Then
$$\nabla_{\theta_\beta} P^o(x_f|x_s) = \sum_x P^o(x|x_s)\, \nabla_{\theta_\beta} \tilde\beta^o(x) \left(\mathbb{I}_{x_f = x} - P^o(x_f|x)\right). \quad (6)$$

Leveraging Theorem 1, TC performs gradient ascent using the estimated gradient:
$$\nabla_{\theta_{\beta^o}} J_{\mathrm{TC}}(P^o) = -\mathbb{E}_{x_s, x, x_f}\left[\nabla_{\theta_\beta} \tilde\beta^o(x)\, \beta^o(x) \left(\log P^o_\mu(x) - \log P^o_\mu(x_f) + 1 - \frac{P^o(x_f|x_s)\, P^o_\mu(x)}{P^o_\mu(x_f)\, P^o(x|x_s)}\right)\right],$$
where $P^o_\mu(x)$ is the marginalized distribution of option-terminating states. Contrary to the termination objective of OC (4), this objective does not depend on state values, making the learned options robust against the reward structure of the environment. Our method is inspired by TC and optimizes a similar information-theoretic objective, not for predictability but for diversity. Moreover, our infomax objective requires an estimate of $p(o|x_s, x_f)$ instead of the option transition model $P^o(x_f|x_s)$, which makes our method tractable in more environments.

3. INFOMAX OPTION CRITIC

We now present the key idea behind the InfoMax Option Critic (IMOC) algorithm. We first formulate the infomax termination objective based on MI maximization, then derive a practical gradient estimation for maximizing this objective via $\beta^o$, utilizing the termination gradient theorem (Theorem 1). To evaluate the diversity of options, we use the MI between options and option-terminating states, conditioned on option-starting states:
$$J_{\mathrm{IMOC}} = I(X_f; O|X_s) = H(X_f|X_s) - H(X_f|X_s, O), \quad (7)$$
where $I$ denotes the conditional MI $I(A; B|Z) = H(A|Z) - H(A|B, Z)$, $X_s$ is the random variable denoting an option-starting state, and $O$ is the random variable denoting an option. We call this objective the infomax termination objective. Let us interpret $X_f|X_s$ as the random variable denoting a state transition induced by an option. Maximizing the MI (7) then (i) diversifies the state transitions $X_f|X_s$ and (ii) makes the option-conditional state transition $X_f|X_s, o$ more deterministic. Note that the marginalized MI $I(X_f; O)$ also makes sense in that it prevents the terminating region of each option from being too broad, as predictability (5) does. However, in this study, we focus on the conditional objective since it is easier to optimize. To illustrate the limitation of infomax options, we conducted an analysis in a toy deterministic chain environment with four states and two deterministic actions (go left and go right) in each state. Since deriving the exact solution is computationally difficult, we searched for options maximizing $H(X_f|X_s)$ among deterministic options, i.e., those with deterministic option-policies and termination functions (and thus the minimum $H(X_f|X_s, O)$). Among multiple solutions, Figure 1 shows two interesting instances of deterministic infomax options for $|O| = 2$. The left options enable diverse behaviors from each state, although they fail to capture the long-term behaviors generally favored in the literature (e.g., Mann et al. (2015)).
On the other hand, the right options enable relatively long, two-step state transitions, but the two options coincide at the rightmost state and its neighbor. Furthermore, an agent can be caught in a small loop consisting of the leftmost state and its neighbor. This example shows that (i) we can obtain short and diverse options with only a few options, (ii) to obtain long and diverse options, we need sufficiently many options, and (iii) with only a few options, an agent can be caught in a small loop and fail to visit diverse states. As we show in Appendix A, this 'small loop' problem cannot happen with four options. Thus, the number of options is important when maximizing the MI (7), and this is a limitation of our method. However, in our experiments, we show that diverse options can be learned in practice with a relatively small number of options. For maximizing the MI by gradient ascent, we now derive the gradient of the infomax termination objective (7). First, we estimate the gradient of the objective using the option transition model $P^o$ and the marginalized option-transition model $P(x_f|x_s) = \sum_o \mu(o|x_s) P^o(x_f|x_s)$.

Proposition 1. Let $\beta^o$ be parameterized with a sigmoid function. Given a trajectory $\tau = x_s, \ldots, x, \ldots, x_f$ sampled by $\pi^o$ and $\beta^o$, we can obtain unbiased estimates of $\nabla_{\theta_\beta} H(X_f|X_s)$ and $\nabla_{\theta_\beta} H(X_f|X_s, O)$ by
$$\nabla_{\theta_\beta} H(X_f|X_s) = \mathbb{E}_{x_s, x, x_f, o}\left[-\nabla_{\theta_\beta} \tilde\beta^o(x)\, \beta^o(x) \left(\log P(x|x_s) - \log P(x_f|x_s)\right)\right], \quad (8)$$
$$\nabla_{\theta_\beta} H(X_f|X_s, O) = \mathbb{E}_{x_s, x, x_f, o}\left[-\nabla_{\theta_\beta} \tilde\beta^o(x)\, \beta^o(x) \left(\log P^o(x|x_s) - \log P^o(x_f|x_s)\right)\right], \quad (9)$$
where $\tilde\beta^o(x)$ denotes the logit of $\beta^o(x)$. Note that the additional factor $\beta^o(x)$ is necessary because $x$ is not actually a terminating state. The proof follows Section 4 of Harutyunyan et al. (2019) and is given in Appendix B.1.
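In tabular settings like the chain example, the objective (7) can be evaluated exactly from a joint table over $(x_s, o, x_f)$; a hypothetical sketch (this is for illustration, not the paper's sample-based estimator):

```python
import numpy as np

def infomax_objective(p):
    """I(X_f; O | X_s) = H(X_f|X_s) - H(X_f|X_s,O) from a joint table p[x_s, o, x_f]."""
    p_s = p.sum(axis=(1, 2))   # Pr(x_s)
    p_sf = p.sum(axis=1)       # Pr(x_s, x_f)
    p_so = p.sum(axis=2)       # Pr(x_s, o)

    def neg_cond_entropy(joint, marginal):
        # sum over joint entries of joint * log(joint / marginal), skipping zeros
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(joint > 0, joint / marginal, 1.0)
        return np.sum(joint * np.log(ratio))

    h_f_s = -neg_cond_entropy(p_sf, p_s[:, None])       # H(X_f | X_s)
    h_f_so = -neg_cond_entropy(p, p_so[:, :, None])     # H(X_f | X_s, O)
    return h_f_s - h_f_so
```

For two options that deterministically terminate in two distinct states, the objective attains its maximum $\log 2$; identical options give zero.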
The estimated gradient of the infomax termination objective (7) can now be written as:
$$\nabla_{\theta_\beta} I(X_f; O|X_s) = \nabla_{\theta_\beta} H(X_f|X_s) - \nabla_{\theta_\beta} H(X_f|X_s, O) = \mathbb{E}_{x_s, x, x_f, o}\left[-\nabla_{\theta_\beta} \tilde\beta^o(x)\, \beta^o(x) \left(\log P(x|x_s) - \log P(x_f|x_s) - \left(\log P^o(x|x_s) - \log P^o(x_f|x_s)\right)\right)\right], \quad (10)$$
which means that we can optimize this objective by estimating $P^o$ and $P$. However, estimating probabilities over the state space can be difficult, especially when the state space is large, as is common in the deep RL setting. Hence, we reformulate the gradient using Bayes' rule, in a similar way as Gregor et al. (2017). The resulting expression involves the reverse option transition $p(o|x_s, x_f)$, the probability of having an option $o$ given a state transition $x_s, x_f$.

Proposition 2. We have
$$\nabla_{\theta_\beta} I(X_f; O|X_s) = \nabla_{\theta_\beta} H(X_f|X_s) - \nabla_{\theta_\beta} H(X_f|X_s, O) = \mathbb{E}_{x_s, x, x_f, o}\left[\nabla_{\theta_\beta} \tilde\beta^o(x)\, \beta^o(x) \left(\log p(o|x_s, x) - \log p(o|x_s, x_f)\right)\right]. \quad (11)$$
The proof is given in Appendix B.2. In the following sections, we estimate the gradient (11) by learning a classification model over options, $p(o|x_s, x_f)$, from sampled option transitions.
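To make the classifier-based estimation concrete, here is an illustrative sketch with a linear softmax classifier standing in for $p(o|x_s, x_f)$, plus the resulting per-sample ascent direction on the logit of $\beta^o(x)$ from (11); the class, the feature choice (concatenating $x_s$ and $x_f$), and all names are our assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class OptionClassifier:
    """Linear softmax model for p(o | x_s, x_f), trained by cross-entropy SGD."""

    def __init__(self, dim, n_options, lr=0.1):
        self.W = np.zeros((2 * dim, n_options))
        self.lr = lr

    def log_prob(self, x_s, x_f, o):
        feat = np.concatenate([x_s, x_f])
        return float(np.log(softmax(feat @ self.W)[o]))

    def update(self, x_s, x_f, o):
        # one SGD step on -log p(o | x_s, x_f); gradient is outer(feat, probs - onehot)
        feat = np.concatenate([x_s, x_f])
        probs = softmax(feat @ self.W)
        probs[o] -= 1.0
        self.W -= self.lr * np.outer(feat, probs)

def beta_logit_ascent(beta_x, clf, x_s, x, x_f, o):
    """Per-sample ascent direction on the logit of beta_o(x) from Eq. (11):
    beta_o(x) * (log p(o|x_s, x) - log p(o|x_s, x_f))."""
    return beta_x * (clf.log_prob(x_s, x, o) - clf.log_prob(x_s, x_f, o))
```

Intuitively, the direction is negative at states where the option is hard to identify relative to its final state, lowering the termination probability there.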

4. ALGORITHM

In this section, we introduce modifications adapting OC (Bacon et al., 2017) to our infomax termination objective. Specifically, we implement IMOC on top of Advantage-Option Critic (AOC), a synchronous variant of A2OC (Harb et al., 2018), yielding the Advantage-Actor InfoMax Option Critic (A2IMOC) algorithm. To stably estimate $p(o|x_s, x_f)$ for updating $\theta_\beta$, we sample recent option transitions $(o, x_s, x_f)$ from a replay buffer (see Appendix C.1). We follow AOC for optimizing option-policies except for the following modifications, and give a full description of A2IMOC in Appendix C.1. In continuous control experiments, we also used the Proximal Policy InfoMax Option Critic (PPIMOC), an implementation of IMOC based on PPO (Schulman et al., 2017). We give the details of PPIMOC in Appendix C.2.

Upgoing Option-Advantage Estimation Previous studies (e.g., Harb et al. (2018)) estimated the advantage $\hat{A}^{o_t}(x_t)$ ignoring the future rewards after the current option $o_t$ terminates. Since a longer rollout length often helps speed up learning (Sutton and Barto, 2018), it is preferable to extend this estimation to use all available future rewards. However, future rewards after option termination heavily depend on the selected option, often leading to underestimation of $\hat{A}^o$. Thus, to effectively use future rewards, we introduce an upgoing option-advantage estimation (UOAE). Let $t + k$ denote the time step at which the current option $o_t$ terminates in a sampled trajectory. UOAE then estimates the advantage by:
$$\hat{A}^{o_t}_{\mathrm{UOAE}} = -Q_O(x_t, o_t) + \begin{cases} \displaystyle\sum_{i=0}^{k-1} \gamma^i R_{t+i} + \underbrace{\max\left(\sum_{j=k}^{N} \gamma^j R_{t+j},\ \gamma^k V_O(x_{t+k})\right)}_{\text{upgoing estimation}} & (k < N) \\[8pt] \displaystyle\sum_{i=0}^{N-1} \gamma^i R_{t+i} + \gamma^N U_O(x_{t+N}, o_t) & (\text{otherwise}). \end{cases} \quad (12)$$
Similar to the upgoing policy update (Vinyals et al., 2019), the idea is to be optimistic about the future rewards after option termination by taking the maximum with $V_O$.
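A sketch of UOAE under one indexing convention (rewards $R_t, \ldots, R_{t+N-1}$ with bootstrapping at $x_{t+N}$); the function name is ours and the paper's exact index limits may differ:

```python
import numpy as np

def uoae(rewards, q_xt_ot, v_term, u_end, gamma, k):
    """Upgoing option-advantage estimation (UOAE), Eq. (12)-style.

    rewards : [R_t, ..., R_{t+N-1}], an N-step rollout
    q_xt_ot : Q_O(x_t, o_t)
    v_term  : V_O(x_{t+k}), value at the state where o_t terminated
    u_end   : U_O(x_{t+N}, o_t), used when o_t outlives the rollout
    k       : termination step of o_t (k >= N means no termination)
    """
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    disc = gamma ** np.arange(n)
    if k < n:
        before = np.sum(disc[:k] * rewards[:k])
        # upgoing: keep the observed post-termination return if it beats the bootstrap
        after = max(np.sum(disc[k:] * rewards[k:]), disc[k] * v_term)
        return before + after - q_xt_ot
    return np.sum(disc * rewards) + gamma ** n * u_end - q_xt_ot
```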

Policy Regularization Based on Mutual Information

To perform MI maximization not only on termination functions but also on option-policies, we introduce a policy regularization based on maximizing the conditional MI $I(A; O|X_s)$, where $A$ is the random variable denoting an action. This MI can be interpreted as a local approximation of the infomax objective (7), assuming that each action leads to a different terminating region. Although optimizing the infomax termination objective diversifies option-policies implicitly, we found that this regularization helps learn diverse option-policies reliably. Letting $\pi_\mu$ denote the marginalized policy $\pi_\mu(a|x) \stackrel{\mathrm{def}}{=} \sum_o \mu(o|x) \pi^o(a|x)$, we write $I(A; O|X_s)$ as:
$$I(A; O|X_s) = H(A|X_s) - H(A|O, X_s) = \mathbb{E}_{x_s}\left[H(\pi_\mu(x_s))\right] - \mathbb{E}_{x_s, o}\left[H(\pi^o(x_s))\right]. \quad (13)$$
We use this regularization together with the entropy bonus (maximization of $H(\pi^o)$) common in policy gradient methods (Williams and Peng, 1991; Mnih et al., 2016) and write the overall regularization term as $c_{H_\mu} H(\pi_\mu(x)) + c_H H(\pi^o(x))$, where $c_{H_\mu}$ and $c_H$ are the weights of each regularization term. Note that we add this regularization term not only on option-starting states but on all sampled states. This introduces some bias, which we did not find harmful in our experiments.
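The regularizer can be computed directly from the option distribution and the per-option action probabilities at a state; a minimal sketch in which the per-option entropy is averaged under $\mu$ (our simplification; names are ours):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution, skipping zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def policy_mi_regularizer(mu, pi, c_h_mu, c_h):
    """c_{H_mu} * H(pi_mu(x)) + c_H * E_{o~mu}[H(pi_o(x))] at one state x.

    mu : (K,)   option probabilities at x
    pi : (K, A) per-option action probabilities at x
    """
    mu = np.asarray(mu, dtype=float)
    pi = np.asarray(pi, dtype=float)
    pi_mu = mu @ pi  # marginalized policy pi_mu(a|x)
    h_per_option = np.array([entropy(row) for row in pi])
    return c_h_mu * entropy(pi_mu) + c_h * float(mu @ h_per_option)
```

With two deterministic options taking opposite actions, the marginalized entropy term is $\log 2$ while the per-option term vanishes, so maximizing the first term while minimizing the second diversifies the option-policies.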

5. EXPERIMENTS

We conducted a series of experiments to show two use cases of IMOC: exploration in structured environments and exploration for lifelong learning (Brunskill and Li, 2014). In this section, we used four options for all option-learning methods; we compare different numbers of options in Appendix D.7.

5.1. SINGLE TASK LEARNING IN STRUCTURED ENVIRONMENTS

We consider two 'Four Rooms' domains, where diverse options are beneficial for exploiting environmental structure.

Gridworld Four Rooms with Suboptimal Goals First, we tested IMOC in a variant of the classical Four Rooms Gridworld (Sutton et al., 1999) with suboptimal goals. An agent is initially placed in the upper left room and receives a positive reward only at goal states: two closer goals with +1 reward and the farthest goal with +2 reward, as shown in Figure 2a. The episode ends when the agent reaches one of the goals. The optimal policy aims at the farthest goal in the lower right room without converging to the suboptimal goals. Thus, an agent is required to learn multimodal behaviors leading to multiple goals, which options can help with. In this environment, we compared A2IMOC with A2C (Mnih et al., 2016; Wu et al., 2017), AOC, and our tuned version of AOC ('our AOC') with all enhancements presented in Section 4, to highlight the effectiveness of the termination objective among all of our improvements. We show the progress of average cumulative rewards over ten trials in Figure 2b. A2IMOC performed the best and found the optimal goal in most trials. AOC and our AOC also occasionally found the optimal goal, while A2C overfitted to one of the suboptimal goals in all trials. Figure 3 illustrates the learned option-policies and termination functions of each compared method. Terminating regions learned by A2IMOC are diverse and clearly separated per option: for example, option 0 mainly terminates in the right rooms, while option 3 terminates in the left rooms. Although all option-policies converged to the same near-optimal one, we show in Appendix D.4 that A2IMOC diversifies option-policies at the beginning of learning. On the other hand, terminating regions learned by AOC overlap each other, and notably, option 3 has no terminating region.
We assume this is because the termination objective (4) decreases the termination probability when the advantage is positive. We can see the same tendency in our AOC, although the terminating regions do not vanish.

MuJoCo Point Four Rooms To show the scalability of IMOC to continuous domains, we conducted experiments in a similar four-rooms environment based on the MuJoCo (Todorov et al., 2012) physics simulator and the "PointMaze" environment in rllab (Duan et al., 2016). This environment follows the Gridworld Four Rooms and has three goals, as shown in Figure 4a: two green goals with +0.5 reward and a red one with +1.0 reward. An agent controls the rotation and velocity of the orange ball and receives a positive reward only at the goals. In this environment, we compared PPIMOC with PPOC (Klissarov et al., 2017), PPO, and our tuned version of PPOC ('our PPOC'), which is the same as PPIMOC except for the termination objective. An important difference is that PPOC uses a parameterized $\mu$ trained by policy gradient, while PPIMOC and our PPOC use $\epsilon$-greedy option selection. Figure 4b shows the progress of average cumulative rewards over five trials. PPIMOC found the optimal goal in four of five trials and performed slightly better than PPO. In a qualitative

5.2. SINGLE TASK LEARNING IN CLASSICAL CONTINUOUS CONTROL

Additionally, we tested PPIMOC on two classical, hard-exploration control problems: Mountain Car (Barto et al., 1983) and Cartpole swing-up (Moore, 1990). In Mountain Car, PPIMOC and our PPOC successfully learned to reach the goal, while PPO and PPOC converged to running around the start position. In Cartpole swing-up, PPIMOC and our PPOC performed better than PPO and PPOC, but still failed to learn stable behaviors.

5.3. EXPLORATION FOR LIFELONG LEARNING

As another interesting application of IMOC, we consider the lifelong learning setting. Specifically, we tested IMOC in the 'Point Billiard' environment, where an agent receives a positive reward only when the blue object ball, pushed by the agent (orange ball), reaches a goal. Figure 6a shows all four configurations of Point Billiard that we used. There are four goals: green goals with +0.5 reward and a red one with +1.0 reward. The positions of the four goals move clockwise after 1M environmental steps, and agents need to adapt to the new goal positions. We compared PPIMOC with PPO, PPOC, and our PPOC in this environment. Figure 6b shows the progress of average cumulative rewards over five trials. PPIMOC and our PPOC performed the best and adapted to all reward transitions. On the other hand, PPO and PPOC struggled to adapt to the second transition, where the optimal goal moves behind the agent. The ablation study given in Appendix D.6 shows that UOAE (12) works effectively in this task. However, even without UOAE, PPIMOC still outperformed PPO. Thus, we argue that having diverse terminating regions is itself beneficial for adapting to new reward functions in environments with subgoals.

6. RELATED WORK

Options for Exploration Options (Sutton et al., 1999) in RL are widely studied for many applications, including speeding up planning (Mann and Mannor, 2014) and transferring skills (Konidaris and Barto, 2007; Castro and Precup, 2010). However, as discussed by Barto et al. (2013), their benefits for exploration are less well recognized. Many existing methods focus on discovering subgoals that effectively decompose the problem and then use these subgoals to encourage exploration. Subgoals are discovered based on various properties, including graphical features of state transitions (Simsek and Barto, 2008; Machado et al., 2017; Jinnai et al., 2019) and causality (Jonsson and Barto, 2006; Vigorito and Barto, 2010). In contrast, our method directly optimizes termination functions instead of discovering subgoals, capturing environmental structure implicitly. From a theoretical perspective, Fruit and Lazaric (2017) showed that good options can improve exploration when combined with state-action visitation bonuses. Using infomax options with visitation bonuses would be an interesting future direction.

End-to-End Learning of Options While many studies attempted to learn options and option-policies separately, Bacon et al. (2017) proposed OC to train option-policies and termination functions in parallel. OC has been extended with various types of inductive biases, including deliberation cost (Harb et al., 2018), interest (Khetarpal et al., 2020), and safety (Jain et al., 2018). Our study is directly inspired by the information-theoretic approach of Harutyunyan et al. (2019), as noted in Section 2.

Mutual Information and Skill Learning

MI often appears in the literature of intrinsically motivated (Singh et al., 2004) reinforcement learning as a driver of goal-directed behaviors. A well-known example is empowerment (Klyubin et al., 2005; Salge et al., 2013), obtained by maximizing the MI between a sequence of $k$ actions and the resulting state, $I(a_t, \ldots, a_{t+k}; x_{t+k}|x_t)$. Some works (Mohamed and Rezende, 2015; Zhao et al., 2020) implemented lower-bound maximization of empowerment as intrinsic rewards for RL agents, encouraging goal-directed behaviors in the absence of extrinsic rewards. We can interpret our objective $I(X_f; O|X_s)$ as empowerment between limited action sequences and states, corresponding to options. Gregor et al. (2017) employed this interpretation and introduced a method for maximizing the variational lower bound of this MI via option-policies, using the same model as our $p$, whereas we maximize the MI via termination functions. MI is also used for intrinsically motivated discovery of skills, under the assumption that diversity is important for acquiring useful skills. Eysenbach et al. (2019) proposed to maximize the MI between skills and states, $I(O; X)$, which Sharma et al. (2020) extended to the conditional MI $I(O; X'|X)$. Although our study shares the same motivation for using MI as these methods, i.e., diversifying sub-policies, the process of MI maximization is significantly different: our method optimizes termination functions, while their methods optimize conditional policies by using MI as intrinsic rewards.

7. CONCLUSION

We presented a novel end-to-end option-learning algorithm, the InfoMax Option Critic (IMOC), which uses the infomax termination objective to diversify options. Empirically, we showed that IMOC improves exploration in structured environments and for lifelong learning, even in continuous control tasks. We also quantitatively showed the diversity of the learned options. An interesting future direction would be combining our method for learning termination conditions with other methods for learning option-policies, e.g., those using MI as intrinsic rewards. A limitation of the infomax objective presented in this study is that it requires on-policy data for training. Hence, another interesting line of future work is extending IMOC to off-policy option discovery.

A MORE ANALYSIS ON DETERMINISTIC CHAIN EXAMPLE

Figure 7 shows infomax options with three and four options in the four-state deterministic chain example. Among multiple solutions, we selected options with one absorbing state per option (i.e., $\beta^o(x) = 1.0$ for exactly one $x$), which are partially the same as the right options in Figure 1. With four options, $\Pr(x_f|x_s) = 0.25$ for all $x_f$ and $x_s$, so $H(X_f|X_s)$ is maximal. This example shows that we need sufficiently many options for maximizing the MI; otherwise, an agent can be caught in a small loop, as described in Section 3.

B OMITTED PROOFS

B.1 PROOF OF PROPOSITION 1

First, we restate Assumption 2 of Harutyunyan et al. (2019).

Assumption 1. The distribution $d_\mu(\cdot|o)$ over the starting states of an option $o$ under policy $\mu$ is independent of its termination condition $\beta^o$.

Note that this assumption does not strictly hold, since $\epsilon$-greedy option selection depends on $\beta^o$ via $Q_O$. However, since this dependency is not strong, we found that $\beta^o$ reliably converged in our experiments.

Lemma 1. Under Assumption 1, the following equations hold:
$$\nabla_{\theta_\beta} H(X_f|X_s) = -\sum_{x_s, o} d_\mu(x_s, o) \sum_x P^o(x|x_s)\, \nabla_{\theta_\beta} \tilde\beta^o(x) \left(\log P(x|x_s) + 1 - \sum_{x_f} P^o(x_f|x) \left(\log P(x_f|x_s) + 1\right)\right), \quad (14)$$
$$\nabla_{\theta_\beta} H(X_f|X_s, O) = -\sum_{x_s, o} d_\mu(x_s, o) \sum_x P^o(x|x_s)\, \nabla_{\theta_\beta} \tilde\beta^o(x) \left(\log P^o(x|x_s) + 1 - \sum_{x_f} P^o(x_f|x) \left(\log P^o(x_f|x_s) + 1\right)\right). \quad (15)$$
Sampling $x_s, x, x_f, o$ from $d_\mu$ and $P^o$,
$$\nabla_{\theta_\beta} H(X_f|X_s) = \mathbb{E}_{x_s, x, x_f, o}\left[-\nabla_{\theta_\beta} \tilde\beta^o(x)\, \beta^o(x) \left(\log P(x|x_s) - \log P(x_f|x_s)\right)\right], \quad (16)$$
$$\nabla_{\theta_\beta} H(X_f|X_s, O) = \mathbb{E}_{x_s, x, x_f, o}\left[-\nabla_{\theta_\beta} \tilde\beta^o(x)\, \beta^o(x) \left(\log P^o(x|x_s) - \log P^o(x_f|x_s)\right)\right]. \quad (17)$$

Proof of Lemma 1. First, we prove Equation (14). Let $d_\mu(x_s)$ denote the probability distribution over $x_s$ under the policy $\mu$, i.e., the marginal distribution of $d_\mu(x_s|o)$, and let $d_\mu(x_s, o)$ denote the joint distribution of $x_s$ and $o$.
Then, we have:
$$\begin{aligned}
\nabla_{\theta_\beta} H(X_f|X_s) &= -\nabla_{\theta_\beta} \sum_{x_s} d_\mu(x_s) \sum_{x_f} P(x_f|x_s) \log P(x_f|x_s) \\
&= -\sum_{x_s} d_\mu(x_s) \sum_{x_f} \left(\nabla_{\theta_\beta} P(x_f|x_s) \log P(x_f|x_s) + P(x_f|x_s) \frac{\nabla_{\theta_\beta} P(x_f|x_s)}{P(x_f|x_s)}\right) \\
&= -\sum_{x_s} d_\mu(x_s) \sum_{x_f} \nabla_{\theta_\beta} P(x_f|x_s) \left(\log P(x_f|x_s) + 1\right) \\
&= -\sum_{x_s} d_\mu(x_s) \sum_{x_f} \sum_o \mu(o|x_s) \underbrace{\nabla_{\theta_\beta} P^o(x_f|x_s)}_{\text{apply Theorem (6)}} \left(\log P(x_f|x_s) + 1\right) \\
&= -\sum_{x_s} d_\mu(x_s) \sum_{x_f} \sum_o \mu(o|x_s) \sum_x P^o(x|x_s)\, \nabla_{\theta_\beta} \tilde\beta^o(x) \left(\mathbb{I}_{x_f = x} - P^o(x_f|x)\right) \left(\log P(x_f|x_s) + 1\right) \\
&= -\sum_{x_s} d_\mu(x_s) \sum_o \mu(o|x_s) \sum_x P^o(x|x_s)\, \nabla_{\theta_\beta} \tilde\beta^o(x) \sum_{x_f} \left(\mathbb{I}_{x_f = x} - P^o(x_f|x)\right) \left(\log P(x_f|x_s) + 1\right) \\
&= -\sum_{x_s, o} \underbrace{d_\mu(x_s, o)}_{\text{sample}} \sum_x \underbrace{P^o(x|x_s)}_{\text{sample}} \nabla_{\theta_\beta} \tilde\beta^o(x) \left(\log P(x|x_s) + 1 - \sum_{x_f} \underbrace{P^o(x_f|x)}_{\text{sample}} \left(\log P(x_f|x_s) + 1\right)\right).
\end{aligned}$$
Sampling $x_s, x, x_f, o$, we get (16). Next, we prove Equation (15):
$$\begin{aligned}
\nabla_{\theta_\beta} H(X_f|X_s, O) &= -\nabla_{\theta_\beta} \sum_{x_s, o} d_\mu(x_s, o) \sum_{x_f} P^o(x_f|x_s) \log P^o(x_f|x_s) \\
&= -\sum_{x_s, o} d_\mu(x_s, o) \sum_{x_f} \left(\nabla_{\theta_\beta} P^o(x_f|x_s) \log P^o(x_f|x_s) + P^o(x_f|x_s) \frac{\nabla_{\theta_\beta} P^o(x_f|x_s)}{P^o(x_f|x_s)}\right) \\
&= -\sum_{x_s, o} d_\mu(x_s, o) \sum_{x_f} \underbrace{\nabla_{\theta_\beta} P^o(x_f|x_s)}_{\text{apply Theorem (6)}} \left(\log P^o(x_f|x_s) + 1\right) \\
&= -\sum_{x_s, o} d_\mu(x_s, o) \sum_{x_f} \sum_x P^o(x|x_s)\, \nabla_{\theta_\beta} \tilde\beta^o(x) \left(\mathbb{I}_{x_f = x} - P^o(x_f|x)\right) \left(\log P^o(x_f|x_s) + 1\right) \\
&= -\sum_{x_s, o} d_\mu(x_s, o) \sum_x P^o(x|x_s)\, \nabla_{\theta_\beta} \tilde\beta^o(x) \sum_{x_f} \left(\mathbb{I}_{x_f = x} - P^o(x_f|x)\right) \left(\log P^o(x_f|x_s) + 1\right) \\
&= -\sum_{x_s, o} \underbrace{d_\mu(x_s, o)}_{\text{sample}} \sum_x \underbrace{P^o(x|x_s)}_{\text{sample}} \nabla_{\theta_\beta} \tilde\beta^o(x) \left(\log P^o(x|x_s) + 1 - \sum_{x_f} \underbrace{P^o(x_f|x)}_{\text{sample}} \left(\log P^o(x_f|x_s) + 1\right)\right).
\end{aligned}$$
Sampling $x_s, x, x_f, o$, we get (17). $\square$

B.2 PROOF OF PROPOSITION 2

Proof. First, we have that:
$$\begin{aligned}
\log P^o(x_f|x_s) - \log P(x_f|x_s) &= \log \frac{P^o(x_f|x_s)}{P(x_f|x_s)} = \log \frac{\Pr(x_f|x_s, o)}{P(x_f|x_s)} \\
&= \log \frac{\Pr(x_s, x_f, o)\, \Pr(x_s)}{\Pr(x_s, x_f)\, \Pr(x_s, o)} = \log \frac{\Pr(x_s, x_f, o)}{\Pr(x_s, x_f)\, \Pr(o|x_s)} \\
&= \log \frac{p(o|x_s, x_f)\, \Pr(x_s, x_f)}{\Pr(x_s, x_f)\, \Pr(o|x_s)} = \log \frac{p(o|x_s, x_f)}{\mu(o|x_s)}.
\end{aligned}$$
Using this equation, we can rewrite Equation (10) as:
$$\begin{aligned}
\nabla_{\theta_\beta} I(X_f; O|X_s) &= \nabla_{\theta_\beta} H(X_f|X_s) - \nabla_{\theta_\beta} H(X_f|X_s, O) \\
&= \mathbb{E}_{x_s, x, x_f, o}\left[-\nabla_{\theta_\beta} \tilde\beta^o(x)\, \beta^o(x) \left(\log P(x|x_s) - \log P(x_f|x_s) - \log P^o(x|x_s) + \log P^o(x_f|x_s)\right)\right] \\
&= \mathbb{E}_{x_s, x, x_f, o}\left[\nabla_{\theta_\beta} \tilde\beta^o(x)\, \beta^o(x) \left(\left(\log P^o(x|x_s) - \log P(x|x_s)\right) - \left(\log P^o(x_f|x_s) - \log P(x_f|x_s)\right)\right)\right] \\
&= \mathbb{E}_{x_s, x, x_f, o}\left[\nabla_{\theta_\beta} \tilde\beta^o(x)\, \beta^o(x) \left(\log \frac{p(o|x_s, x)}{\mu(o|x_s)} - \log \frac{p(o|x_s, x_f)}{\mu(o|x_s)}\right)\right] \\
&= \mathbb{E}_{x_s, x, x_f, o}\left[\nabla_{\theta_\beta} \tilde\beta^o(x)\, \beta^o(x) \left(\log p(o|x_s, x) - \log p(o|x_s, x_f)\right)\right]. \quad \square
\end{aligned}$$

C IMPLEMENTATION DETAILS

C.1 THE WHOLE ALGORITHM OF A2IMOC

Algorithm 1 shows a full description of A2IMOC. It follows the architecture of A2C (Mnih et al., 2016; Wu et al., 2017) and has multiple synchronous actors and a single learner. At each optimization step, we update $\pi^o$, $Q_O$, and $\beta^o$ from online trajectories collected by the actors. We also update $p(o|x_s, x_f)$ for estimating the gradient (11) and $\hat\mu(o|x_s)$ for the entropy regularization (13). To learn $p$ and $\hat\mu$ stably, we maintain a replay buffer $B_O$ that stores option transitions, implemented as a LIFO queue. Note that using older $(o, x_s, x_f)$ sampled from the buffer can introduce some bias in the learned $p$ and $\hat\mu$, since they depend on the current $\pi^o$ and $\beta^o$. However, we found that this is not harmful when the capacity of the replay buffer is reasonably small. We also add maximization of the entropy of $\beta^o$ to the loss function to prevent the termination probability from saturating at zero or one. The full objective for $\beta^o$ is then written as:
$$\log p(o|x_s, x) - \log p(o|x_s, x_f) + c_{H_\beta} H(\beta^o(x)),$$
where $c_{H_\beta}$ is the weight of the entropy bonus.
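The replay buffer $B_O$ can be sketched as a bounded queue that keeps only the most recent option transitions, so $p$ and $\hat\mu$ train on near-on-policy data; the class name and API are ours:

```python
import random
from collections import deque

class OptionTransitionBuffer:
    """Bounded buffer of recent option transitions (o, x_s, x_f); the oldest
    entries are evicted first, so only recent transitions are kept."""

    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)

    def push(self, o, x_s, x_f):
        self._buf.append((o, x_s, x_f))

    def __len__(self):
        return len(self._buf)

    def sample(self, n):
        return random.sample(list(self._buf), min(n, len(self._buf)))
```

Keeping the capacity small, as the text suggests, limits the bias from stale transitions at the cost of noisier classifier updates.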

C.2 IMPLEMENTATION OF PPIMOC

For continuous control tasks, we introduce PPIMOC on top of PPO, with the following modifications to A2IMOC.

Upgoing Option-Advantage Estimation for GAE. To use upgoing option-advantage estimation (12) with the Generalized Advantage Estimator (GAE) (Schulman et al., 2015b) common with PPO, we introduce an upgoing general option advantage estimation (UGOAE). Letting δ denote the TD error corresponding to the marginalized option-state values, $\delta_t = R_t + \gamma V_O(x_{t+1}) - V_O(x_t)$, we write the GAE for the marginalized policy π_μ as $\hat{A}_{\mu} = \sum_{i=0}^{N} (\gamma\lambda)^{i}\delta_{t+i}$, where λ is a coefficient. Supposing that $o_t$ terminates at step $t + k$ and letting $\delta^o$ denote the TD error corresponding to an option-state value, $\delta^o_t = R_t + \gamma Q_O(x_{t+1}, o) - Q_O(x_t, o)$, we formulate UGOAE by:

$$
\hat{A}^{o}_{\mathrm{UGOAE}} =
\begin{cases}
\displaystyle\sum_{i=0}^{k}(\gamma\lambda)^{i}\delta^{o}_{t+i} + \underbrace{\max\Big(\sum_{i=k+1}^{N}(\gamma\lambda)^{i}\delta_{t+i},\,0\Big)}_{\text{upgoing estimation}} & (k < N) \\[2ex]
\displaystyle\sum_{i=0}^{N-1}(\gamma\lambda)^{i}\delta^{o}_{t+i} + (\gamma\lambda)^{N}\big(R_{t+N} + \gamma U_O(x_{t+N+1}) - Q_O(x_{t+N}, o)\big) & (\text{otherwise}).
\end{cases}
\tag{18}
$$

The idea is the same as UOAE (12) and is optimistic about the advantage after option termination.

Clipped β_o Loss. In our preliminary experiments, we found that performing multiple steps of optimization on the gradient (11) led to destructively large updates and resulted in the saturation of β_o at zero or one. Hence, to perform PPO-style multiple updates on β_o, we introduce a clipped loss for β_o:

$$
\nabla_{\theta_\beta}\,\mathrm{clip}\big(\beta^{o}(x) - \beta^{o}_{\mathrm{old}}(x),\,-\epsilon_\beta,\,\epsilon_\beta\big)\big(\log p(o|x_s,x) - \log p(o|x_s,x_f)\big),
$$

where $\epsilon_\beta$ is a small coefficient, $\beta^{o}_{\mathrm{old}}$ is $\beta^{o}$ before the update, and $\mathrm{clip}(x, -\epsilon, \epsilon) = \max(-\epsilon, \min(\epsilon, x))$. Clipping makes the gradient zero when β_o is sufficiently different from β_o_old and inhibits too-large updates.
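The two cases of UGOAE (18) can be sketched as a short function. This is a minimal sketch under our own naming conventions (`delta_o`, `delta_mu`, `final_bootstrap` are not names from the released code); it assumes the TD errors for the window have already been computed:

```python
def ugoae(delta_o, delta_mu, k, n, gamma, lam, final_bootstrap=0.0):
    """Upgoing general option advantage estimation, a sketch of Eq. (18).

    delta_o  -- option TD errors delta^o_{t+i}, available up to termination
    delta_mu -- marginalized TD errors delta_{t+i} for i = 0..n
    k        -- step within the window at which the option terminates
    n        -- window length N
    final_bootstrap -- R_{t+N} + gamma * U_O(x_{t+N+1}) - Q_O(x_{t+N}, o),
                       used only when the option outlives the window (k >= n)
    """
    gl = gamma * lam
    if k < n:
        own = sum(gl ** i * delta_o[i] for i in range(k + 1))
        # Upgoing part: post-termination advantage, clipped at zero (optimistic).
        tail = sum(gl ** i * delta_mu[i] for i in range(k + 1, n + 1))
        return own + max(tail, 0.0)
    own = sum(gl ** i * delta_o[i] for i in range(n))
    return own + gl ** n * final_bootstrap
```

The `max(..., 0)` in the first branch is what makes the estimate upgoing: a negative post-termination advantage is never propagated back into the option's own advantage.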

D EXPERIMENTAL DETAILS

D.1 NETWORK ARCHITECTURE

Figure 8 illustrates the neural network architecture used in our experiments. In Gridworld experiments, we used the same state encoder for all networks and found that it is effective for diversifying π_o as an auxiliary loss (Jaderberg et al., 2017). However, in MuJoCo experiments, we found that sharing the encoder was not effective. In Gridworld experiments, we represented a state as an image and encoded it by a convolutional layer with 16 filters of size 4 × 4 with stride 1, followed by a convolutional layer with 16 filters of size 2 × 2 with stride 1, followed by a fully connected layer with 128 units. In MuJoCo experiments, we encoded the state by two fully connected layers with 64 units. π_o is parameterized as a Gaussian distribution with separate networks for standard deviations per option, similar to Schulman et al. (2015a). We used ReLU as the activation function for all hidden layers and initialized networks with orthogonal initialization (Saxe et al., 2014) in all experiments. Unless otherwise noted, we used the default parameters in PyTorch (Paszke et al., 2019) 1.5.0.
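For concreteness, the Gridworld encoder's feature sizes follow from the usual stride-1, no-padding convolution arithmetic. This is a minimal sketch with our own helper names, and the 12 × 12 grid below is a hypothetical example, not a size stated in the paper:

```python
def conv_out(size, kernel, stride=1):
    """Output length of a valid (no-padding) convolution along one dimension."""
    return (size - kernel) // stride + 1

def gridworld_encoder_dims(h, w, channels=16):
    """Spatial sizes after the 4x4 conv and then the 2x2 conv (both stride 1),
    plus the flattened feature count fed into the 128-unit fully connected layer."""
    h1, w1 = conv_out(h, 4), conv_out(w, 4)
    h2, w2 = conv_out(h1, 2), conv_out(w1, 2)
    return (h1, w1), (h2, w2), channels * h2 * w2
```

For example, a hypothetical 12 × 12 input gives 9 × 9 and then 8 × 8 feature maps, so 16 · 8 · 8 = 1024 flattened features enter the fully connected layer.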

D.2 HYPERPARAMETERS

When evaluating agents, we used ε-greedy option selection with ε_opt and did not use deterministic evaluation (i.e., an agent samples actions from π_o) in all experiments. We show the algorithm-specific hyperparameters of A2IMOC in Table 1. In Gridworld experiments, we used ε = 0.1 for AOC. Our AOC implementation is based on the released code² and uses truncated N-step advantage. Other parameters of AOC and A2C are the same as for A2IMOC. We also show the hyperparameters of PPIMOC in Table 2. For PPOC, we used c_μent = 0.001 as the weight of the entropy H(µ(x)). Our PPOC implementation is based on the released code³ and uses N-step (not truncated) GAE for computing the advantage. PPOC and PPO share all other parameters with PPIMOC.

MuJoCo Point environments are implemented based on "PointMaze" in rllab (Duan et al., 2016) with some modifications, mainly around collision detection. The maximum episode length is 1000. In the Four Rooms task, an agent receives a +0.5 or +1.0 reward when it reaches a goal; otherwise, an action penalty of −0.0001 is given. The reward structure is the same in the Billiard task: an agent receives a goal reward when the object ball reaches a goal and a penalty otherwise.
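The ε-greedy option selection used at evaluation can be sketched as follows. This is a minimal sketch with our own names; `eps_opt` stands in for the paper's ε_opt, the exploration probability over options:

```python
import random

def select_option(q_values, eps_opt, rng=random):
    """Epsilon-greedy over option values Q_O(x, .):
    with probability eps_opt pick a uniformly random option,
    otherwise pick the greedy (highest-value) option."""
    if rng.random() < eps_opt:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda o: q_values[o])
```

With ε_opt = 0 this reduces to purely greedy option selection; the sampled (non-deterministic) evaluation in the paper instead comes from drawing primitive actions from π_o.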

D.4 EARLY OPTION-POLICIES LEARNED IN GRIDWORLD FOUR ROOMS

Figure 9 shows early option-policies and termination probabilities in the Gridworld Four Rooms experiment. We can see that A2IMOC learned the most diverse option-policies.

D.5 QUALITATIVE ANALYSIS OF POINT FOUR ROOMS EXPERIMENT

Figure 10 shows visualizations of the learned option-policies and termination functions in MuJoCo Point Four Rooms, averaged over 100 uniformly sampled states per position. Arrows show the expected moving directions, computed from rotations and option-policies. Terminating regions and option-policies learned with PPIMOC are diverse. For example, option 1 tends to go down while option 2 tends to go right. In the sampled trajectory of PPIMOC, we can see that it mainly used option 1 but occasionally switched to options 0 and 2 for reaching the goal, and switched to option 3 around the goal. In contrast, for reaching the goal PPOC only used option 3, which does not terminate in any region. Options learned by our PPOC are almost the same: termination probability is high around the upper-left corner and option-policies point downward.

D.6 ABLATION STUDIES

We conducted ablation studies with three variants of A2IMOC/PPIMOC:
• c_Hμ = 0: Do not use the policy regularization based on MI (13).
• N-step Advantage: Use N-step advantage or N-step GAE instead of UOAE (12) or UGOAE (18).
• Truncated N-step Advantage: Compute the advantage ignoring future rewards instead of using UOAE or UGOAE.
Figure 11 shows all results in three tasks. We can see that UOAE is effective in all tasks, since both N-step advantage and truncated N-step advantage performed worse than UOAE. The policy regularization based on MI (13) is effective only in the Point Billiard lifelong learning task.

D.7 NUMBER OF OPTIONS

Figure 12 shows the performance of IMOC with varying numbers of options. Two options performed worse in all experiments, and we need four or more options to make use of IMOC. However, when we increase the number of options to six or eight, we do not see any performance improvement over four options, despite our analysis that we need sufficiently many options to cover the state space (Appendix A). This would be an interesting problem for future work.



Note that we did not include ACTC (Harutyunyan et al., 2019) for comparison, since we failed to reliably reproduce the reported results with our implementation of ACTC. Note that we chose the best model from multiple trials for visualization throughout the paper.
² https://github.com/jeanharb/a2oc_delib
³ https://github.com/mklissa/PPOC



Figure 1: Two instances of InfoMax options in the four-state deterministic chain. Left: Options are diverse, but all state transitions per option are one-step. Right: Options enable relatively long state transitions, but the option-policies are the same at some states.

Gridworld environment. The blue grid is the start and the green grids are goals.

Figure 2: Single Task learning in Gridworld Four Rooms.

Figure 3: Learned option-policies (π_o) and termination probabilities (β_o) for each option in Gridworld Four Rooms after 6 × 10^5 steps. Arrows show the probabilities of each action and heatmaps show the probabilities of each β_o. First row: A2IMOC. Terminating regions are clearly different from each other, and there are few overlapping regions. Second row: AOC. Almost all states have a high termination probability under options 0, 1, and 3, and option 2 has no terminating region. Third row: Our AOC. Termination regions are not clearly separated per option.

Figure 4: Single Task learning in MuJoCo Point Four Rooms.

Performance progression in Cartpole swing up.

Figure 5: Single Task learning in MuJoCo Point Four Rooms.

Figure 6: Lifelong learning in MuJoCo Point Billiard.

Figure 7: InfoMax options in the four state deterministic chain. Left: With three options. Right: With four options.

Figure 8: Neural Network architecture used for the Gridworld experiments (top) and the MuJoCo tasks (bottom).

Figure 9: Learned option-policies (π_o) and termination functions (β_o) for each option in Gridworld Four Rooms after 5 × 10^4 steps. Arrows show the probabilities of each action and heatmaps show the probabilities of each β_o. First row: A2IMOC. Option 0 tends to go down, options 1 and 2 tend to go right, and option 3 tends to go up. Second row: AOC. All options tend to go down. Third row: Our AOC. Options 0, 1, and 3 tend to go down and option 2 tends to go right.

Figure 10: Left: Learned option-policies (π_o) and termination functions (β_o) in the MuJoCo Point Four Rooms experiment. Arrows show the expected moving direction of the agent and heatmaps show the probabilities of each β_o. Right: Sampled trajectories of each method. First row: PPIMOC. Termination regions are clearly separated and option-policies are diverse. Second row: PPOC. Option 0 terminates almost everywhere, while options 2 and 3 do not terminate anywhere. Third row: Our PPOC. All options are almost the same.

Figure 11: Ablation studies. Top: Performance progression of A2IMOC in Gridworld Four Rooms. Bottom: Performance progression of PPIMOC in MuJoCo Point Four Rooms (left) and MuJoCo Point Billiard (right).

Figure 12: Varying the number of options. Top: Performance progression of A2IMOC in Gridworld Four Rooms. Bottom: Performance progression of PPIMOC in MuJoCo Point Four Rooms (left) and MuJoCo Point Billiard (right).

Algorithm 1 Advantage-Actor InfoMax Option Critic (A2IMOC)
1: Given: initial option-value Q_O, option-policy π_o, and termination function β_o.
2: Let B_O be a replay buffer for storing option-transitions.
3: for k = 1, ... do
   …
   Receive reward R_i and state x_{i+1}, taking a_i ∼ π_{o_{i+1}}(x_i)
   …
   Update β_o(x_i) via (11) and the maximization of c_{H_β} H(β_o(x_i))
   …

Hyperparameters of A2IMOC in Gridworld experiments.

Hyperparameters of PPIMOC in MuJoCo and Classical Control experiments.

