VARIATIONAL INTRINSIC CONTROL REVISITED

Abstract

In this paper, we revisit variational intrinsic control (VIC), an unsupervised reinforcement learning method for finding the largest set of intrinsic options available to an agent. In the original work by Gregor et al. (2016), two VIC algorithms were proposed: one that represents the options explicitly, and the other that does so implicitly. We show that the intrinsic reward used in the latter is subject to bias in stochastic environments, causing convergence to suboptimal solutions. To correct this behavior and achieve maximal empowerment, we propose two methods based respectively on the transitional probability model and the Gaussian mixture model. We substantiate our claims through rigorous mathematical derivations and experimental analyses.

1. INTRODUCTION

Variational intrinsic control (VIC), proposed by Gregor et al. (2016), is an unsupervised reinforcement learning algorithm that aims to discover as many intrinsic options as possible, i.e., policies with a termination condition that meaningfully affect the world. The main idea of VIC is to maximize the mutual information between the set of options and the final states, called empowerment. Maximal empowerment is desirable because it maximizes the information about the final states the agent can achieve with the available options. These options are independent of the extrinsic reward of the environment, so they can be considered the agent's universal knowledge about the environment. The concept of empowerment was introduced in (Klyubin et al., 2005; Salge et al., 2014), along with methods for measuring it based on expectation maximization (Arimoto, 1972; Blahut, 1972). These works defined the option as a sequence of a fixed number of actions. Yeung (2008) proposed to maximize the empowerment using the Blahut-Arimoto (BA) algorithm, but its complexity increases exponentially with the sequence length, rendering it impractical for high-dimensional and long-horizon options. Mohamed & Rezende (2015) adopted techniques from deep learning and variational inference (Barber & Agakov, 2003) and successfully applied empowerment maximization to high-dimensional and long-horizon control. However, this method maximizes the empowerment over open-loop options, meaning that the sequence of actions is chosen in advance and executed regardless of the (potentially stochastic) environment dynamics. This often impairs performance, as the agent cannot properly react to the environment, leading to a significant underestimation of empowerment (Gregor et al., 2016). To overcome this limitation, Gregor et al. (2016) proposed to use closed-loop options, meaning that the sequence of actions is chosen considering the visited states.
This type of option differs from those in Klyubin et al. (2005), Salge et al. (2014) and Mohamed & Rezende (2015) in that it has a termination condition instead of a fixed number of actions. They presented two algorithms: VIC with explicit options and VIC with implicit options (we will call them explicit and implicit VIC from here on). Explicit VIC defines a fixed number of options, and each option is sampled at the beginning of the trajectory, conditioning the policy of the agent until termination. In other words, both the state and the sampled option are inputs to the policy function of the agent. One clear limitation of explicit VIC is that it requires a preset number of options. This limitation applies not only to explicit VIC but also to some recent unsupervised learning algorithms that adopt a discrete option or skill with a predefined set (Machado et al., 2017; Eysenbach et al., 2018). Obviously, presetting the number of options limits the number of options an agent can learn, impeding the maximal level of empowerment. Moreover, choosing a proper number of options is not straightforward, since the maximum of the objective for a given number of options depends on several unknown environmental factors, such as the cardinality of the state space and the transitional model. To overcome this issue, Gregor et al. (2016) proposed implicit VIC, which defines the option as the trajectory (i.e., the sequence of states and actions) until termination. Multiple trajectories can lead to the same final state, and implicit VIC learns to maximize the mutual information between the final state and the trajectory by controlling the latter. As a result, the number of options is no longer limited by a preset number, and neither is the empowerment. Despite this advantage, however, these implicit options induce a bias in the intrinsic reward and hinder implicit VIC from achieving maximal empowerment.
This effect grows with the stochasticity of the environment, and it may cause serious degradation of empowerment, preventing the agent from learning the universal knowledge of the environment. In this paper, we aim to solve this empowerment degradation of implicit VIC under stochastic dynamics. To this end, we revisit variational intrinsic control and make the following contributions: 1. We show that the intrinsic reward in implicit VIC suffers from a variational bias in stochastic environments, causing convergence to suboptimal solutions (Section 2). 2. To compensate for this bias and achieve maximal empowerment, we suggest two modifications of implicit VIC: environment dynamics modeling with the transitional probability model (Section 3) and with the Gaussian mixture model (Section 4).

2. VARIATIONAL BIAS OF IMPLICIT VIC IN STOCHASTIC ENVIRONMENTS

In this section, we derive the variational bias of implicit VIC under stochastic environment dynamics. First, we adopt the definitions of the termination action and the final state from Gregor et al. (2016) and briefly review VIC. The termination action terminates the option and yields the final state s_f = s_t, independently of the environmental action space. VIC aims to maximize the empowerment, i.e., the mutual information between the option Ω and the final state s_f, which can be written as

I(Ω, s_f | s_0) = -Σ_Ω p(Ω|s_0) log p(Ω|s_0) + Σ_{Ω, s_f} p(s_f|s_0, Ω) p(Ω|s_0) log p(Ω|s_0, s_f). (1)

Since p(Ω|s_0, s_f) is intractable, VIC (Gregor et al., 2016) derives the variational bound I^{VB} ≤ I and maximizes it instead:

I^{VB}(Ω, s_f | s_0) = -Σ_Ω p(Ω|s_0) log p(Ω|s_0) + Σ_{Ω, s_f} p(s_f|s_0, Ω) p(Ω|s_0) log q(Ω|s_0, s_f), (2)

where q(Ω|s_0, s_f) is the inference model to be trained. When I^{VB} is maximized, we have q(Ω|s_0, s_f) = p(Ω|s_0, s_f) and achieve the maximum of I. As explained in Section 1, explicit VIC samples an explicit option at the beginning of a trajectory, and the option conditions the policy as π(a|s, Ω) until termination. Due to the randomness of the policy, the final state is undetermined for a given option until the policy converges to achieve a specific option. Unlike explicit VIC, implicit VIC defines its option as the trajectory until termination, and hence the final state is determined for a given option. This can be expressed as

p(s_f | Ω, s_0) = 1 if s_f = s_{f|Ω}, and 0 otherwise, (3)

where s_{f|Ω} is the final state of an option Ω. This is a key characteristic of implicit VIC and is essential for deriving the main results of this paper; we will often use this equation to eliminate s_f in summations. Interestingly, this makes a difference in empowerment maximization between explicit and implicit VIC, which can be seen by rewriting I(Ω, s_f | s_0) as

I(Ω, s_f | s_0) = H(s_f | s_0) - H(s_f | Ω, s_0).
Note that H(s_f | Ω, s_0) is 0 for implicit VIC, since s_f is determined for a given Ω. The intrinsic reward of implicit VIC (Gregor et al., 2016) is given by

r^{I^{VB}}_Ω = Σ_{(τ_t, a_t, s_{t+1}) ∈ Ω} log [ q(a_t|τ_t, s_{f|Ω}) / p(a_t|τ_t) ], (7)

where q(a_t|τ_t, s_f) is the inference model and p(a_t|τ_t) is the policy of the agent. It can be shown that r^{I^{VB}}_Ω comes from the first part of (6) (see Appendix A for details). Under deterministic environment dynamics, the transitional part log [ p(s_{t+1}|τ_t, a_t, s_{f|Ω}) / p(s_{t+1}|τ_t, a_t) ] cancels out, since both the numerator and the denominator are always 1. This is not the case under stochastic environment dynamics, and it yields the variational bias b^{VIC}_Ω in the intrinsic reward:

b^{VIC}_Ω = Σ_{(τ_t, a_t, s_{t+1}) ∈ Ω} log [ p(s_{t+1}|τ_t, a_t, s_{f|Ω}) / p(s_{t+1}|τ_t, a_t) ]. (8)

From (8), we see that this bias comes from the difference between the transitional probabilities with and without the given final state. This difference is large when s_{f|Ω} in the numerator plays a crucial role, which then causes a large bias. One extreme case is when s_{t+1} is a necessary transition to reach s_{f|Ω} but not the only possible transition from (τ_t, a_t). Then p(s_{t+1}|τ_t, a_t, s_{f|Ω}) is 1 but p(s_{t+1}|τ_t, a_t) is not, yielding a large bias. In Section 5, we provide experimental evidence that this variational bias leads to suboptimal training. Even though the original VIC (Gregor et al., 2016) subtracts a baseline b(s_0) from r^{I^{VB}}_Ω to reduce the variance of learning, it cannot compensate for this bias, since the bias also depends on Ω. In the next section, we analyze the mutual information (1) in more detail under stochastic environment dynamics and define the variational estimate of (1), I^{VE}, for training.
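To make the bias concrete, the following minimal Python sketch evaluates b^{VIC}_Ω for a toy one-step stochastic environment. The environment, its transition table, and the `bias` helper are illustrative assumptions, not from the paper.

```python
import math

# Toy one-step stochastic environment: taking action `a` in s0 leads to state
# "A" with probability 0.5 and "B" with probability 0.5; the agent then
# terminates, so the final state s_f equals the state it landed in.
p_next = {"A": 0.5, "B": 0.5}          # p(s_{t+1} | tau_t, a_t)

def bias(final_state):
    """b_Omega = sum_t log [ p(s_{t+1}|tau_t,a_t,s_f) / p(s_{t+1}|tau_t,a_t) ].

    With a single stochastic transition, conditioning on the final state makes
    the transition deterministic: p(s_{t+1}|tau_t,a_t,s_f) = 1.
    """
    p_cond = 1.0                       # transition is forced once s_f is known
    return math.log(p_cond / p_next[final_state])

b = bias("A")
print(b)  # log(1/0.5) = log 2 > 0: nonzero bias under stochastic dynamics
```

Under deterministic dynamics `p_next` would put probability 1 on a single successor and the bias would vanish, matching the discussion above.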

3. IMPLICIT VIC WITH TRANSITIONAL PROBABILITY MODEL

In this section, we analyze I(Ω, s_f | s_0) under stochastic environment dynamics and propose to explicitly model the transitional probabilities to fix the variational bias. First, for a given option and final state, we define p_π(Ω|s_f, s_0), p_ρ(Ω|s_f, s_0), p_π(Ω|s_0) and p_ρ(Ω|s_0) as follows:

p_π(Ω|s_f, s_0) = Π_{(τ_t, a_t, s_{t+1}) ∈ Ω} p_π(a_t|τ_t, s_f),
p_ρ(Ω|s_f, s_0) = Π_{(τ_t, a_t, s_{t+1}) ∈ Ω} p_ρ(s_{t+1}|τ_t, a_t, s_f),
p_π(Ω|s_0) = Π_{(τ_t, a_t, s_{t+1}) ∈ Ω} p_π(a_t|τ_t),
p_ρ(Ω|s_0) = Π_{(τ_t, a_t, s_{t+1}) ∈ Ω} p_ρ(s_{t+1}|τ_t, a_t). (9)

Note that p(Ω|s_f, s_0) = p_π(Ω|s_f, s_0) p_ρ(Ω|s_f, s_0) is the true distribution of Ω for a given s_f, where p_π(Ω|s_f, s_0) is the policy-related part and p_ρ(Ω|s_f, s_0) is the transitional part of p(Ω|s_f, s_0); the same decomposition holds for p(Ω|s_0), p_π(Ω|s_0) and p_ρ(Ω|s_0). It is necessary to consider the transitional probabilities, since they induce the variational bias in the intrinsic reward. Hence, we model (9) as follows (products over (τ_t, a_t, s_{t+1}) ∈ Ω):

π_q(Ω|s_f, s_0) = Π π_q(a_t|τ_t, s_f),
ρ_q(Ω|s_f, s_0) = Π ρ_q(s_{t+1}|τ_t, a_t, s_f),
π_p(Ω|s_0) = Π π_p(a_t|τ_t),
ρ_p(Ω|s_0) = Π ρ_p(s_{t+1}|τ_t, a_t), (10)

where π_q, ρ_q, π_p and ρ_p are our models to be trained. Since we know the policy of the agent, we have p_π(a_t|τ_t) = π_p(a_t|τ_t); the other probabilities are trained by our algorithms. Now we can rewrite I(Ω, s_f | s_0) using (9):

I(Ω, s_f | s_0) = Σ_{Ω, s_f} p(Ω, s_f|s_0) [ log p_ρ(Ω|s_f, s_0) p_π(Ω|s_f, s_0) - log p_ρ(Ω|s_0) p_π(Ω|s_0) ]. (11)

Using (10), we define I^{VE} as follows:

I^{VE}(Ω, s_f | s_0) = Σ_{Ω, s_f} p(Ω, s_f|s_0) [ log ρ_q(Ω|s_f, s_0) π_q(Ω|s_f, s_0) - log ρ_p(Ω|s_0) π_p(Ω|s_0) ]. (12)

This is an estimate of the mutual information between Ω and s_f with transitional models. Note that, unlike I^{VB}, I^{VE} is not a variational lower bound on I; hence the derivation of VIC in Gregor et al. (2016) is not applicable in this case.
To tackle this problem, we start from the absolute difference |I - I^{VE}|, which can be bounded as

|I - I^{VE}| ≤ U^{VE} = Σ_{s_f} p(s_f|s_0) D_KL[ p_π(·|s_f, s_0) p_ρ(·|s_f, s_0) ‖ π_q(·|s_f, s_0) ρ_q(·|s_f, s_0) ] + D_KL[ p_π(·|s_0) p_ρ(·|s_0) ‖ π_p(·|s_0) ρ_p(·|s_0) ]. (13)

See Appendix B for the derivation. Note that U^{VE} is a sum of positively weighted KL divergences, which means that U^{VE} → 0 if and only if π_q(·|s_f, s_0) → p_π(·|s_f, s_0), ρ_q(·|s_f, s_0) → p_ρ(·|s_f, s_0) and ρ_p(·|s_0) → p_ρ(·|s_0) for all s_f. In other words, our estimate of the mutual information converges to the true value as our models (10) converge to the true distributions (9). This makes sense: we can estimate the true value of the mutual information if we know the true distributions. Hence, we minimize U^{VE} with respect to π_q, ρ_q and ρ_p. If π_q, ρ_q and ρ_p are parameterized by θ^q_π, θ^q_ρ and θ^p_ρ, we can obtain the gradients of U^{VE} using (3) as follows:

∇_{θ^q_π} U^{VE} = -Σ_Ω p(Ω|s_0) ∇_{θ^q_π} log π_q(Ω|s_{f|Ω}, s_0),
∇_{θ^q_ρ} U^{VE} = -Σ_Ω p(Ω|s_0) ∇_{θ^q_ρ} log ρ_q(Ω|s_{f|Ω}, s_0),
∇_{θ^p_ρ} U^{VE} = -Σ_Ω p(Ω|s_0) ∇_{θ^p_ρ} log ρ_p(Ω|s_0), (14)

which can be estimated by the sample mean. Once we have (π_q(·|s_f, s_0), ρ_q(·|s_f, s_0), ρ_p(·|s_0)) ≈ (p_π(·|s_f, s_0), p_ρ(·|s_f, s_0), p_ρ(·|s_0)) for all s_f, we can update the policy to maximize I. If π_p is parameterized by θ^p_π, the gradients ∇_{θ^p_π} I and ∇_{θ^p_π} I^{VE} can be obtained as follows using (3):

∇_{θ^p_π} I = Σ_Ω p(Ω|s_0) [ log p_π(Ω|s_{f|Ω}, s_0) p_ρ(Ω|s_{f|Ω}, s_0) - log π_p(Ω|s_0) p_ρ(Ω|s_0) ] ∇_{θ^p_π} log π_p(Ω|s_0),
∇_{θ^p_π} I^{VE} = Σ_Ω p(Ω|s_0) [ log π_q(Ω|s_{f|Ω}, s_0) ρ_q(Ω|s_{f|Ω}, s_0) - log π_p(Ω|s_0) ρ_p(Ω|s_0) ] ∇_{θ^p_π} log π_p(Ω|s_0), (15)

where the bracketed terms are r^I_Ω and r^{I^{VE}}_Ω respectively, and p_π(Ω|s_0) is replaced by π_p(Ω|s_0) since we know the true value of the policy (see Appendix A for details).
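Minimizing U^{VE} with respect to the transition model amounts to maximum-likelihood fitting via the sample-mean gradient, as in (14). The following NumPy sketch illustrates this for a single (state, action) context with a softmax transition model; the context, constants, and three-state setup are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# One (state, action) context with 3 possible next states; p_true plays the
# role of the unknown environment transition p(s_{t+1}|tau_t, a_t).
p_true = np.array([0.6, 0.3, 0.1])
samples = rng.choice(3, size=5000, p=p_true)   # transitions gathered by the agent
emp_freq = np.bincount(samples, minlength=3) / len(samples)

# Fit softmax logits of rho_p by ascending the sample-mean log-likelihood
# gradient, i.e. descending the U^VE gradient in (14).
logits = np.zeros(3)
lr = 0.5
for _ in range(200):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # grad of mean log-likelihood w.r.t. logits: empirical one-hot mean - probs
    logits += lr * (emp_freq - probs)

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(np.round(probs, 2))  # close to the empirical frequencies (≈ [0.6, 0.3, 0.1])
```

The fitted model converges to the empirical transition frequencies, which is exactly the fixed point ρ_p(·|s_0) → p_ρ(·|s_0) required for U^{VE} → 0.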
From (15), we see that ∇_{θ^p_π} I^{VE} → ∇_{θ^p_π} I as π_q(·|s_f, s_0) → p_π(·|s_f, s_0), ρ_q(·|s_f, s_0) → p_ρ(·|s_f, s_0) and ρ_p(·|s_0) → p_ρ(·|s_0) for all s_f. This means that we can estimate the correct gradient of the mutual information w.r.t. the policy once our models (10) converge to the true distributions (9). Note that for deterministic environments we can omit ρ_q(·|s_{f|·}, s_0) and ρ_p(·|s_0), since they are always 1. Then we have I^{VE} = I^{VB} and ∇_{θ^q_π} U^{VE} = -∇_{θ^q_π} I^{VB}, which means that maximizing I^{VB} is equivalent to minimizing U^{VE} for θ^q_π (i.e., it is equivalent to the original implicit VIC). Finally, Algorithm 1 summarizes the modified implicit VIC with the transitional probability model; the steps added to the original implicit VIC are marked with (*). Note that Algorithm 1 is not always practically applicable, since it is hard to model p(s_{t+1}|τ_t, a_t) and p(s_{t+1}|τ_t, a_t, s_f) when the cardinality of the state space is unknown (we cannot set the number of output nodes for the softmax). In our experiments with Algorithm 1, we assume that the cardinality of the state space is known, which allows us to model ρ_p(s_{t+1}|τ_t, a_t) and ρ_q(s_{t+1}|τ_t, a_t, s_f) with a softmax output. In the next section, we propose a practically applicable method that avoids this intractability.

Algorithm 1 Implicit VIC with transitional probability model
  Initialize s_0, η, T_train, θ^q_π, θ^q_ρ, θ^p_π and θ^p_ρ.
  for i_train = 1 to T_train do
    Follow π_p(a_t|τ_t), resulting in Ω = (s_0, a_0, ..., s_f).
    r^{I^{VE}}_Ω ← Σ_t [ log π_q(a_t|τ_t, s_f) - log π_p(a_t|τ_t) ] from (15)
    r^{I^{VE}}_Ω ← r^{I^{VE}}_Ω + Σ_t [ log ρ_q(s_{t+1}|τ_t, a_t, s_f) - log ρ_p(s_{t+1}|τ_t, a_t) ] from (15) (*)
    Update each parameter:
      θ^p_π ← θ^p_π + η r^{I^{VE}}_Ω ∇_{θ^p_π} Σ_t log π_p(a_t|τ_t) from (15)
      θ^q_π ← θ^q_π + η ∇_{θ^q_π} Σ_t log π_q(a_t|τ_t, s_f) from (14)
      θ^q_ρ ← θ^q_ρ + η ∇_{θ^q_ρ} Σ_t log ρ_q(s_{t+1}|τ_t, a_t, s_f) from (14) (*)
      θ^p_ρ ← θ^p_ρ + η ∇_{θ^p_ρ} Σ_t log ρ_p(s_{t+1}|τ_t, a_t) from (14) (*)
  end for
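The corrected intrinsic reward of Algorithm 1, including the (*) transition term, can be sketched with tabular models. The tables and the one-transition trajectory below are hypothetical illustrations, not the paper's implementation.

```python
import math

def intrinsic_reward(trajectory, s_f, pi_q, pi_p, rho_q, rho_p):
    """r^{I^VE}_Omega: policy log-ratio (original VIC) plus the (*) transition
    log-ratio correction, summed over the trajectory."""
    r = 0.0
    for (tau, a, s_next) in trajectory:
        r += math.log(pi_q[(tau, a, s_f)]) - math.log(pi_p[(tau, a)])                 # policy part
        r += math.log(rho_q[(tau, a, s_next, s_f)]) - math.log(rho_p[(tau, a, s_next)])  # (*) transition part
    return r

# One-transition example where the transition models have converged:
# rho_q = 1 (transition forced given s_f), rho_p = 0.5 (two equally likely
# successors); pi_q = pi_p, so only the transition correction remains.
traj = [("s0", "a", "A")]
r = intrinsic_reward(
    traj, "A",
    pi_q={("s0", "a", "A"): 0.5}, pi_p={("s0", "a"): 0.5},
    rho_q={("s0", "a", "A", "A"): 1.0}, rho_p={("s0", "a", "A"): 0.5},
)
print(r)  # log 2: exactly the transition log-ratio the (*) steps account for
```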

4. IMPLICIT VIC WITH GAUSSIAN MIXTURE MODEL

In this section, we propose an alternative method to overcome the limitation of Algorithm 1 by modeling smoothed transitional distributions. This allows us to use a Gaussian mixture model (GMM) (Pearson, 1894) or other continuous distributional models for the transitional distributions. First, we smooth p(s_{t+1}|τ_t, a_t, s_f) and p(s_{t+1}|τ_t, a_t) into f_σ(x_{t+1}|τ_t, a_t, s_f) and f_σ(x_{t+1}|τ_t, a_t), where x_{t+1} = s_{t+1} + z_{t+1} and z_{t+1} ~ N(0, σ² I_d):

f_σ(x_{t+1}|τ_t, a_t, s_f) = Σ_{s' ∈ S(τ_t, a_t, s_f)} p(s'|τ_t, a_t, s_f) f_σ(x_{t+1} - s'; 0, σ² I_d),
f_σ(x_{t+1}|τ_t, a_t) = Σ_{s' ∈ S(τ_t, a_t)} p(s'|τ_t, a_t) f_σ(x_{t+1} - s'; 0, σ² I_d), (16)

where S(τ_t, a_t, s_f) = {s' | p(s'|τ_t, a_t, s_f) > 0}, S(τ_t, a_t) = {s' | p(s'|τ_t, a_t) > 0} and d is the dimension of the state. Then, using the GMM, we model (16) as f^q_σ(x_{t+1}|τ_t, a_t, s_f) and f^p_σ(x_{t+1}|τ_t, a_t):

f^q_σ(x_{t+1}|τ_t, a_t, s_f) = Σ_{i=1}^{n_gmm} w_i(τ_t, a_t, s_f) f_σ(x_{t+1}; μ_i(τ_t, a_t, s_f), σ² I_d),
f^p_σ(x_{t+1}|τ_t, a_t) = Σ_{i=1}^{n_gmm} w_i(τ_t, a_t) f_σ(x_{t+1}; μ_i(τ_t, a_t), σ² I_d). (17)

Note that if we set n_gmm > max_{τ_t, a_t} |S(τ_t, a_t)|, (17) can perfectly fit (16). Now, using (17), we define I^{VE}_σ, the variational estimate with smoothing, as follows:

I^{VE}_σ = Σ_{Ω, s_f} p(Ω, s_f|s_0) [ log π_q(Ω|s_f, s_0) f^q_σ(Ω|s_f, s_0) - log π_p(Ω|s_0) f^p_σ(Ω|s_0) ]. (18)

Note that I^{VE}_σ is not a variational lower bound on I either, so the derivation of implicit VIC in Gregor et al. (2016) again does not apply. As in Section 3, we start from the absolute difference between I and I^{VE}_σ. An upper bound on |I - I^{VE}_σ| can be obtained as follows:

|I - I^{VE}_σ| ≤ U^{VE}_{σ,1} + U^{VE}_{σ,2}, with
U^{VE}_{σ,1} = | Σ_{Ω, s_f} p(Ω, s_f|s_0) log [ p_ρ(Ω|s_f, s_0) f^p_σ(Ω|s_0) / ( p_ρ(Ω|s_0) f^q_σ(Ω|s_f, s_0) ) ] |,
U^{VE}_{σ,2} = Σ_{s_f} p(s_f|s_0) D_KL[ p_π(·|s_f, s_0) p_ρ(·|s_f, s_0) ‖ π_q(·|s_f, s_0) p_ρ(·|s_f, s_0) ]. (19)
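The smoothing in (16) can be sketched numerically: with a small σ, the ratio of the smoothed densities evaluated at a reachable state recovers the ratio of the underlying transition probabilities. This is a 1-D toy example with illustrative values, not from the paper.

```python
import math

def gauss(x, mu, sigma):
    """Isotropic Gaussian density N(x; mu, sigma^2) for 1-D x (d = 1)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def f_smoothed(x, transition, sigma):
    """Smoothed transitional density (16): a mixture of Gaussians centred on
    the reachable next states, weighted by their transition probabilities."""
    return sum(p * gauss(x, s, sigma) for s, p in transition.items())

# Two reachable next states (at 0 and 1) with probabilities 0.5/0.5; once the
# final state is given, the transition to 0 is forced (probability 1).
support = {0.0: 0.5, 1.0: 0.5}          # p(s'|tau_t, a_t)
cond_support = {0.0: 1.0}               # p(s'|tau_t, a_t, s_f)
sigma = 0.1                             # small relative to the state spacing d_min = 1
ratio = f_smoothed(0.0, cond_support, sigma) / f_smoothed(0.0, support, sigma)
print(round(ratio, 6))  # ≈ 2.0 = 1.0 / 0.5, since the other mode is ~10 sigma away
```

This is the mechanism behind U^{VE}_{σ,1} → 0 for σ ≪ d_min: each mode of the smoothed mixture is effectively isolated, so density ratios approximate probability ratios.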
See Appendix C for the derivation. Note that U^{VE}_{σ,2} differs from the first part of U^{VE} in (13). This upper bound implies that U^{VE}_{σ,1} → 0 as f^q_σ(·|s_f, s_0)/f^p_σ(·|s_0) → p_ρ(·|s_f, s_0)/p_ρ(·|s_0) for all s_f, and that U^{VE}_{σ,2} → 0 if and only if π_q(·|s_f, s_0) → p_π(·|s_f, s_0) for all s_f, since U^{VE}_{σ,2} is a sum of positively weighted KL divergences. To estimate the correct value of the mutual information from (18), we minimize U^{VE}_{σ,1} and U^{VE}_{σ,2}. The gradient of U^{VE}_{σ,2} can be obtained as below using (3):

∇_{θ^q_π} U^{VE}_{σ,2} = -Σ_Ω p(Ω|s_0) ∇_{θ^q_π} log π_q(Ω|s_{f|Ω}, s_0), (20)

which can be estimated by the sample mean. As in Algorithm 1, it satisfies ∇_{θ^q_π} U^{VE}_{σ,2} = -∇_{θ^q_π} I^{VB}, which means that minimizing U^{VE}_{σ,2} is equivalent to maximizing I^{VB} with respect to θ^q_π. Since U^{VE}_{σ,2} is 0 if and only if π_q(·|s_f, s_0) = p_π(·|s_f, s_0) for all s_f, this update makes π_q(·|s_f, s_0) converge to p_π(·|s_f, s_0) for all s_f. Unlike U^{VE}_{σ,2}, it is difficult to directly minimize U^{VE}_{σ,1} due to the absolute value function. However, it can be minimized by estimating the correct ratio between the transitional distributions, p_ρ(·|s_f, s_0)/p_ρ(·|s_0), even though their individual true values are unknown. To estimate this ratio, we fit (17) to (16) by minimizing D_KL[f_σ(·|τ_t, a_t, s_f) ‖ f^q_σ(·|τ_t, a_t, s_f)] and D_KL[f_σ(·|τ_t, a_t) ‖ f^p_σ(·|τ_t, a_t)]. If f^q_σ and f^p_σ are parameterized by θ^q_ρ and θ^p_ρ, the gradients of the KL divergences can be obtained as follows:

∇_{θ^q_ρ} D_KL[f_σ(·|τ_t, a_t, s_f) ‖ f^q_σ(·|τ_t, a_t, s_f)] = -Σ_{x_{t+1}} f_σ(x_{t+1}|τ_t, a_t, s_f) ∇_{θ^q_ρ} log f^q_σ(x_{t+1}|τ_t, a_t, s_f),
∇_{θ^p_ρ} D_KL[f_σ(·|τ_t, a_t) ‖ f^p_σ(·|τ_t, a_t)] = -Σ_{x_{t+1}} f_σ(x_{t+1}|τ_t, a_t) ∇_{θ^p_ρ} log f^p_σ(x_{t+1}|τ_t, a_t), (21)

which can be estimated by the sample mean.
As our estimated smoothed transitional distributions (17) converge to the true smoothed transitional distributions (16), we have f^q_σ(·|s_f, s_0)/f^p_σ(·|s_0) → p_ρ(·|s_f, s_0)/p_ρ(·|s_0), which leads to U^{VE}_{σ,1} → 0 for finite T̄ and σ ≪ d_min, where T̄ is the average length of trajectories and d_min is the minimal distance between two different states (see Appendix D for details). Hence, for small enough σ and finite-length trajectories, we can minimize U^{VE}_{σ,1} to nearly zero. This implies that if we smooth the original transitional distribution with a smaller σ, the smoothed transitions will be sharper and the ratio between the transitional probabilities will be estimated more accurately. Once we have (π_q(·|s_f, s_0), f^q_σ(·|s_f, s_0), f^p_σ(·|s_0)) ≈ (p_π(·|s_f, s_0), f_σ(·|s_f, s_0), f_σ(·|s_0)) for all s_f after the updates (20) and (21), we have I^{VE}_σ ≈ I. Now we can update the policy by obtaining ∇_{θ^p_π} I^{VE}_σ using (3):

∇_{θ^p_π} I = Σ_Ω p(Ω|s_0) [ log ( p_π(Ω|s_{f|Ω}, s_0) / π_p(Ω|s_0) ) + log ( p_ρ(Ω|s_{f|Ω}, s_0) / p_ρ(Ω|s_0) ) ] ∇_{θ^p_π} log π_p(Ω|s_0),
∇_{θ^p_π} I^{VE}_σ = Σ_Ω p(Ω|s_0) [ log ( π_q(Ω|s_{f|Ω}, s_0) / π_p(Ω|s_0) ) + log ( f^q_σ(Ω|s_{f|Ω}, s_0) / f^p_σ(Ω|s_0) ) ] ∇_{θ^p_π} log π_p(Ω|s_0), (22)

where ∇_{θ^p_π} I is rewritten from (15), and the bracketed term in the second line is r^{I^{VE}_σ}_Ω. From (22), we see that ∇_{θ^p_π} I^{VE}_σ → ∇_{θ^p_π} I as π_q(·|s_f, s_0) → p_π(·|s_f, s_0) and f^q_σ(·|s_f, s_0)/f^p_σ(·|s_0) → p_ρ(·|s_f, s_0)/p_ρ(·|s_0) for all s_f, which can be achieved by (20) and (21) as explained above. Hence, we can estimate the correct gradient of the mutual information w.r.t. the policy using the estimated smoothed transitional distributions, for finite T̄ and σ ≪ d_min. However, choosing an appropriate value of σ is not straightforward, since d_min and T̄ are usually unknown and depend on the environment.
Besides, a too small σ makes the training of f^q_σ and f^p_σ unstable due to the extreme gradients of the Gaussian density near its mean. Another issue of the GMM is the choice of a proper n_gmm in (17). As explained above, (17) can perfectly fit (16) for n_gmm > max_{τ_t, a_t} |S(τ_t, a_t)|. We could choose a very large n_gmm for a perfect fit, but this makes training hard due to its complexity. We leave the proper choice of σ and n_gmm as future work and use empirically chosen values (σ = 0.25 and n_gmm = 10) in this paper. Finally, we summarize our method in Algorithm 2; the steps added to the original implicit VIC are marked with (*).

Algorithm 2 Implicit VIC with Gaussian mixture model
  Initialize s_0, η, T_train, T_smooth, θ^q_π, θ^q_ρ, θ^p_π and θ^p_ρ.
  for i_train = 1 to T_train do
    Follow π_p(a_t|τ_t), resulting in Ω = (s_0, a_0, ..., s_f).
    r^{I^{VE}_σ}_Ω ← Σ_t [ log π_q(a_t|τ_t, s_f) - log π_p(a_t|τ_t) ] from (22)
    r^{I^{VE}_σ}_Ω ← r^{I^{VE}_σ}_Ω + Σ_t [ log f^q_σ(s_{t+1}|τ_t, a_t, s_f) - log f^p_σ(s_{t+1}|τ_t, a_t) ] from (22) (*)
    Update each parameter:
      θ^p_π ← θ^p_π + η r^{I^{VE}_σ}_Ω ∇_{θ^p_π} Σ_t log π_p(a_t|τ_t) from (22)
      θ^q_π ← θ^q_π + η ∇_{θ^q_π} Σ_t log π_q(a_t|τ_t, s_f) from (20)
      ∆θ^p_ρ ← 0 (*)
      ∆θ^q_ρ ← 0 (*)
      for i_smooth = 1 to T_smooth do (*)
        Sample (z_1, z_2, ..., z_f), z_i ~ N(0, σ² I_d)
        ∆θ^p_ρ ← ∆θ^p_ρ + η ∇_{θ^p_ρ} Σ_t log f^p_σ(s_{t+1} + z_{t+1}|τ_t, a_t) from (21)
        ∆θ^q_ρ ← ∆θ^q_ρ + η ∇_{θ^q_ρ} Σ_t log f^q_σ(s_{t+1} + z_{t+1}|τ_t, a_t, s_f) from (21)
      end for
      θ^p_ρ ← θ^p_ρ + ∆θ^p_ρ / T_smooth (*)
      θ^q_ρ ← θ^q_ρ + ∆θ^q_ρ / T_smooth (*)
  end for
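The inner smoothing loop (the (*) steps) can be sketched as follows: the model density is fitted to noisy samples x = s_{t+1} + z by gradient ascent on log f_σ, averaged over T_smooth noise draws. This sketch uses a single transition, a single mixture component, and a 1-D state; all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma, lr, T_smooth = 0.25, 0.05, 64
s_next = 2.0        # observed next state s_{t+1}
mu = 0.0            # the GMM component mean to be learned (n_gmm = 1)

for _ in range(200):
    delta = 0.0
    for _ in range(T_smooth):
        x = s_next + rng.normal(0.0, sigma)   # noisy sample x = s_{t+1} + z
        # grad_mu log N(x; mu, sigma^2) = (x - mu) / sigma^2
        delta += lr * (x - mu) / sigma ** 2
    mu += delta / T_smooth                    # averaged accumulated gradient

print(round(mu, 2))  # the component mean moves close to the observed state (≈ 2.0)
```

Averaging the accumulated gradient over T_smooth draws, as in Algorithm 2, reduces the variance introduced by the smoothing noise z.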

5. EXPERIMENTS

In this section, we evaluate implicit VIC (Gregor et al., 2016), Algorithm 1 and Algorithm 2. We use an LSTM (Hochreiter & Schmidhuber, 1997) to encode τ_t = (s_0, a_0, ..., s_t) into a vector. We conduct experiments on both deterministic and stochastic environments and evaluate the results by measuring the mutual information I from samples. To measure I, we rewrite (1) using (3) as follows:

I(Ω, s_f | s_0) = -Σ_{s_f} p(s_f|s_0) log p(s_f|s_0) + Σ_{Ω, s_f} p(Ω|s_0) p(s_f|Ω, s_0) log p(s_f|Ω, s_0) = -Σ_{s_f} p(s_f|s_0) log p(s_f|s_0),

which is maximized when s_f is distributed uniformly. We estimate Î using the distribution of s_f from the samples, i.e., p̂(s_f|s_0). We apply an exponential moving average (0.99 as the smoothing factor) to an average of 5 repetitions for estimating p̂(s_f|s_0). We manually set T_max (the maximum length of a trajectory) for each experiment such that the termination action is the only available action at the T_max-th step. For training the GMM of Algorithm 2, the transitional models are trained to predict the distribution of ∆x_{t+1} = x_{t+1} - s_t instead of x_{t+1}, since predicting the difference is usually easier than predicting the whole state. For training, we have a warm-up phase that trains the baseline function b(s_0) from Gregor et al. (2016) and the transitional models. After the warm-up phase, we update the baseline function, the policy and the transitional models simultaneously. Please see Appendix F.1 for details on the hyper-parameter settings. We compare the algorithms in deterministic environments in Fig. 1. Note that although (8) is zero for deterministic environments, we still train the transitional models of Algorithms 1 and 2 to show their convergence to the optimum. Fig. 2 shows that all three algorithms rapidly achieve the maximal empowerment. We can observe that all states in the environments are visited uniformly after training, which is achieved when the mutual information is maximized.
Fig. 3 shows the training results of implicit VIC and Algorithm 2 in the 25 × 25 grid world with 4 rooms used in Gregor et al. (2016). Both implicit VIC and our Algorithm 2 successfully learn to pass the narrow doors between rooms and visit almost all reachable states for a given T_max = 25. Additional results in Mujoco (Todorov et al., 2012), showing the applicability of our Algorithm 2, are given in Appendix E. Next, we compare the algorithms in stochastic environments; please see Appendix F.2 for details on the stochasticity of the environments. Fig. 4 shows the results in simple stochastic environments. We see that while implicit VIC converges to a sub-optimum, our two algorithms achieve the maximal empowerment. In Fig. 5, implicit VIC fails to reach far rooms, whereas our Algorithm 2 reaches every room in the environment. From Figs. 4 and 5, we notice that implicit VIC fails to reach far states under stochastic dynamics. This happens because the variational bias of implicit VIC accumulates throughout a trajectory, i.e., it grows as the length of the trajectory increases.
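The empowerment measurement described above reduces to the entropy of the empirical final-state distribution. A minimal sketch (the sample data are illustrative, not the paper's measurements):

```python
import math
from collections import Counter

def empirical_empowerment(final_states):
    """Estimate I(Omega, s_f | s_0) = -sum_{s_f} p(s_f|s_0) log p(s_f|s_0)
    from sampled final states (valid for implicit options, where
    H(s_f | Omega, s_0) = 0)."""
    counts = Counter(final_states)
    n = len(final_states)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Uniform visits over 4 final states attain the maximum log 4; a skewed
# distribution (an agent stuck near its start state) scores far lower.
uniform = ["A", "B", "C", "D"] * 25
skewed = ["A"] * 97 + ["B", "C", "D"]
print(round(empirical_empowerment(uniform), 4))  # log 4 ≈ 1.3863
print(round(empirical_empowerment(skewed), 4))   # well below log 4
```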

6. CONCLUSION

In this work, we revisited variational intrinsic control (VIC) proposed by Gregor et al. (2016). We showed that for VIC with implicit options, the environmental stochasticity induces a variational bias in the intrinsic reward, leading to convergence to sub-optima. To reduce this bias and achieve maximal empowerment, we proposed to model the environment dynamics using either the transitional probability model or the Gaussian mixture model. Evaluations on stochastic environments demonstrated the superiority of our methods over the original VIC algorithm with implicit options.

A DERIVATION OF THE INTRINSIC REWARD

The gradient of the variational estimate can be written as

∇_θ Σ_{Ω, s_f} p_θ(Ω, s_f) [ log q(Ω|s_f) - log p_θ(Ω) ] = Σ_Ω p_θ(Ω) r^{I^{VE}}_Ω ∇_θ log p_θ(Ω),

where r^{I^{VE}}_Ω = log q(Ω|s_{f|Ω}) - log p_θ(Ω). Using (4) and (5), we obtain

r^{I^{VE}}_Ω = Σ_{(τ_t, a_t, s_{t+1}) ∈ Ω} log [ q_φ(a_t|τ_t, s_{f|Ω}) / p_θ(a_t|τ_t) ] = r^{I^{VB}}_Ω.

B DERIVATION OF U^{VE}

Here we derive U^{VE} from |I - I^{VE}|, with s_0 omitted for simplicity:

|I - I^{VE}| = | Σ_{Ω, s_f} p(Ω, s_f) [ log ( p_π(Ω|s_f) p_ρ(Ω|s_f) / ( π_q(Ω|s_f) ρ_q(Ω|s_f) ) ) - log ( p_π(Ω) p_ρ(Ω) / ( π_p(Ω) ρ_p(Ω) ) ) ] |
≤ | Σ_{Ω, s_f} p(s_f) p(Ω|s_f) log ( p_π(Ω|s_f) p_ρ(Ω|s_f) / ( π_q(Ω|s_f) ρ_q(Ω|s_f) ) ) | + | Σ_{Ω, s_f} p(Ω) p(s_f|Ω) log ( p_π(Ω) p_ρ(Ω) / ( π_p(Ω) ρ_p(Ω) ) ) |.

Using (3), (9) and (10), we obtain

|I - I^{VE}| ≤ Σ_{Ω, s_f} p(s_f) p(Ω|s_f) log ( p_π(Ω|s_f) p_ρ(Ω|s_f) / ( π_q(Ω|s_f) ρ_q(Ω|s_f) ) ) + Σ_Ω p(Ω) log ( p_π(Ω) p_ρ(Ω) / ( π_p(Ω) ρ_p(Ω) ) ) (∵ (3))
= Σ_{s_f} p(s_f) D_KL[ p_π(·|s_f) p_ρ(·|s_f) ‖ π_q(·|s_f) ρ_q(·|s_f) ] + D_KL[ p_π(·) p_ρ(·) ‖ π_p(·) ρ_p(·) ] = U^{VE}.

C DERIVATION OF U^{VE}_{σ,1} AND U^{VE}_{σ,2}

Here we derive (19), again omitting s_0 for simplicity. We start from I - I^{VE}_σ:

I - I^{VE}_σ = Σ_{Ω, s_f} p(Ω, s_f) [ log ( p_π(Ω|s_f) p_ρ(Ω|s_f) / ( π_q(Ω|s_f) f^q_σ(Ω|s_f) ) ) - log ( p_π(Ω) p_ρ(Ω) / ( π_p(Ω) f^p_σ(Ω) ) ) ]
= Σ_{Ω, s_f} p(Ω, s_f) [ log ( p_π(Ω|s_f) p_ρ(Ω|s_f) / ( π_q(Ω|s_f) f^q_σ(Ω|s_f) ) ) - log ( p_ρ(Ω) / f^p_σ(Ω) ) ] (∵ π_p(Ω) = p_π(Ω))
= Σ_{Ω, s_f} p(Ω, s_f) [ log ( p_ρ(Ω|s_f) f^p_σ(Ω) / ( p_ρ(Ω) f^q_σ(Ω|s_f) ) ) + log ( p_π(Ω|s_f) / π_q(Ω|s_f) ) ].

Taking the absolute value and applying the triangle inequality yields U^{VE}_{σ,1} and U^{VE}_{σ,2} in (19), with T_max = max_Ω T_Ω and T̄ = Σ_{Ω, s_f} p(Ω, s_f|s_0) T_Ω. This implies that the estimation error of the mutual information with smoothing converges to 0 for finite T_max or T̄ as σ → 0. Also, for finite T̄ and σ ≪ d_min, it satisfies |I - I^{VE}_σ| ≈ 0.



(a) Deterministic 1D world. (b) Deterministic 2D world. (c) Deterministic tree world.

Figure 1: Deterministic environments. Fig. 1a shows the deterministic 1D world, in which the agent can go left and right. Fig. 1b shows the deterministic 2D world, in which the agent can go left, up, right and down. Fig. 1c shows the deterministic tree world, in which the agent can go left and right.

Figure 2: Estimated empowerment during the training in deterministic environments. Fig. 2a shows the deterministic 1D world and its training results; the agent can go left and right. Fig. 2b shows the deterministic 2D world and its training results; the agent can go left, up, right and down. Fig. 2c shows the deterministic tree world and its training results; the agent can go left and right. Green shows the distribution of s_f.

Figure 3: Deterministic grid world with 4 rooms. The environment is a 25 × 25 grid world with 4 rooms. The agent can go left, up, right and down. The agent starts from (4, 4) and (10, 4) with T max = 25. Green shows the distribution of s f .

Fig. 5 also shows the results of training with external rewards. The same mixed reward as in Gregor et al. (2016) is used, r = r^I + α r^E with α = 30. For training the random agent, only α r^E is used, and an entropy loss is applied for exploration. While the random agent and implicit VIC converge to a sub-optimum, our Algorithm 2 achieves the optimal solution.

(a) Stochastic 1D world with Tmax = 5. Algorithm 1 and 2 achieve ∼196% empowerment gain compared to implicit VIC. (b) Stochastic 2D world with Tmax = 5. Algorithm 1 and 2 achieve ∼142% empowerment gain compared to implicit VIC. (c) Stochastic tree world with Tmax = 4. Algorithm 1 and 2 achieve ∼406% empowerment gain compared to implicit VIC.

Figure 4: Estimated empowerment during the training in stochastic environments. The environments are equal to Fig. 1 except for their stochasticity. Green shows the distribution of s f .

Figure 5: Stochastic grid world with 35 rooms. The environment is a 15 × 15 grid world with 35 rooms (black cells surrounded by gray walls). We set T_max = 25. The agent can go up, down and right, and starts from (0, 6). Once the agent enters a room, the environment returns done and only the final action becomes available. Green shows the distribution of s_f. The external reward is composed of -0.1 for every time step, +1 for entering a normal room and +100 for entering the special room. The sub-optimal solution is reaching the closest room; the optimal solution is trying to reach the special room (the room with the red box), while entering the closest normal room as soon as possible when the special room becomes unreachable due to the stochastic transitions.

Since implicit VIC defines the option Ω as the trajectory of an agent, i.e., the sequence of states and actions Ω = (s_0, a_0, s_1, a_1, ..., s_{T-1}, a_{T-1} = a_f, s_T = s_f = s_{T-1}), the probability of an option can be decomposed as

p(Ω|s_{f|Ω}, s_0) = Π_{(τ_t, a_t, s_{t+1}) ∈ Ω} p_θ(a_t|τ_t, s_{f|Ω}) p_θ(s_{t+1}|τ_t, a_t, s_{f|Ω}),
p(Ω|s_0) = Π_{(τ_t, a_t, s_{t+1}) ∈ Ω} p_θ(a_t|τ_t) p(s_{t+1}|τ_t, a_t).

Since p_θ(a_t|τ_t, s_{f|Ω}) is intractable, we may replace it with the variational inference model q_φ(a_t|τ_t, s_{f|Ω}). In a deterministic environment, we have p_θ(s_{t+1}|τ_t, a_t, s_{f|Ω}) = p(s_{t+1}|τ_t, a_t) = 1, so the transitional terms cancel in the intrinsic reward.

D DERIVATION OF THE BOUNDS ON |I - I^{VE}_σ|

Here we derive the bounds on the estimation error of the mutual information with smoothing. First, we derive bounds on f_σ(s_{t+1}|τ_t, a_t). From

f_σ(s_{t+1}|τ_t, a_t) = Σ_{s' ∈ S(τ_t, a_t)} p(s'|τ_t, a_t) f_σ(s_{t+1} - s'; 0, σ² I_d),

we obtain

p(s_{t+1}|τ_t, a_t) ≤ √(2πσ²)^d f_σ(s_{t+1}|τ_t, a_t) ≤ p(s_{t+1}|τ_t, a_t) + exp(-d²_min / (2σ²)). (24)

Similarly, we can obtain the bounds on f_σ(s_{t+1}|τ_t, a_t, s_f):

p(s_{t+1}|τ_t, a_t, s_f) ≤ √(2πσ²)^d f_σ(s_{t+1}|τ_t, a_t, s_f) ≤ p(s_{t+1}|τ_t, a_t, s_f) + exp(-d²_min / (2σ²)). (25)

Combining (24) and (25), we obtain

p(s_{t+1}|τ_t, a_t, s_f) / [ p(s_{t+1}|τ_t, a_t) + exp(-d²_min / (2σ²)) ] ≤ f_σ(s_{t+1}|τ_t, a_t, s_f) / f_σ(s_{t+1}|τ_t, a_t) ≤ [ p(s_{t+1}|τ_t, a_t, s_f) + exp(-d²_min / (2σ²)) ] / p(s_{t+1}|τ_t, a_t). (26)

Taking the logarithm of (26) and using log(a + b) ≤ log a + b/a for a, b > 0, we can bound the difference | log ( p(s_{t+1}|τ_t, a_t, s_f) / p(s_{t+1}|τ_t, a_t) ) - log ( f_σ(s_{t+1}|τ_t, a_t, s_f) / f_σ(s_{t+1}|τ_t, a_t) ) | by terms that vanish as σ → 0.

ACKNOWLEDGMENTS

This research is conducted with the support of NC. We thank Seong Hun Lee at University of Zaragoza for his sincere feedback on our work, Yujeong Lee at KL Partners for her encouragement and Seungeun Rho at Seoul National University, Jinyun Chung, Yongchan Park, Hyunsoo Park, Sangbin Moon, Inseok Oh, Seongho Son and Minkyu Yang at NC for their useful comments.

E ADDITIONAL EXPERIMENT RESULTS

Here we show additional experimental results in HalfCheetah-v3 from the Mujoco environments (Todorov et al., 2012) to demonstrate the applicability of our Algorithm 2. We expect both implicit VIC and our Algorithm 2 to show results similar to Fig. 1 and Fig. 2, since it is a deterministic environment. We fixed the length of the trajectory to T = 100 to force the agent to learn sufficiently long trajectories. Each motor's action space is quantized into 5 actions. We observe that exciting movements (especially triple backflips) are learned by Algorithm 2. Another exciting fact is that, since the number of options it can learn is unlimited, it keeps showing new behaviors as learning progresses.

F EXPERIMENTAL DETAILS

Here we specify experimental details and environment details.

F.1 HYPER-PARAMETERS

Here we show the hyper-parameters used in our experiments. We used a learning rate of 1e-3 for Fig. 2 and Fig. 4, and 1e-4 otherwise. We also specify the transition tables used in this paper.

