VARIATIONAL INTRINSIC CONTROL REVISITED

Abstract

In this paper, we revisit variational intrinsic control (VIC), an unsupervised reinforcement learning method for finding the largest set of intrinsic options available to an agent. In the original work by Gregor et al. (2016), two VIC algorithms were proposed: one that represents the options explicitly, and one that does so implicitly. We show that the intrinsic reward used in the latter is subject to bias in stochastic environments, causing convergence to suboptimal solutions. To correct this behavior and achieve maximal empowerment, we propose two methods based respectively on a transitional probability model and a Gaussian mixture model. We substantiate our claims through rigorous mathematical derivations and experimental analyses.

1. INTRODUCTION

Variational intrinsic control (VIC), proposed by Gregor et al. (2016), is an unsupervised reinforcement learning algorithm that aims to discover as many intrinsic options as possible, i.e., policies with a termination condition that meaningfully affect the world. The main idea of VIC is to maximize the mutual information between the set of options and the final states, called empowerment. Maximal empowerment is desirable because it maximizes the information about the final states that the agent can achieve with the available options. These options are independent of the extrinsic reward of the environment, so they can be considered the agent's universal knowledge about the environment. The concept of empowerment was introduced in (Klyubin et al., 2005; Salge et al., 2014) along with methods for measuring it based on Expectation Maximization (Arimoto, 1972; Blahut, 1972). These works defined the option as a sequence of a fixed number of actions. Yeung (2008) proposed to maximize the empowerment using the Blahut-Arimoto (BA) algorithm, but its complexity increases exponentially with the sequence length, rendering it impractical for high-dimensional and long-horizon options. Mohamed & Rezende (2015) adopted techniques from deep learning and variational inference (Barber & Agakov, 2003) and successfully applied empowerment maximization to high-dimensional and long-horizon control. However, this method maximizes the empowerment over open-loop options, meaning that the sequence of actions is chosen in advance and executed regardless of the (potentially stochastic) environment dynamics. This often impairs performance, as the agent cannot properly react to the environment, leading to a significant underestimation of empowerment (Gregor et al., 2016). To overcome this limitation, Gregor et al. (2016) proposed to use closed-loop options, meaning that the sequence of actions is chosen considering the visited states.
These options differ from those in Klyubin et al. (2005), Salge et al. (2014), and Mohamed & Rezende (2015) in that they have a termination condition instead of a fixed number of actions. Gregor et al. (2016) presented two algorithms: VIC with explicit options and VIC with implicit options (we will call them explicit and implicit VIC from here on). Explicit VIC defines a fixed number of options, and each option is sampled at the beginning of the trajectory, conditioning the policies of the agent until termination. In other words, both the state and the sampled option are inputs to the policy function of the agent. One clear limitation of explicit VIC is that it requires a preset number of options. This limitation applies not only to explicit VIC but also to some recent unsupervised learning algorithms that adopt a discrete option or skill with a predefined set (Machado et al., 2017; Eysenbach et al., 2018). Obviously, presetting the number of options limits the number of options that an agent can learn, impeding the maximal level of empowerment. Moreover, choosing a proper number of options is not straightforward, since the maximum of the objective for a given number of options depends on several unknown environmental factors such as the cardinality of the state space and the transitional model. To overcome this issue, Gregor et al. (2016) proposed implicit VIC, which defines the option as the trajectory (i.e., the sequence of states and actions) until termination. Multiple trajectories can lead to the same final state, and implicit VIC learns to maximize the mutual information between the final state and the trajectory by controlling the latter. As a result, the number of options is no longer limited by a preset number, and neither is the empowerment. Despite this advantage, however, these implicit options induce bias in the intrinsic reward and hinder implicit VIC from achieving maximal empowerment.
This effect grows with the stochasticity of the environment, and it may cause serious degradation of empowerment, preventing the agent from learning the universal knowledge of the environment. In this paper, we aim to solve this empowerment degradation of implicit VIC under stochastic dynamics. To this end, we revisit variational intrinsic control and make the following contributions:
1. We show that the intrinsic reward in implicit VIC suffers from a variational bias in stochastic environments, causing convergence to suboptimal solutions (Section 2).
2. To compensate for this bias and achieve maximal empowerment, we suggest two modifications of implicit VIC: environment dynamics modeling with a transitional probability model (Section 3) and with a Gaussian mixture model (Section 4).

2. VARIATIONAL BIAS OF IMPLICIT VIC IN STOCHASTIC ENVIRONMENTS

In this section, we derive the variational bias of implicit VIC under stochastic environment dynamics. First, we adopt the definitions of the termination action and the final state from Gregor et al. (2016) and briefly review VIC. The termination action terminates the option and yields the final state s_f = s_t independently of the environmental action space. VIC aims to maximize the empowerment, i.e., the mutual information between the option Ω and the final state s_f, which can be written as follows:

$$I(\Omega, s_f \mid s_0) = -\sum_{\Omega} p(\Omega \mid s_0) \log p(\Omega \mid s_0) + \sum_{\Omega, s_f} p(s_f \mid s_0, \Omega)\, p(\Omega \mid s_0) \log p(\Omega \mid s_0, s_f).$$

Since p(Ω|s_0, s_f) is intractable, VIC (Gregor et al., 2016) derives the variational bound $I^{VB} \le I$ and maximizes it instead:

$$I^{VB}(\Omega, s_f \mid s_0) = -\sum_{\Omega} p(\Omega \mid s_0) \log p(\Omega \mid s_0) + \sum_{\Omega, s_f} p(s_f \mid s_0, \Omega)\, p(\Omega \mid s_0) \log q(\Omega \mid s_0, s_f),$$

where q(Ω|s_0, s_f) is the inference model to be trained. When $I^{VB}$ is maximized, we have q(Ω|s_0, s_f) = p(Ω|s_0, s_f) and achieve the maximum I. As explained in Section 1, explicit VIC samples an explicit option at the beginning of a trajectory, and this option conditions the policy as π(a|s, Ω) until termination. Due to the randomness of the policy, the final state is undetermined for a given option until the policy converges to achieve a specific option. Unlike explicit VIC, implicit VIC defines its option as a trajectory until termination, and hence the final state is determined for a given option. This can be expressed as

$$p(s_f \mid \Omega, s_0) = \begin{cases} 1, & \text{if } s_f = s_{f|\Omega} \\ 0, & \text{otherwise,} \end{cases}$$

where $s_{f|\Omega}$ is the final state of an option Ω. This is a key characteristic of implicit VIC and is essential for deriving the main results of this paper. We will often use this equation to eliminate s_f for reduction. Interestingly, this makes a difference in empowerment maximization between explicit and implicit VIC, which can be explained by rewriting I(Ω, s_f|s_0) as follows:

$$I(\Omega, s_f \mid s_0) = H(s_f \mid s_0) - H(s_f \mid \Omega, s_0).$$
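As a sanity check, the relation between the mutual information and its variational bound can be verified numerically on a toy discrete example. All probabilities below are hypothetical and chosen purely for illustration; the bound is tight exactly when the inference model q equals the true posterior p(Ω|s_0, s_f):

```python
import numpy as np

# Toy setup (hypothetical numbers): two options, two final states.
p_omega = np.array([0.5, 0.5])              # p(Omega | s_0)
p_sf_given_omega = np.array([[0.9, 0.1],    # p(s_f | s_0, Omega=0)
                             [0.2, 0.8]])   # p(s_f | s_0, Omega=1)

# Joint distribution and true posterior over options.
joint = p_omega[:, None] * p_sf_given_omega  # p(Omega, s_f | s_0)
p_sf = joint.sum(axis=0)                     # p(s_f | s_0)
posterior = joint / p_sf                     # p(Omega | s_0, s_f)

# True mutual information I(Omega, s_f | s_0).
H_omega = -np.sum(p_omega * np.log(p_omega))
I = H_omega + np.sum(joint * np.log(posterior))

# Variational bound with an imperfect inference model q(Omega | s_0, s_f).
q = np.array([[0.8, 0.3],
              [0.2, 0.7]])                   # each column sums to 1 over Omega
I_VB = H_omega + np.sum(joint * np.log(q))

print(I, I_VB)  # I_VB <= I, with equality iff q equals the posterior
```

Replacing q with the computed posterior recovers I exactly, which is the fixed point the inference model is trained towards.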
Note that H(s_f|Ω, s_0) is 0 for implicit VIC since s_f is determined for a given Ω. One can notice that to maximize empowerment, the agent needs to learn (1) maximizing H(s_f|s_0) and (2) minimizing H(s_f|Ω, s_0). While explicit VIC needs to learn both (1) and (2), implicit VIC needs to learn only (1), since (2) is already achieved by the definition of the option. This makes the learning of implicit VIC easier and faster. Moreover, implicit VIC is scalable, as explained in Section 1. Despite these strengths, implicit VIC suffers from a variational bias in the intrinsic reward under stochastic environments that can seriously degrade the empowerment. We derive this variational bias by decomposing p(Ω|s_0) and p(Ω|s_f, s_0). Since implicit VIC defines the option Ω as the trajectory of an agent, i.e.,
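The entropy decomposition can be illustrated with a minimal sketch (hypothetical option distribution): when each option deterministically fixes its final state, as implicit VIC's trajectory-options do, H(s_f|Ω, s_0) vanishes and the empowerment reduces to H(s_f|s_0):

```python
import numpy as np

# Minimal sketch (hypothetical numbers): four trajectory-options, each
# deterministically ending in one of two final states, as in implicit VIC.
p_omega = np.array([0.25, 0.25, 0.25, 0.25])  # p(Omega | s_0)
final_state = np.array([0, 0, 1, 1])          # s_f|Omega: deterministic map

# Marginal p(s_f | s_0): sum over options sharing the same final state.
p_sf = np.array([p_omega[final_state == s].sum() for s in (0, 1)])

H_sf = -np.sum(p_sf * np.log(p_sf))  # H(s_f | s_0)
H_sf_given_omega = 0.0               # zero: s_f is determined by Omega

I = H_sf - H_sf_given_omega          # I(Omega, s_f | s_0) = H(s_f | s_0)
```

With two equally likely final states, the empowerment here equals log 2, so maximizing I for implicit VIC amounts entirely to spreading probability mass over distinct final states.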

