EFFICIENT EMPOWERMENT ESTIMATION FOR UNSUPERVISED STABILIZATION

Abstract

Intrinsically motivated artificial agents learn advantageous behavior without externally-provided rewards. Previously, it was shown that maximizing mutual information between agent actuators and future states, known as the empowerment principle, enables unsupervised stabilization of dynamical systems at upright positions, which is a prototypical intrinsically motivated behavior for upright standing and walking. This follows from the coincidence between the objective of stabilization and the objective of empowerment. Unfortunately, sample-based estimation of this kind of mutual information is challenging. Recently, various variational lower bounds (VLBs) on empowerment have been proposed as solutions; however, they are often biased, unstable in training, and have high sample complexity. In this work, we propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel, which allows us to efficiently calculate an unbiased estimator of empowerment by convex optimization. We demonstrate our solution for sample-based unsupervised stabilization on different dynamical control systems and show the advantages of our method by comparing it to the existing VLB approaches. Specifically, we show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images. Consequently, our method opens a path to wider and easier adoption of empowerment for various applications.

1. INTRODUCTION

Intrinsic motivation allows artificial and biological agents to acquire useful behaviors without external knowledge (Barto et al. (2004); Chentanez et al. (2005); Schmidhuber (2010); Barto (2013); Oudeyer et al. (2016)). In the framework of reinforcement learning (RL), this external knowledge is usually provided by an expert through a task-specific reward, which is optimized by an artificial agent towards a desired behavior (Mnih et al. (2013); Schulman et al. (2017)). In contrast, an intrinsic reward can arise solely from the interaction between the agent and environment, which eliminates the need for domain knowledge and reward engineering in some settings (Mohamed & Rezende (2015); Houthooft et al. (2017); Pathak et al. (2017)). Previously, it was shown that maximizing mutual information (Cover & Thomas (2012)) between an agent's actuators and sensors can guide the agent towards states in the environment with higher potential to achieve a larger number of eventual goals (Klyubin et al. (2005); Wissner-Gross & Freer (2013)). Maximizing this kind of mutual information is known as the empowerment principle (Klyubin et al. (2005); Salge et al. (2014)). It was also found that an agent maximizing its empowerment converges to an unstable equilibrium of the environment in various dynamical control systems (Jung et al. (2011); Salge et al. (2013); Karl et al. (2019)). In this application, ascending the gradient of the empowerment function coincides with the objective of optimal control for stabilization at an unstable equilibrium (Strogatz (2018)), which is an important task for both engineering (Todorov (2006)) and biological systems; we refer to this as the essential property of empowerment. It follows from the aforementioned prior works that a plausible estimate of the empowerment function should possess this essential property.
Empowerment has been found to be useful for a broad spectrum of applications, including: unsupervised skill discovery (Sharma et al. (2020); Eysenbach et al. (2019); Gregor et al. (2017); Karl et al. (2019); Campos et al. (2020)); human-agent coordination (Salge & Polani (2017); Guckelsberger et al. (2016)); assistance (Du et al. (2021)); and stabilization (Tiomkin et al. (2017)). Past work has utilized variational lower bounds (VLBs) (Poole et al. (2019); Alemi et al. (2017); Tschannen et al. (2020); Mohamed & Rezende (2015)) to obtain an estimate of empowerment. However, VLB approaches to empowerment in dynamical control systems (Sharma et al. (2020); Achiam et al. (2018)) have high sample complexity, are often unstable in training, and may be biased. Moreover, it was not previously studied whether empowerment estimators learned via VLBs possess the essential properties of empowerment. In this work, we introduce a new method, Latent Gaussian Channel Empowerment (Latent-GCE), for empowerment estimation, and utilize the above-mentioned property as an "indicator" of the quality of the estimation. Specifically, we propose a particular representation of dynamical control systems using deep neural networks which is learned from state-action trajectories. This representation admits an efficient estimation of empowerment by convex optimization (Cover & Thomas (2012)), both from raw states and from images. We propose an algorithm for simultaneous estimation and maximization of empowerment using standard RL algorithms such as Proximal Policy Optimization (Schulman et al. (2017)) and Soft Actor-Critic (Haarnoja et al. (2018)). We test our method on the task of unsupervised stabilization of dynamical systems with a solely intrinsic reward, showing that our estimator exhibits the essential properties of the empowerment function.
We demonstrate the advantages of our method through comparisons to the existing state-of-the-art empowerment estimators in different dynamical systems from the OpenAI Gym simulator (Brockman et al. (2016)). We find that our method (i) has a lower sample complexity, (ii) is more stable in training, (iii) possesses the essential properties of the empowerment function, and (iv) allows us to accurately estimate empowerment from images. We hope such a review of the existing methods for empowerment estimation will help push this research direction forward.

2. PRELIMINARIES

In this section, we review the necessary background for our method, consisting of the reinforcement learning setting, various empowerment estimators, and the Gaussian channel capacity. We also review the underlying components in relevant prior work which we use for comparison to our method.

2.1. REINFORCEMENT LEARNING

The reinforcement learning (RL) setting is modeled as an infinite-horizon Markov Decision Process (MDP) defined by: the state space S, the action space A, the transition probabilities p(s' | s, a), the initial state distribution p_0(s), the reward function r(s, a) ∈ R, and the discount factor γ. The goal of RL is to find an optimal control policy π(a | s) that maximizes the expected return, i.e. max_π E_{s_0∼p_0, a_t∼π, s_{t+1}∼p} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ] (Sutton & Barto (2018)).
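The discounted objective above can be evaluated on a sampled finite trajectory; a minimal sketch (the function name and interface are ours, for illustration only):

```python
def discounted_return(rewards, gamma=0.99):
    """Finite-horizon estimate of the RL objective E[sum_t gamma^t r(s_t, a_t)]."""
    # Accumulate backwards: G_t = r_t + gamma * G_{t+1}.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, `discounted_return([1.0, 1.0, 1.0], gamma=0.5)` evaluates the truncated sum 1 + 0.5 + 0.25.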

2.2. EMPOWERMENT

Interaction between an agent and its environment is plausibly described by the perception-action cycle (PAC) (Tishby & Polani (2011)), where the agent observes the state of the environment via its sensors and responds with its actuators. The maximal information rate from actuators to sensors is an inherent property of the PAC which characterizes the empowerment of the agent, formally defined below. Empowerment (Klyubin et al. (2005)) is defined as the maximal mutual information rate (Cover & Thomas (2012)) between the agent's sensor observations o ∈ O and actuators a ∈ A given the current state s ∈ S. It is fully specified by a fixed probability distribution of sensor observations conditioned on the actuators and the state, p(O | A, s), denoted the "channel", and by a free probability distribution of the actuator signals conditioned on the state, ω(A | s), denoted the "source". The empowerment E is the capacity of the channel, which is found via optimization of the source distribution ω(A | s):

E(s) = max_{ω(A|s)} I[O; A | s] = max_{ω(A|s)} Σ_{O,A} p(O, A | s) log [ p(A | O, s) / ω(A | s) ],   (1)

where I[O; A | s] is the mutual information, reflecting the difference between the entropy of the sensor inputs, H(O | s), and the corresponding conditional entropy, H(O | A, s). In this work, we follow the original empowerment formulation (Klyubin et al. (2005)), where A is an action sequence of length T starting from the state s and O is the resulting observation sequence. However, our method is applicable to other choices of observation and control sequences.

Remark 1. The source distribution in Equation 1, ω*(A | s), is used only for the estimation of empowerment. Another policy, π(a | s), is calculated using empowerment as an intrinsic reward by either reinforcement learning (Karl et al. (2019)) or a 1-step greedy algorithm (Jung et al. (2011); Salge et al. (2013)) to reach the states with maximal empowerment.
Strictly speaking, the source distribution ω maximizes the mutual information, (i.e. estimates the empowerment), while the policy distribution π maximizes expected accumulated empowerment along trajectories. In general, the capacity of an arbitrary channel for a continuous random variable is unknown except for a few special cases such as the Gaussian linear channel, as reviewed below. As a result, past works have largely relied on variational lower bounds instead.
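For a discrete memoryless channel, the capacity in Equation 1 can be computed exactly by the classic Blahut-Arimoto iteration, which illustrates why the discrete case is tractable while general continuous channels are not. A minimal sketch (our illustration, not part of the paper's method; assumes strictly positive channel entries):

```python
import numpy as np

def blahut_arimoto(p_o_given_a, iters=200):
    """Capacity (in nats) of a discrete memoryless channel via Blahut-Arimoto.

    p_o_given_a: array of shape (n_actions, n_obs), each row a distribution.
    Returns (capacity, optimal source distribution omega over actions).
    """
    n_a = p_o_given_a.shape[0]
    omega = np.full(n_a, 1.0 / n_a)              # uniform initial source
    for _ in range(iters):
        p_o = omega @ p_o_given_a                # marginal over observations
        # Exponentiated KL divergence of each row from the marginal.
        kl = np.sum(p_o_given_a * np.log(p_o_given_a / p_o), axis=1)
        c = np.exp(kl)
        omega = omega * c / np.sum(omega * c)    # multiplicative update
    p_o = omega @ p_o_given_a
    capacity = np.sum(
        omega * np.sum(p_o_given_a * np.log(p_o_given_a / p_o), axis=1))
    return capacity, omega
```

For a binary symmetric channel with crossover probability 0.1, this recovers the known capacity log 2 − H(0.1) ≈ 0.368 nats with a uniform optimal source.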

2.2.1. EMPOWERMENT BY VARIATIONAL LOWER BOUNDS

The mutual information in Equation 1 can be bounded from below as follows (Mohamed & Rezende (2015); Gregor et al. (2017); Karl et al. (2019); Poole et al. (2019)):

Σ_{O,A} p(O, A | s) log [ p(A | O, s) / ω(A | s) ] = Σ_{O,A} p(O, A | s) log [ q(A | O, s) / ω(A | s) ] + D_KL[ p(A | O, s) || q(A | O, s) ] ≥ Σ_{O,A} p(O, A | s) log [ q(A | O, s) / ω(A | s) ],   (2)

where q(A | O, s) and ω(A | s) are represented by neural networks with parameters φ and ψ, respectively. The lower bound in Equation 2 can be estimated from samples using reinforcement learning with the intrinsic reward r(s_t, a_t) = Σ_{t=1}^{T} [ log q_φ(a_t | o_t, s) − log ω_ψ(a_t | s_t) ], as detailed in (Gregor et al. (2017); Karl et al. (2019)). To apply this variational lower bound on empowerment to unsupervised stabilization of dynamical systems, one needs to learn three distributions: q_φ(a_t | o_t, s) and ω_ψ(a_t | s_t) for the lower bound in Equation 2, and π(a | s) as the control policy (see Remark 1). Developing the mutual information in Equation 1 in the opposite direction, one gets another lower bound on empowerment, an approximation of which was used in (Sharma et al. (2020)):

Σ_{O,A} p(O, A | s) log [ p(O | A, s) / ω(O | s) ] = Σ_{O,A} p(O, A | s) log [ q(O | A, s) / ω(O | s) ] + D_KL[ p(O | A, s) || q(O | A, s) ] ≥ Σ_{O,A} p(O, A | s) log [ q(O | A, s) / ω(O | s) ].   (3)

For clarity of exposition, this variational empowerment scheme is depicted schematically in Figure 1. The controller policy π_θ(a | s) collects transitions (s, a, s') in the environment. The source policy ω_ψ(a | s) generates the action a at every state s visited by the controller policy, and the environment transitions to the next state s' (yellow block). The inverse-channel policy q_φ(a | s, s') is regressed to a given s and s', generating the intrinsic reward C = log q_φ(a | s, s') − log ω_ψ(a | s), which is used to improve the controller policy via reinforcement learning.
This scheme aims to drive the intrinsically motivated agent to the state of maximal empowerment in the environment.
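As a toy illustration of the per-step VLB summand log q_φ − log ω_ψ used as the intrinsic reward above, consider 1-D Gaussian parameterizations of q_φ and ω_ψ (the paper uses neural networks; the closed-form densities here are stand-ins, and all names are ours):

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log-density of a 1-D Gaussian N(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def vlb_intrinsic_reward(a, mu_q, sigma_q, mu_w, sigma_w):
    """Per-step VLB reward: log q_phi(a | s, s') - log omega_psi(a | s).

    mu_q, sigma_q: inverse-channel prediction of the action given (s, s').
    mu_w, sigma_w: source distribution over the action at s.
    """
    return gaussian_logpdf(a, mu_q, sigma_q) - gaussian_logpdf(a, mu_w, sigma_w)
```

When the inverse channel q_φ is no sharper than the source ω_ψ, the reward is zero in expectation; a sharper (more confident) q_φ around the taken action yields a positive reward, consistent with a larger mutual-information estimate.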

2.2.2. EMPOWERMENT FOR GAUSSIAN CHANNELS

The Gaussian linear channel in our case is given by:

O = G(s) · A + η,   (4)

where O and A are the observation and actuator sequences with dimensions d_O × 1 and d_A × 1, respectively; G(s) is a matrix with dimensions d_O × d_A; and η is multivariate Gaussian noise with dimension d_O × 1. Following Salge et al. (2013), we assume the noise is distributed as a Gaussian. For completeness, Appendix A provides the analytical derivation of G(s) in the case of known dynamics for the pendulum with an action sequence A of length 3. In contrast, the current work proposes a method to estimate G(s) from samples for arbitrary dynamics and an arbitrary length of the action sequence. The details of the calculation of the Gaussian channel capacity are included in Appendix B. The key observation in our method is that we can learn G(s) from samples, which allows us to estimate empowerment efficiently for arbitrary dynamics, including high-dimensional dynamical systems or images (pixels). As a result, it is not required to resort to lower bounds as done in prior work. In Section 4, we describe how to learn G(s) from samples using deep neural networks.
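The capacity computation deferred to Appendix B reduces to an SVD of G(s) followed by water-filling over the singular values. A minimal sketch (the function name is ours; whether σ_i denotes the singular value or its square depends on the noise normalization, which we treat as an assumption and follow the paper's notation):

```python
import numpy as np

def gaussian_capacity(G, total_power=1.0):
    """Capacity of O = G(s)A + eta via SVD plus water-filling.

    Computes E(s) = 0.5 * sum_i log(1 + sigma_i * p_i) with the allocation
    p_i = max(1/nu - 1/sigma_i, 0), sum_i p_i = P (Equations 11-12 in the
    text; sigma_i taken directly as the per-subchannel gain).
    """
    sigma = np.linalg.svd(G, compute_uv=False)
    sigma = sigma[sigma > 1e-12]                 # keep nonzero singular values
    # Bisection on the water level 1/nu so that the powers sum to P.
    lo, hi = 0.0, total_power + 1.0 / sigma.min()
    for _ in range(100):
        level = 0.5 * (lo + hi)
        p = np.maximum(level - 1.0 / sigma, 0.0)
        if p.sum() > total_power:
            hi = level
        else:
            lo = level
    p = np.maximum(0.5 * (lo + hi) - 1.0 / sigma, 0.0)
    return 0.5 * np.sum(np.log1p(sigma * p))
```

For G = I with unit power, all power goes to the single subchannel and the capacity is 0.5·log 2, matching the standard water-filling solution.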

3. RELATED WORK

In this section, we review the relevant prior work and state of the art in empowerment estimation, emphasizing the main differences between these existing works and our proposed approach. Estimation via Gaussian channel. The representation given by Equation 4 was previously considered in (Salge et al. (2013)), where the channel p(O | A, s) was assumed to be known. In our current work, we remove this assumption and extend the applicability of the efficient channel capacity solution given by Equation 11 to arbitrary observations, including images. Variational approaches to estimation. The variational approaches of (Gregor et al. (2017)) precede that of (Karl et al. (2019)), and we also experimentally find the latter to be more stable. Consequently, in Section 5 we provide comparisons to Karl et al. (2019), which represents the current state of the art in this line of research. Skill discovery. An approximation of the lower bound in Equation 3 is successfully used by Sharma et al. (2020) in the context of learning a set of skills. Another relevant and important work is Eysenbach et al. (2019). Citing the authors, their work is about "maximizing the empowerment of a hierarchical agent whose action space is the set of skills". Section 5 provides comparisons between our empowerment estimator and these methods with regard to stability in training, sample efficiency, and correctness of the generated empowerment landscape. Although they aim at a different goal from ours (unsupervised skill discovery rather than unsupervised stabilization), their objectives are of the same type as ours: maximizing mutual information between actuators and sensors. In Section 5 we show that the empowerment landscapes derived by these methods converge to the true empowerment landscape in simple cases, but fail to converge to a known empowerment landscape in more complicated dynamical systems. These works represent the state of the art in empowerment estimation.
We believe a thorough comparison between our method and these works, as provided in Section 5, clearly demonstrates the main properties of our method and clarifies our contribution to the improvement of sample-based empowerment estimation in dynamical systems. As mentioned in the introduction, empowerment is useful for a broad spectrum of applications; in this section, we highlighted the most relevant and recent methods, which we hope helps convey the merits and advances of our approach. Our improved empowerment estimator possesses the essential properties of the empowerment function: in particular, it converges to the correct empowerment landscape. Its training is more stable. Furthermore, it allows for the estimation of the empowerment function from images, which we demonstrate in Section 5, where we show purely intrinsically motivated (empowerment-based) stabilization of an inverted pendulum from images. Accurate estimation of empowerment from images significantly expands the possible breadth of applications for empowerment.

4. PROPOSED METHOD

The desiderata for our method are as follows:
• to learn the channel matrix G(s) in Equation 4 from samples for arbitrary dynamics;
• to utilize general RL optimizers to reliably train intrinsically motivated policies;
• to enable empowerment maximization for arbitrary dynamics using latent representations.
These are desirable because: the ability to learn G(s) from samples allows us to estimate empowerment by the efficient line-search algorithm in Equation 12; recent advances in RL algorithms can alleviate challenges in stability and sample efficiency; and latent representations make our method applicable even when the observations or underlying dynamics do not satisfy the Gaussian linear channel assumption. In this work, we present Latent Gaussian Channel Empowerment, or Latent-GCE for short. Schematically, our method is depicted in Figure 2. The controller policy π_θ(a | s) collects trajectories (s, a, s') in the environment, where s is either a state or an image observation. The channel matrix G_χ(s_t) predicts the final state of a trajectory corresponding to the action sequence a_t^{T-1}, starting from s_t. After performing singular value decomposition on G_χ(s_t), the empowerment value E(s_t) is estimated by Equations 11 and 12. The empowerment along trajectories provides intrinsic rewards, denoted r(τ), for the update of the controller policy π_θ using off-the-shelf RL algorithms. Our method is summarized at a high level in Algorithm 1.

4.1. EMBEDDING OF OBSERVATIONS AND ACTION SEQUENCES

While the Gaussian Linear Channel defined in Equation 4 provides a good model of simple dynamical systems, we want our empowerment estimator to generalize to arbitrary dynamics, including high-dimensional systems and environments where the observations are images. Using autoencoders, we embed the observations and action sequences such that their latent vector representations satisfy the Gaussian Linear Channel property. We use CNN encoders and decoders for image observations, or fully-connected networks for state observations. The encoder-decoder pair for action sequences is also parameterized by fully-connected networks. Let f_χ, f_χ^{-1}, g_χ, g_χ^{-1} denote the encoder-decoder pairs for the observations and action sequences: f_χ maps the observations into d_f-dimensional latent states, and g_χ maps the H-step action sequences into d_g-dimensional latent actions. Our Latent Gaussian Linear Channel is formulated as follows:

f_χ(s_{t+H}) = G_χ(f_χ(s_t)) · g_χ(a_t^H) + η   (5)

Algorithm 1 (core steps):
  for j = 1 to M do
    χ* ← arg min_χ Σ_tuples || G_χ(s_t) · a_t^H − s_{t+H} ||²   {projection to Equation 4}
  end for
  {σ*_i(s_t)}_{i=1}^k ← SVD(G_χ*(s_t))   {an alternative is explained in Appendix C}
  E(s_t) ← (1/2) Σ_{i=1}^k log(1 + σ*_i(s) p*_i)   {Equations 11 and 12}
  reinforce π_θ(a | s) with r(τ) = Σ_{t=1}^T γ^{t−1} E(s_t)   {policy update}
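The latent channel in Equation 5 can be exercised end-to-end with toy linear stand-ins for the learned encoders (all dimensions, weights, and names below are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the learned networks (the paper uses CNN/MLP encoders).
d_s, d_a, H = 3, 1, 4          # state dim, action dim, horizon (illustrative)
d_f, d_g = 2, 2                # latent state / latent action dimensions
W_f = rng.normal(size=(d_f, d_s))        # f_chi: state encoder
W_g = rng.normal(size=(d_g, d_a * H))    # g_chi: action-sequence encoder

def predict_latent_next(s_t, a_seq, G_of_z):
    """Latent Gaussian Linear Channel: f(s_{t+H}) ~= G(f(s_t)) @ g(a_t^H)."""
    z = W_f @ s_t                        # latent state f_chi(s_t)
    u = W_g @ a_seq.ravel()              # latent action g_chi(a_t^H)
    G = G_of_z(z)                        # state-dependent channel matrix
    return G @ u

# Example: a constant channel matrix (in the paper, G_chi is an MLP of z).
z_pred = predict_latent_next(np.zeros(d_s), np.zeros((H, d_a)),
                             lambda z: np.eye(d_f, d_g))
```

A zero action sequence maps to a zero latent action and hence a zero latent prediction, which makes the linear-channel structure easy to sanity-check.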

4.2. LEARNING CHANNEL MATRIX G

The channel matrix G_χ(f_χ(s_t)) has dimensions d_f × d_g, where d_f and d_g are the dimensions of the latent states and actions as defined in Section 4.1. In the Latent Gaussian Linear Channel, as formulated in Equation 5, the ending latent state is approximated by a matrix-vector multiplication between the channel matrix G_χ(f_χ(s_t)) and the latent action g_χ(a_t^H). The mapping from a latent state to its channel matrix is parameterized by a 3-layer fully-connected neural network. The output layer of the network contains d_f · d_g neurons, which is then reshaped into the d_f × d_g matrix.

4.3. OVERALL OPTIMIZATION OBJECTIVES

The overall training objective for our Latent-GCE estimation consists of two parts: a reconstruction loss and a latent prediction loss. The reconstruction loss L_R is the average L2 distance between the decoded latent representations and the original encoder inputs; minimizing L_R encourages the autoencoders to learn meaningful latent representations. The prediction loss L_P is the average L2 distance between the matrix-vector product and the true latent ending state; minimizing L_P forces the latent space to maintain the Gaussian Linear Channel structure:

L(χ) = L_R + L_P = E_s[ || f_χ^{-1}(f_χ(s)) − s ||² ] + E_{a_t^H}[ || g_χ^{-1}(g_χ(a_t^H)) − a_t^H ||² ] + E_{(s_t, a_t^H, s_{t+H})}[ || G_χ(f_χ(s_t)) · g_χ(a_t^H) − f_χ(s_{t+H}) ||² ]   (6)

Overall, our Latent-GCE maximization scheme (Figure 2) trains two sets of parameters, (θ, χ), while VLB methods (Figure 1) generally need three sets, (θ, φ, ψ). The reduced number of trainable components suggests that our method may have better stability and quicker convergence, which we confirm in our experiment section.
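Equation 6 translates directly into code; a sketch with the encoders, decoders, and channel map passed as plain callables (the paper parameterizes them as neural networks, and all names here are ours):

```python
import numpy as np

def latent_gce_loss(f, f_inv, g, g_inv, G, batch):
    """Equation 6: reconstruction loss L_R plus latent prediction loss L_P.

    batch: iterable of (s_t, a_seq, s_next) transitions. f/g encode states
    and action sequences, f_inv/g_inv decode them, and G maps a latent state
    to its channel matrix.
    """
    L_R, L_P = 0.0, 0.0
    for s_t, a_seq, s_next in batch:
        L_R += np.sum((f_inv(f(s_t)) - s_t) ** 2)          # state autoencoder
        L_R += np.sum((g_inv(g(a_seq)) - a_seq) ** 2)      # action autoencoder
        L_P += np.sum((G(f(s_t)) @ g(a_seq) - f(s_next)) ** 2)
    n = len(batch)
    return L_R / n + L_P / n
```

With identity encoders and a channel that reproduces the true dynamics, both terms vanish, which is the fixed point the optimization drives toward.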

5.1. UNSUPERVISED STABILIZATION OF CLASSIC DYNAMIC SYSTEMS

As discussed in Section 1, a control policy that maximizes agent empowerment should lead to stabilization, even at an unstable equilibrium. To evaluate this, we compare our method with two prior works, Mohamed & Rezende (2015) and Karl et al. (2019), on their ability to stabilize simple dynamical systems. Specifically, in the 2D ball-in-box environment, empowerment is highest at the center of the box, where the ball is free to move in all directions; in the pendulum environment, empowerment is highest at the up-right position, where gravity assists in reaching other states. We compare the average squared distance from the balance position during policy training. As shown in Figure 3, all three methods are successful in the simple ball-in-box environment, whereas only our method consistently stabilizes the pendulum at its unstable equilibrium. We refer readers to Appendix D for a detailed visualization of pendulum empowerment/policy learning using our method, and to Appendix E for results on cart-pole.

Figure 3: Evaluating whether methods learn to stabilize simple dynamical systems. All methods succeed for ball-in-box, whereas only our method succeeds for pendulum.

5.2. QUALITATIVE CONVERGENCE OF DIFFERENT ESTIMATORS

We now qualitatively compare the convergence of different empowerment estimators to the true empowerment function. Not only do we include solutions by VLB methods that approximate the classic empowerment definition (Mohamed & Rezende (2015); Karl et al. (2019)), but we also compare with recent works that employ empowerment-like objectives for skill discovery (Eysenbach et al. (2019); Sharma et al. (2020)). From Figure 4, we observe that in simple environments such as ball-in-box, all estimators converge to the correct landscape, with high empowerment values in the middle and low values near the boundaries, consistent with previous work (Klyubin et al. (2005)). For the pendulum environment, while both the method of Karl et al. (2019) and ours produce results similar to the analytical solution (the ground truth), ours outputs an empowerment landscape that is smoother and more discerning. A detailed derivation of the analytical solution is included in Appendix A.

5.3. STABILITY OF EMPOWERMENT ESTIMATIONS

Beyond convergence to correct empowerment values, stability and sample efficiency are also key in determining whether an empowerment estimator is suitable as a reliable intrinsic reward signal. We compute the variance of average empowerment values over 10 seeds as a measure of algorithm stability. Since the outputs of existing methods are on different scales, we compare the relative standard deviation. As shown in Figure 5, our method outperforms the existing sample-based methods: it has lower variance in training and faster convergence to the final estimator. These properties make our method suitable as a reliable unsupervised training signal.

5.4. EMPOWERMENT ESTIMATION IN HIGH-DIMENSIONAL SYSTEMS

We verify the applicability of Latent-GCE in high-dimensional systems from two aspects: first, the latent space should be invertible, i.e. able to predict the true underlying dynamics; second, the empowerment maximization policy should develop stabilizing behaviors.

5.4.1. PIXEL-BASED PENDULUM

We use a pendulum environment that supplies 64 × 64 images as observations. As input to our Latent-GCE algorithm, we concatenate two consecutive frames so as to capture the angular velocity. In Figure 6, we start the pendulum at its up-right position s_0. Using the encoder f_χ, we map the images into the latent representation f_χ(s_0). Given different action sequences a_0^H, we predict the ending latent state as z_{t+H} = G_χ(f_χ(s_t)) · g_χ(a_t^H). If our learned Latent-GCE model captures the true system dynamics, we expect the reconstruction of the ending state, f_χ^{-1}(z_{t+H}), to match the ground truth s_{t+H}. This property holds, as verified in Figure 6. More importantly, as shown in Figure 7, the empowerment estimates given by our method are not only close to the analytical solution, but also train a policy that swings the pendulum up from the bottom and balances it at the top. In contrast, VIM (Mohamed & Rezende (2015)), which we use as a baseline, fails to assign a high empowerment estimate to the up-right position.

5.4.2. HOPPER

We evaluate the ability of Latent-GCE to stabilize the Hopper environment from OpenAI Gym. This environment is more challenging because of its higher state dimension and the long horizon (64) used for empowerment estimation. Here we expect an empowerment maximization policy to balance the hopper at its standing position, where it is able to move left and right freely. In Figure 8 we see that our method successfully balances the hopper, whereas a random policy inevitably makes the hopper fall over.

6. SUMMARY AND DISCUSSION

In this work we introduced a new sample-based method for empowerment estimation in dynamical systems. The key component of our method is the representation of a dynamical system as a Gaussian channel, which allows us to estimate empowerment by convex optimization. Our estimator converges to the true empowerment landscape both from raw states and from images. Empowerment is used in various applications, and our improved estimator has the potential to improve these applications. For example, in Appendix F we show that when the agent is trained with an empowerment-aware objective, it develops distinguishable levels of attention to safety. Moreover, a natural continuation of this work is to learn skills using our Gaussian-channel representation of empowerment, rather than a VLB.

Figure 9: Analytical empowerment landscape.

where σ_i(s), p_i, k, and P are the i-th singular value of G(s), the corresponding actuator signal power, the number of nonzero singular values, and the total power of the actuator signal, respectively. The optimal solution to Equation 11 is given by:

p*_i(ν) = max(1/ν − 1/σ_i(s), 0),   Σ_{i=1}^k p*_i(ν) = P,   (12)

where ν is found by a line search (Cover & Thomas (2012)) to satisfy the total power constraint.

C SVD ALTERNATIVE

The singular values σ_i(s_t) can be estimated directly, without a singular value decomposition, by representing the channel matrix as G_χ(s_t) = U_χ(s_t) · diag(Σ_χ(s_t)) · V_χ(s_t) with two additional orthonormality penalties: || U_χ(s_t) U_χ†(s_t) − I_{d_s × d_s} ||² and || V_χ(s_t) V_χ†(s_t) − I_{d_a(H−1) × d_a(H−1)} ||². In our experiments, we did not experience a bottleneck in computing the singular value decomposition for either raw states or images, so we do not implement this alternative; it can be applied when an explicit SVD is undesired.
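The orthonormality penalties above can be sketched as follows (real-valued case, so the conjugate transpose reduces to the transpose; the function name is ours):

```python
import numpy as np

def orthonormality_penalty(U, V):
    """Penalty keeping the factors of G = U diag(Sigma) V orthonormal,
    so the entries of Sigma can be read off as singular values without
    calling an SVD routine. U and V are assumed to have orthonormal rows."""
    pu = np.sum((U @ U.T - np.eye(U.shape[0])) ** 2)
    pv = np.sum((V @ V.T - np.eye(V.shape[0])) ** 2)
    return pu + pv
```

Adding this penalty to the training loss drives U and V toward orthonormal factors; the penalty is zero exactly when both constraints are satisfied.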

D TWO PHASES IN EMPOWERMENT-BASED STABILIZATION

Empowerment of the pendulum is highest when it is stationary at the up-right position. In this section, we show the evolution of the empowerment landscape, the corresponding control policy, and the state distribution as we run Algorithm 1 without any external reward. In Figure 10, we observe that Algorithm 1 proceeds in two phases: empowerment learning and policy learning. The empowerment estimate converges from epoch 0 to 10, and then the policy converges from epoch 10 to 50.

E CART-POLE SWING-UP

The cart-pole environment consists of a stiff pole with one end fixed to a pivot on a cart movable along one axis. The cart can be pushed left or right within a confined region. The state space of cart-pole has 4 components: cart position, cart velocity, pole angle, and pole angular velocity. We expect high empowerment when the pole is upright, because starting there, gravity is able to help the system reach a broader set of future states. To measure whether Algorithm 1 successfully brings the cart-pole to the upright position, we plot the average squared angle to the top against the number of training steps. For comparison, we also include 2 different reward functions that directly target this objective. As shown in Figure 11, the empowerment reward, without any prior knowledge of the environment, achieves final performance similar to the direct reward signals. Moreover, it learns this objective at a speed on par with the dense reward.

F EMPOWERMENT ESTIMATION FOR SAFETY

A useful application of empowerment is its implication for agent safety. A state is intrinsically safe for an agent when the agent has a high diversity of future states achievable by its actions, because in such states the agent can take effective actions to prevent undesirable futures. In this context, the higher its empowerment value, the safer the agent is. In this experiment, we first check that our calculation of empowerment matches the specific design of the environment. Additionally, we show that an empowerment-augmented reward function can affect the agent's preference between a shorter but more dangerous path and a longer but safer one. Environment: a double tunnel environment implemented with the OpenAI Gym API (Brockman et al., 2016). The agent (blue) is modeled as a ball of radius 1 inside a 20×20 box. The box is separated by a thick wall (gray) into top and bottom sections. Two tunnels connect the top and bottom of the box. The tunnel in the middle is narrower but closer to the goal than the one on the right.

Control:

In each time step, the agent can move at most 0.5 unit length in each of the x and y directions. If an action takes the agent into a wall, it is shifted back out to the nearest surface. Reset criterion: each episode has a maximum length of 200 steps. The environment resets when time runs out or when the agent reaches the goal. Since the tunnel in the middle is narrower, the agent is relatively less safe there. The effectiveness of the agent's control is dampened in 2 ways: 1. In a particular time step, it is more likely for the agent to bump into the walls; when this happens, the agent is unable to proceed as far as desired. 2. The tunnel is 10 units in length; when the agent is around the middle, it will still be inside the tunnel at the 5-step horizon, so the agent has fewer possible future states. We trained the agent with the PPO algorithm (Schulman et al., 2017) from OpenAI Baselines (Dhariwal et al., 2017) using an empowerment-augmented reward function. After a parameter search, we used a discount factor γ = 0.95 over a total of 10^6 steps. The reward function we choose is: R(s, a) = 1_goal + β · Emp(s), where β balances the relative weight between the goal-conditioned reward and the intrinsic safety reward. With high β, we expect the agent to learn a more conservative policy. The results from the tunnel environment again support our proposed scheme: first, the empowerment landscape matches our design of the environment; second, a high-quality empowerment reward successfully alters the behavior of the agent.
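The augmented reward R(s, a) = 1_goal + β · Emp(s) is straightforward to express; a sketch (the interface is ours, and Emp(s) is whatever empowerment estimate the agent has at s):

```python
def safety_augmented_reward(reached_goal, empowerment, beta=0.1):
    """R(s, a) = 1_goal + beta * Emp(s): goal reward plus intrinsic safety.

    Larger beta makes the agent favor high-empowerment (safer) states,
    e.g. the wider tunnel, at the cost of a longer path to the goal.
    """
    return (1.0 if reached_goal else 0.0) + beta * empowerment
```

Sweeping β then interpolates between a purely goal-seeking policy (β = 0) and an increasingly conservative, safety-seeking one.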



Project page: https://sites.google.com/view/latent-gce

Footnote 1: Broadly speaking, an increase in the rate of information flow in the perception-action loop (Tishby & Polani (2011)) could be an impetus for the development of Homo sapiens, as hypothesized in (Yuval (2014)).

Footnote 2: When the agent does not have access to the true state s, empowerment is calculated as the maximal mutual information between actuators and sensors given an estimated state.



Figure 1: Estimation of empowerment via variational lower bounds.

Figure 2: Empowerment maximization via Latent-GCE

Figure 4: Empowerment landscapes (normalized) for ball-in-box (top) and pendulum (bottom). In ball-in-box, central positions correspond to high empowerment. For pendulum, upright positions correspond to high empowerment; x-axis is the angle to top, y-axis is the angular velocity. From left to right, the plots show: ours, Mohamed & Rezende (2015), Karl et al. (2019), Eysenbach et al. (2019), Sharma et al. (2020).

Figure 5: Relative standard deviation (RSD) in average empowerment value across 10 seeds. Our method converges faster to a lower RSD during training.

Figure 6: Reconstruction in the latent dynamics vs the ground truth. Starting from the same upright position, a different action sequence is taken in each row (action +1: counter-clockwise torque, 0: no torque, -1: clockwise torque). Reconstruction matches the actual observation in all cases.

Figure 7: Empowerment estimates for pixel-based pendulum. Latent-GCE learns a reasonable empowerment function from pixels. Left: Analytical (x-axis shifted by π relative to Figure 4). Middle: Latent-GCE along with the swinging-up trajectory of the final control policy. Right: VIM.

Figure 8: Average torso height over an evaluation episode of length 200 for the Hopper environment. Our method stabilizes the hopper to a standing position (z ≈ 1).

Figure 10: Snapshots of the empowerment landscape (top), the control policy (middle), and the state concentration (bottom). The x-axis is θ (angle) and the y-axis is θ̇ (angular velocity). The angle is measured from the upright position (θ = 0 rad corresponds to the upright position). Each epoch corresponds to 2 × 10^4 steps. The policy corresponds to the optimal policy for the stabilization task with an external reward (Doya (2000)).

Figure 11: Cart-Pole - Comparison of unsupervised and supervised reward functions. The "dense reward" is -θ^2 and the "sparse reward" is 1{|θ| < π/10}.

Figure 12: Left sub-figure: Double tunnel environment; the goal is marked in red and the agent in blue. Right sub-figure: Empowerment landscape for the tunnel environment. Empowerment values decrease at the corners and inside the tunnel, where the agent's control is less effective than in more open locations.

Figure 13: Trajectories of the trained policy. As β increases, the agent develops a stronger preference for the safer route, sacrificing hitting time.

ACKNOWLEDGMENTS

This research is supported by Berkeley Deep Drive, Intel, and Amazon (through AWS cloud computing services).

A ANALYTICAL DERIVATION OF G(s) FOR INVERTED PENDULUM

The main steps in the derivation of G(s) appear below; the full derivation appears in Section 4.1 of Salge et al. (2013). The current state of the pendulum, s_t, is given by θ_t and θ̇_t, where θ is measured from the upright position. We consider the development of the pendulum over three time steps of duration dt. The behaviour of the pendulum can be approximated linearly at every time step for small enough dt. Then, the next state of the pendulum is given by a linear update in which a_{t+dt} is the action at the next step. Our goal is to represent s_{t+3dt} in the form of Equation 9, which is achieved by linearising θ_{t+3dt} and θ̇_{t+3dt} around a_t = 0. A straightforward calculation yields the matrix G(s_t) in Equation 10. This calculation can be continued beyond 3 steps, involving more complicated terms in the expansions. Given G(s_t) in Equation 10, one can represent s_{t+3dt} by Equation 9, where the matrix K does not affect the mutual information (Cover & Thomas (2012)). Adding to Equation 9 the normal noise, η, as suggested in Salge et al. (2013), we get the representation in Equation 4. In contrast, we propose to learn G(s) from samples via interaction with the environment, for arbitrary dynamics and an arbitrary number of steps (c.f. 3 steps in the above derivation). Using G(s) from Equation 10, we can calculate the analytical (ground-truth) empowerment landscape, which is shown in Figure 9.

B GAUSSIAN CHANNEL CAPACITY

The capacity of the channel in Equation 4 is given by the Gaussian channel capacity formula (Cover & Thomas (2012)).

G EXPERIMENT DETAILS

For our experiments using DADS, we use a policy and discriminator architecture of two hidden layers of size 256 with ReLU activations and a latent dimension of 2. Our discount factor is 0.99. Following the suggestions of the original paper, we use 512 prior samples to approximate the intrinsic reward, and run 32 discriminator updates and 128 policy updates per 1000 timesteps; a learning rate of 3 × 10^-4 and a batch size of 256 are used throughout. We use a replay buffer of size 20000 to help stabilize training; it is used only for policy updates.
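For a linear Gaussian channel of the form s' = G a + η with isotropic noise and a total power constraint on the action, the capacity of Appendix B can be computed by water-filling over the singular values of G (Cover & Thomas (2012)). A minimal sketch, in nats, where the power budget, noise variance, and bisection iteration count are illustrative choices:

```python
import numpy as np

def gaussian_capacity(G, power=1.0, noise_var=1.0):
    """Capacity (nats) of s' = G a + eta, eta ~ N(0, noise_var * I),
    subject to E[||a||^2] <= power, via water-filling on the
    singular values of G."""
    sv = np.linalg.svd(G, compute_uv=False)
    gains = sv**2 / noise_var            # per-mode SNR gains
    gains = gains[gains > 0]
    if gains.size == 0:
        return 0.0
    # Bisection for the water level mu: sum_i max(mu - 1/g_i, 0) = power.
    lo, hi = 0.0, power + 1.0 / gains.min()
    for _ in range(100):
        mu = 0.5 * (lo + hi)
        used = np.maximum(mu - 1.0 / gains, 0.0).sum()
        lo, hi = (mu, hi) if used < power else (lo, mu)
    p = np.maximum(mu - 1.0 / gains, 0.0)  # optimal power allocation
    return float(0.5 * np.log1p(gains * p).sum())

C = gaussian_capacity(np.eye(2), power=2.0, noise_var=1.0)  # -> log(2) ~ 0.693
```

For the identity channel with unit power per mode, each mode receives power 1 and the capacity is 2 × (1/2) log 2 = log 2, which the sketch recovers.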

G.4 HOPPER EXPERIMENT DETAILS

Each episode is set to a fixed length of 200 steps. We use PPO as the RL backbone for our control policy. The policy is updated every 1000 environment steps, with a learning rate of 1 × 10^-4 and a discount factor γ = 0.5.
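The Hopper setup above can be summarized in a small configuration sketch. The key names are illustrative (not the exact flags of any particular PPO implementation); the values are those stated above. Note that the unusually small discount factor implies a very short effective horizon:

```python
# Hypothetical PPO configuration mirroring the Hopper setup described above;
# field names are illustrative, values are from the text.
ppo_config = {
    "episode_length": 200,        # fixed-length episodes
    "update_every_steps": 1000,   # policy update cadence (env steps)
    "learning_rate": 1e-4,
    "gamma": 0.5,                 # discount factor
}

# With gamma = 0.5, a reward k steps ahead is weighted by 0.5**k, so the
# agent effectively optimizes over only a handful of future steps.
weight_5_steps_ahead = ppo_config["gamma"] ** 5  # -> 0.03125
```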

