EFFICIENT EMPOWERMENT ESTIMATION FOR UNSUPERVISED STABILIZATION

Abstract

Intrinsically motivated artificial agents learn advantageous behavior without externally provided rewards. Previously, it was shown that maximizing the mutual information between an agent's actuators and future states, known as the empowerment principle, enables unsupervised stabilization of dynamical systems at upright positions, a prototypical intrinsically motivated behavior underlying standing and walking. This follows from the coincidence between the objective of stabilization and the objective of empowerment. Unfortunately, sample-based estimation of this kind of mutual information is challenging. Recently, various variational lower bounds (VLBs) on empowerment have been proposed as solutions; however, they are often biased, unstable in training, and have high sample complexity. In this work, we propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel, which allows us to efficiently calculate an unbiased estimator of empowerment by convex optimization. We demonstrate our solution for sample-based unsupervised stabilization on different dynamical control systems and show the advantages of our method by comparing it to the existing VLB approaches. Specifically, we show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images. Consequently, our method opens a path to wider and easier adoption of empowerment for various applications.¹

1. INTRODUCTION

Intrinsic motivation allows artificial and biological agents to acquire useful behaviors without external knowledge (Barto et al. (2004); Chentanez et al. (2005); Schmidhuber (2010); Barto (2013); Oudeyer et al. (2016)). In the framework of reinforcement learning (RL), this external knowledge is usually provided by an expert through a task-specific reward, which is optimized by an artificial agent towards a desired behavior (Mnih et al. (2013); Schulman et al. (2017)). In contrast, an intrinsic reward can arise solely from the interaction between the agent and the environment, which eliminates the need for domain knowledge and reward engineering in some settings (Mohamed & Rezende (2015); Houthooft et al. (2017); Pathak et al. (2017)). Previously, it was shown that maximizing the mutual information (Cover & Thomas (2012)) between an agent's actuators and sensors can guide the agent towards states of the environment with a higher potential to achieve a larger number of eventual goals (Klyubin et al. (2005); Wissner-Gross & Freer (2013)). Maximizing this kind of mutual information is known as the empowerment principle (Klyubin et al. (2005); Salge et al. (2014)). It was later found that an agent maximizing its empowerment converges to an unstable equilibrium of the environment in various dynamical control systems (Jung et al. (2011); Salge et al. (2013); Karl et al. (2019)). In this application, ascending the gradient of the empowerment function coincides with the objective of optimal control for stabilization at an unstable equilibrium (Strogatz (2018)), which is an important task for both engineering (Todorov (2006)) and biological² systems; we refer to this as the essential property of empowerment. It follows from the aforementioned prior works that a plausible estimate of the empowerment function should possess this essential property.
Empowerment has been found to be useful for a broad spectrum of applications, including: unsupervised skill discovery (Sharma et al. (2020); Eysenbach et al. (2019); Gregor et al. (2017); Karl et al. (2019); Campos et al. (2020)); human-agent coordination (Salge & Polani (2017); Guckelsberger et al. (2016)); assistance (Du et al. (2021)); and stabilization (Tiomkin et al. (2017)). Past work has utilized variational lower bounds (VLBs) (Poole et al. (2019); Alemi et al. (2017); Tschannen et al. (2020); Mohamed & Rezende (2015)) to estimate empowerment. However, VLB approaches to empowerment in dynamical control systems (Sharma et al. (2020); Achiam et al. (2018)) have high sample complexity, are often unstable in training, and may be biased. Moreover, it has not previously been studied whether empowerment estimators learned via VLBs possess the essential properties of empowerment.

In this work, we introduce a new method, Latent Gaussian Channel Empowerment (Latent-GCE), for empowerment estimation, and utilize the above-mentioned property as an indicator of estimation quality. Specifically, we propose a representation of dynamical control systems using deep neural networks, learned from state-action trajectories. This representation admits efficient estimation of empowerment by convex optimization (Cover & Thomas (2012)), both from raw states and from images. We propose an algorithm for simultaneous estimation and maximization of empowerment using standard RL algorithms such as Proximal Policy Optimization (Schulman et al. (2017)) and Soft Actor-Critic (Haarnoja et al. (2018)). We test our method on the task of unsupervised stabilization of dynamical systems with a solely intrinsic reward, showing that our estimator exhibits the essential properties of the empowerment function. We demonstrate the advantages of our method through comparisons to existing state-of-the-art empowerment estimators on different dynamical systems from the OpenAI Gym simulator (Brockman et al. (2016)). We find that our method (i) has lower sample complexity, (ii) is more stable in training, (iii) possesses the essential properties of the empowerment function, and (iv) allows accurate estimation of empowerment from images. We hope this review of existing methods for empowerment estimation helps advance this research direction.
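For a linear-Gaussian representation of the dynamics, the convex optimization mentioned above reduces to the classical water-filling solution for the capacity of parallel Gaussian channels. The sketch below is illustrative only: the function name, the bisection on the water level, and the reduction to parallel channels with unit noise (e.g., via an SVD of the learned input-to-state map) are our assumptions, not the paper's implementation.

```python
import numpy as np

def water_filling_capacity(singular_values, power, tol=1e-9):
    """Capacity (in nats) of parallel Gaussian channels with unit noise,
    channel gains `singular_values`, and total input power `power`."""
    g = np.asarray(singular_values, dtype=float) ** 2  # per-channel gains sigma_i^2
    # Bisect on the water level mu so that sum(max(mu - 1/g_i, 0)) == power.
    lo, hi = 0.0, power + np.max(1.0 / g)
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.maximum(mu - 1.0 / g, 0.0).sum() > power:
            hi = mu
        else:
            lo = mu
    p = np.maximum(0.5 * (lo + hi) - 1.0 / g, 0.0)  # optimal power allocation
    return 0.5 * np.log(1.0 + g * p).sum()
```

For a single channel with gain 1 and power budget 3, this returns 0.5 log(1 + 3) = log 2 nats; weak channels whose inverse gain lies above the water level receive zero power, which is what makes the allocation sparse.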

2. PRELIMINARIES

In this section, we review the necessary background for our method, consisting of the reinforcement learning setting, various empowerment estimators, and the Gaussian channel capacity. We also review the underlying components in relevant prior work which we use for comparison to our method.

2.1. REINFORCEMENT LEARNING

The reinforcement learning (RL) setting is modeled as an infinite-horizon Markov Decision Process (MDP) defined by: the state space S, the action space A, the transition probabilities p(s'|s, a), the initial state distribution p_0(s), the reward function r(s, a) ∈ ℝ, and the discount factor γ. The goal of RL is to find an optimal control policy π(a|s) that maximizes the expected return, i.e., $\max_{\pi} \mathbb{E}_{s_0 \sim p_0,\, a_t \sim \pi,\, s_{t+1} \sim p}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$ (Sutton & Barto (2018)).
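To make the objective concrete, the quantity inside the expectation above can be computed for a single sampled trajectory as follows; `discounted_return` is a hypothetical helper for illustration, not part of any method described here.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one trajectory's reward sequence.

    Accumulating backwards (Horner's rule) avoids explicitly forming gamma^t.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

RL algorithms such as PPO and SAC maximize the expectation of this quantity over initial states, policy actions, and environment transitions.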

2.2. EMPOWERMENT

Interaction between an agent and its environment is plausibly described by the perception-action cycle (PAC) (Tishby & Polani (2011)), where the agent observes the state of the environment via its sensors and responds with its actuators. The maximal information rate from actuators to sensors is an inherent property of the PAC which characterizes the empowerment of the agent, formally defined below. Empowerment (Klyubin et al. (2005)) is defined as the maximal mutual information rate (Cover & Thomas (2012)) between the agent's sensor observations o ∈ O and actuators a ∈ A given the current state s ∈ S. It is fully specified by a fixed probability distribution of sensor observations conditioned on the current state and the chosen actions, p(o | s, a).
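In this notation, the empowerment of a state s is the capacity of the channel from actions to observations, i.e., the mutual information maximized over the action distribution ω(a|s):

```latex
\mathcal{E}(s)
\;=\; \max_{\omega(a \mid s)} I(a;\, o \mid s)
\;=\; \max_{\omega(a \mid s)}\;
\mathbb{E}_{\omega(a \mid s)\, p(o \mid s, a)}
\left[ \log \frac{p(o \mid s, a)}{p(o \mid s)} \right],
\qquad
p(o \mid s) \;=\; \int \omega(a \mid s)\, p(o \mid s, a)\, da .
```

The outer maximization over ω is a convex problem in the action distribution for a fixed channel p(o | s, a), which is the property exploited by capacity computations such as water-filling in the Gaussian case.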



¹ Project page: https://sites.google.com/view/latent-gce
² Broadly speaking, an increase in the rate of information flow in the perception-action loop (Tishby & Polani (2011)) could be an impetus for the development of Homo sapiens, as hypothesized in Yuval (2014).




