MUTUAL INFORMATION STATE INTRINSIC CONTROL

Abstract

Reinforcement learning has been shown to be highly successful at many challenging tasks. However, this success heavily relies on well-shaped rewards. Intrinsically motivated RL attempts to remove this constraint by defining an intrinsic reward function. Motivated by the concept of self-consciousness in psychology, we make the natural assumption that the agent knows what constitutes itself, and propose a new intrinsic objective that encourages the agent to have maximum control over its environment. We mathematically formalize this reward as the mutual information between the agent state and the surrounding state under the current agent policy. With this new intrinsic motivation, we outperform previous methods and, for the first time, complete the pick-and-place task without using any task reward. A video showing experimental results is available at https://youtu.be/AUCwc9RThpk.

1. INTRODUCTION

Reinforcement learning (RL) allows an agent to learn meaningful skills by interacting with an environment and optimizing a reward function provided by the environment. Although RL has achieved impressive results on various tasks (Silver et al., 2017; Mnih et al., 2015; Berner et al., 2019), it is very expensive to provide dense rewards for every task we want a robot to learn. Intrinsically motivated reinforcement learning instead encourages the agent to explore by providing an "internal motivation", such as curiosity (Schmidhuber, 1991; Pathak et al., 2017; Burda et al., 2018), diversity (Gregor et al., 2016; Haarnoja et al., 2018; Eysenbach et al., 2019) and empowerment (Klyubin et al., 2005; Salge et al., 2014; Mohamed & Rezende, 2015). These internal motivations can be computed on the fly while the agent interacts with the environment, without any human-engineered reward. The hope is to extract useful "skills" from such intrinsically motivated agents that can later be reused to solve downstream tasks, or to augment a sparse task reward with the intrinsic reward so that a given task is solved faster.

Most previous work in RL models the environment as a Markov Decision Process (MDP). In an MDP, a single state vector describes the current state of the whole environment, without explicitly distinguishing the agent itself from its surrounding. In the physical world, however, there is a clear boundary between an intelligent agent and its surrounding: the skin of any mammal is an example of such a boundary. This separation between the agent and its surrounding also holds for most man-made agents, such as mechanical robots. The agent-surrounding separation has long been studied in psychology under the concept of self-consciousness. Self-consciousness refers to a subject's knowledge that it is itself the object of awareness (Smith, 2020), effectively treating the agent differently from everything else. Gallup (1970) showed that self-consciousness is widespread among chimpanzees, dolphins, some elephants, and human infants. To emphasize the agent and its surrounding equally, we name this separation the agent-surrounding separation in this paper.

The widely adopted MDP formulation ignores this natural agent-surrounding separation and simply stacks the agent state and the surrounding state together into a single state vector. Although this formulation is mathematically concise, we argue that it is overly simplistic and, as a result, makes learning harder.
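To make the objective concrete, the intrinsic reward sketched in the abstract can be written out explicitly. Writing s^a for the agent state and s^s for the surrounding state (notation introduced here for illustration), the reward is the mutual information between the two under the state distribution induced by the current policy:

```latex
% Mutual information between the agent state S^a and the surrounding
% state S^s under the state distribution p^\pi induced by the policy \pi.
I^{\pi}(S^{a}; S^{s})
  = \mathbb{E}_{p^{\pi}(s^{a},\, s^{s})}
    \left[ \log \frac{p^{\pi}(s^{a}, s^{s})}
                     {p^{\pi}(s^{a})\, p^{\pi}(s^{s})} \right]
```

Since these densities are unknown in practice, one common route, shown below as a minimal sketch rather than the paper's own estimator, is to maximize a variational lower bound such as the Donsker-Varadhan bound with a learned statistics network. All class and function names here are hypothetical:

```python
# Minimal sketch (PyTorch) of a Donsker-Varadhan lower bound on
# I(S^a; S^s); names are illustrative, not taken from the paper.
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """Scores (agent state, surrounding state) pairs; scoring jointly
    sampled pairs above mismatched pairs tightens the bound."""
    def __init__(self, dim_agent: int, dim_surround: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_agent + dim_surround, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s_agent, s_surround):
        return self.net(torch.cat([s_agent, s_surround], dim=-1))

def dv_mi_lower_bound(T, s_agent, s_surround):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)].
    Joint pairs are aligned rows of the batch; marginal pairs are made
    by shuffling the surrounding states within the batch."""
    joint = T(s_agent, s_surround)                                 # (B, 1)
    shuffled = s_surround[torch.randperm(s_surround.size(0))]
    marginal = T(s_agent, shuffled)                                # (B, 1)
    log_mean_exp = torch.logsumexp(marginal, dim=0) - math.log(marginal.size(0))
    return joint.mean() - log_mean_exp.squeeze()
```

A usage sketch, under the assumed design choice that the tightened bound serves as the intrinsic reward signal:

```python
# Train the statistics network to tighten the bound; the resulting MI
# estimate can then be used as an intrinsic reward bonus.
T = StatisticsNetwork(dim_agent=10, dim_surround=15)
optimizer = torch.optim.Adam(T.parameters(), lr=1e-4)
s_a, s_s = torch.randn(256, 10), torch.randn(256, 15)  # stand-in batch
loss = -dv_mi_lower_bound(T, s_a, s_s)                 # maximize the bound
optimizer.zero_grad(); loss.backward(); optimizer.step()
```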

