MUTUAL INFORMATION STATE INTRINSIC CONTROL

Abstract

Reinforcement learning has been shown to be highly successful at many challenging tasks. However, this success heavily relies on well-shaped rewards. Intrinsically motivated RL attempts to remove this constraint by defining an intrinsic reward function. Motivated by the self-consciousness concept in psychology, we make a natural assumption that the agent knows what constitutes itself, and propose a new intrinsic objective that encourages the agent to have maximum control over the environment. We mathematically formalize this reward as the mutual information between the agent state and the surrounding state under the current agent policy. With this new intrinsic motivation, we outperform previous methods, including completing the pick-and-place task for the first time without using any task reward. A video showing experimental results is available at https://youtu.be/AUCwc9RThpk.

1. INTRODUCTION

Reinforcement learning (RL) allows an agent to learn meaningful skills by interacting with an environment and optimizing some reward function provided by the environment. Although RL has achieved impressive results on various tasks (Silver et al., 2017; Mnih et al., 2015; Berner et al., 2019), it is very expensive to provide dense rewards for every task we want a robot to learn. Intrinsically motivated reinforcement learning instead encourages the agent to explore by providing an "internal motivation", such as curiosity (Schmidhuber, 1991; Pathak et al., 2017; Burda et al., 2018), diversity (Gregor et al., 2016; Haarnoja et al., 2018; Eysenbach et al., 2019), or empowerment (Klyubin et al., 2005; Salge et al., 2014; Mohamed & Rezende, 2015). These internal motivations can be computed on the fly while the agent interacts with the environment, without any human-engineered reward. We hope to extract useful "skills" from such intrinsically motivated agents, which could later be used to solve downstream tasks, or simply to augment a sparse task reward with the intrinsic reward to solve a given task faster.

Most previous work in RL models the environment as a Markov Decision Process (MDP). In an MDP, a single state vector describes the current state of the whole environment, without explicitly distinguishing the agent itself from its surrounding. However, in the physical world, there is a clear boundary between an intelligent agent and its surrounding. The skin of any mammal is an example of such a boundary. The separation of the agent and its surrounding also holds for most man-made agents, such as any mechanical robot. This agent-surrounding separation has long been studied in psychology under the concept of self-consciousness. Self-consciousness refers to a subject's knowledge that it is itself the object of awareness (Smith, 2020), effectively treating the agent itself differently from everything else.
Gallup (1970) showed that self-consciousness widely exists in chimpanzees, dolphins, some elephants, and human infants. To emphasize the agent and its surrounding equally, we name this separation the agent-surrounding separation in this paper. The widely adopted MDP formulation ignores this natural agent-surrounding separation and simply stacks the agent state and its surrounding state together into a single state vector. Although this formulation is mathematically concise, we argue that it is over-simplistic and, as a result, makes learning harder. Based on the agent-surrounding separation, we instead encourage the agent to perform actions such that the resulting agent state has high Mutual Information (MI) with the surrounding state. Intuitively, the higher the MI, the more control the agent has over its surrounding. We name the proposed method "MUtual information-based State Intrinsic Control", or MUSIC in short. With the proposed MUSIC method, we are able to learn many complex skills in an unsupervised manner, such as learning to pick up an object without any task reward. We can also augment a sparse reward with the dense MUSIC intrinsic reward to accelerate the learning process.

Our contributions are three-fold. First, we propose a novel intrinsic motivation (MUSIC) that encourages the agent to have maximum control over its surrounding, based on the natural agent-surrounding separation assumption. Second, we propose scalable objectives that make the MUSIC intrinsic reward easy to optimize. Last but not least, we show MUSIC's superior performance by comparing it with other competitive intrinsic rewards on multiple environments. Notably, with our method, the pick-and-place task can be solved for the first time without any task reward.
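The MI between agent state and surrounding state is generally intractable for continuous states, which is why scalable objectives rely on variational lower bounds. The toy sketch below illustrates the Donsker-Varadhan bound on synthetic 1-D data with a fixed, hand-picked critic; it is only an illustration of the kind of bound involved, not the paper's learned objective, and all variable names and constants are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)            # toy 1-D "agent state" samples
y = x + 0.2 * rng.normal(size=n)  # "surrounding state" that tracks the agent

def dv_lower_bound(f_joint, f_marginal):
    """Donsker-Varadhan bound: I(X;Y) >= E_p(x,y)[f] - log E_p(x)p(y)[e^f]."""
    return f_joint.mean() - np.log(np.exp(f_marginal).mean())

# f(x, y) = 0.5 * x * y is a fixed toy critic; a learned method would
# parameterize f with a neural network and maximize the bound over f.
y_shuffled = rng.permutation(y)   # approximates samples from p(x)p(y)
bound = dv_lower_bound(0.5 * x * y, 0.5 * x * y_shuffled)
# The bound is positive here because y depends strongly on x.
```

Maximizing such a bound with respect to both the critic and the policy is what makes an MI objective usable as an on-the-fly intrinsic reward.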

2. PRELIMINARIES

For environments, we consider four robotic tasks: push, slide, pick-and-place, and navigation, as shown in Figure 2. The goal in each manipulation task is to move the target object to a desired position. In the navigation task, the goal is to navigate to a target ball. In the following, we define some terminology.

2.1. AGENT STATE, SURROUNDING STATE, AND REINFORCEMENT LEARNING SETTINGS

In this paper, the agent state s^a means literally the state variable of the agent. The surrounding state s^s refers to the state variable that describes the surrounding of the agent, for example, the state variable of an object. For multi-goal environments, we use the same assumption as previous works (Andrychowicz et al., 2017; Plappert et al., 2018), which consider that goals can be represented as states; we denote the goal variable as g. For example, in the manipulation tasks, a goal is a particular desired position of the object in the episode. These desired positions, i.e., goals, are sampled from the environment.

The division between the agent state and the surrounding state is naturally defined by the agent-surrounding separation concept introduced in Section 1. From a biological point of view, a human can naturally distinguish its own parts, such as hands or legs, from the environment. Analogously, when we design a robotic system, we can easily tell which variables constitute the agent state and which constitute the surrounding state.

In this paper, we use uppercase letters, such as S, to denote random variables and the corresponding lowercase letters, such as s, to denote their values. We assume the world is fully observable, including a set of states S, a set of actions A, a distribution of initial states p(s_0), transition probabilities p(s_{t+1} | s_t, a_t), a reward function r: S × A → R, and a discount factor γ ∈ [0, 1]. These components form a Markov Decision Process represented as a tuple (S, A, p, r, γ). We use τ to denote a trajectory, which contains a series of agent states and surrounding states; its random variable is denoted as T.
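Concretely, the agent-surrounding division amounts to partitioning the full state vector into the coordinates describing the agent and those describing everything else. A minimal sketch, under the hypothetical assumption that the first three entries of the state are the gripper position and the remaining entries the object position (the actual layout is environment-specific and known by construction when the robot is designed):

```python
import numpy as np

AGENT_DIM = 3  # hypothetical: first 3 entries describe the agent (the gripper)

def split_state(s):
    """Split a full MDP state s into (agent state s^a, surrounding state s^s)."""
    s = np.asarray(s, dtype=float)
    return s[:AGENT_DIM], s[AGENT_DIM:]

# Example: a 6-D state = 3-D gripper position + 3-D object position.
s_a, s_s = split_state([0.1, 0.2, 0.3, 0.9, 0.8, 0.7])
```

This fixed, designer-known split is exactly the assumption the paper makes: no learning is needed to identify which state variables belong to the agent.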

3. METHOD

We focus on an agent learning to control its surrounding purely through its observations and actions, without supervision. Motivated by the idea that when an agent takes control of its surrounding, there is high MI between the agent state and the surrounding state, we formulate the problem of learning without external supervision as one of learning a policy π_θ(a_t | s_t) with parameters θ to maximize the intrinsic MI reward r = I(S^a; S^s). In this section, we formally describe our method, mutual information-based state intrinsic control (MUSIC).
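To make the quantity being maximized concrete, the sketch below estimates MI from paired samples with a simple histogram plug-in estimator on synthetic 1-D data. This is only an illustration of the reward signal; a plug-in estimator does not scale to continuous high-dimensional states, which is why a variational objective is optimized in practice. The data and names here are our own toy assumptions.

```python
import numpy as np

def mutual_information(xs, ys, bins=8):
    """Plug-in estimate of I(X;Y) in nats from paired 1-D samples."""
    joint, _, _ = np.histogram2d(xs, ys, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

rng = np.random.default_rng(0)
agent = rng.normal(size=5000)                     # agent-state samples
controlled = agent + 0.1 * rng.normal(size=5000)  # surrounding follows agent
independent = rng.normal(size=5000)               # surrounding ignores agent

# A policy that controls its surrounding earns a higher MI reward.
mi_high = mutual_information(agent, controlled)
mi_low = mutual_information(agent, independent)
```

The gap between the two estimates is what the intrinsic reward rewards: trajectories in which the surrounding state is predictable from the agent state.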

3.1. MUTUAL INFORMATION REWARD FUNCTION

Our framework simultaneously learns a policy and an intrinsic reward function by maximizing the MI between the surrounding state and the agent state. Mathematically, the MI between the surround-

