SMIRL: SURPRISE MINIMIZING REINFORCEMENT LEARNING IN UNSTABLE ENVIRONMENTS

Abstract

Every living organism struggles against disruptive environmental forces to carve out and maintain an orderly niche. We propose that such a struggle to achieve and preserve order might offer a principle for the emergence of useful behaviors in artificial agents. We formalize this idea into an unsupervised reinforcement learning method called surprise minimizing reinforcement learning (SMiRL). SMiRL alternates between learning a density model to evaluate the surprise of a stimulus, and improving the policy to seek more predictable stimuli. The policy seeks out stable and repeatable situations that counteract the environment's prevailing sources of entropy. This might include avoiding other hostile agents, or finding a stable, balanced pose for a bipedal robot in the face of disturbance forces. We demonstrate that our surprise-minimizing agents can successfully play Tetris and Doom, control a humanoid to avoid falls, and navigate to escape enemies in a maze, all without any task-specific reward supervision. We further show that SMiRL can be used together with standard task rewards to accelerate reward-driven learning.

1. INTRODUCTION

Organisms can carve out environmental niches within which they can maintain relative predictability amidst the entropy around them (Boltzmann, 1886; Schrödinger, 1944; Schneider & Kay, 1994; Friston, 2009). For example, humans go to great lengths to shield themselves from surprise: we band together to build cities with homes, supplying water, food, gas, and electricity to control the deterioration of our bodies and living spaces amidst heat, cold, wind, and storm. These activities exercise sophisticated control over the environment, which makes the environment more predictable and less "surprising" (Friston, 2009; Friston et al., 2009). Could the motive of preserving order guide the automatic acquisition of useful behaviors in artificial agents?

We study this question in the context of unsupervised reinforcement learning, which deals with the problem of acquiring complex behaviors and skills with no supervision (labels) or incentives (external rewards). Many previously proposed unsupervised reinforcement learning methods focus on novelty-seeking behaviors (Schmidhuber, 1991; Lehman & Stanley, 2011; Still & Precup, 2012; Bellemare et al., 2016; Houthooft et al., 2016; Pathak et al., 2017). Such methods can lead to meaningful behavior in simulated environments, such as video games, where interesting and novel events mainly happen when the agent executes a specific and coherent pattern of behavior. However, we posit that in more realistic open-world environments, forces outside of the agent's control already offer an excellent source of novelty: whether from other agents or from the environment itself, agents in these settings must contend with a constant stream of unexpected events. In such settings, rejecting perturbations and maintaining a steady equilibrium may pose a greater challenge than novelty seeking.

Based on this observation, we devise an algorithm, surprise minimizing reinforcement learning (SMiRL), that specifically aims to reduce the entropy of the states visited by the agent. SMiRL maintains an estimate of the distribution of visited states, p_θ(s), and a policy that seeks to reach likely future states under p_θ(s). After each action, p_θ(s) is updated with the new state, while the policy is conditioned on the parameters of this distribution to construct a stationary MDP. We illustrate this with a diagram in Figure 1a. We empirically evaluate SMiRL in a range of domains that subject the agent to persistent disturbances, including Tetris, Doom, humanoid balancing, and maze navigation amid hostile agents.
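To make this loop concrete, the sketch below pairs a simple running-Gaussian density model with the surprise reward r_t = log p_θ(s_{t+1}). This is a minimal illustration under stated assumptions, not the paper's implementation: the class and helper names are hypothetical, the diagonal Gaussian is one assumed choice of density model, and any standard RL algorithm could consume the resulting reward and augmented observation.

```python
import numpy as np

class GaussianDensityModel:
    """Running diagonal-Gaussian estimate of p_theta(s) over visited states.

    A deliberately simple stand-in for the density model described in the
    text; the choice of model is an assumption of this sketch.
    """

    def __init__(self, state_dim):
        self.n = 0
        self.mean = np.zeros(state_dim)
        self.m2 = np.zeros(state_dim)  # sum of squared deviations (Welford)

    def update(self, s):
        # Online (Welford) update of the per-dimension mean and variance.
        self.n += 1
        delta = s - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (s - self.mean)

    def log_prob(self, s):
        # Diagonal-Gaussian log-density; epsilon keeps early variances finite.
        var = self.m2 / max(self.n, 1) + 1e-6
        return float(-0.5 * np.sum(np.log(2.0 * np.pi * var)
                                   + (s - self.mean) ** 2 / var))


def smirl_reward(density, next_state):
    # The agent is rewarded for reaching states that are likely under the
    # current model, i.e. for minimizing surprise: r_t = log p_theta(s_{t+1}).
    return density.log_prob(next_state)


def augmented_observation(density, s, t):
    # Conditioning the policy on the density model's parameters (and the
    # time step) keeps the augmented decision process stationary even
    # though p_theta itself changes within an episode.
    var = density.m2 / max(density.n, 1) + 1e-6
    return np.concatenate([s, density.mean, var, [float(t)]])


# Hypothetical episode loop (env and policy are placeholders):
#   density = GaussianDensityModel(state_dim)
#   s = env.reset()
#   for t in range(horizon):
#       a = policy(augmented_observation(density, s, t))
#       s = env.step(a)
#       r = smirl_reward(density, s)  # surprise reward replaces task reward
#       density.update(s)
```

Exposing the model parameters to the policy, rather than treating the evolving p_θ(s) as hidden state, is what lets the agent anticipate how its own visitation history will shape future surprise.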

