LEARNING WITH AMIGO: ADVERSARIALLY MOTIVATED INTRINSIC GOALS

Abstract

A key challenge for reinforcement learning (RL) is learning in environments with sparse extrinsic rewards. In contrast to current RL methods, humans are able to learn new skills with little or no reward by using various forms of intrinsic motivation. We propose AMIGO, a novel agent incorporating, as a form of meta-learning, a goal-generating teacher that proposes Adversarially Motivated Intrinsic GOals to train a goal-conditioned "student" policy in the absence of (or alongside) environment reward. Specifically, through a simple but effective "constructively adversarial" objective, the teacher learns to propose increasingly challenging, yet achievable, goals that allow the student to learn general skills for acting in a new environment, independent of the task to be solved. We show that our method generates a natural curriculum of self-proposed goals which ultimately allows the agent to solve challenging procedurally-generated tasks on which other forms of intrinsic motivation and state-of-the-art RL methods fail.

1. INTRODUCTION

The success of Deep Reinforcement Learning (RL) on a wide range of tasks, while impressive, has so far been mostly confined to scenarios with reasonably dense rewards (e.g. Mnih et al., 2016; Vinyals et al., 2019), or to those where a perfect model of the environment can be used for search, such as the game of Go (e.g. Silver et al., 2016; Duan et al., 2016; Moravčík et al., 2017). Many real-world environments offer extremely sparse rewards, if any at all. In such environments, random exploration, which underpins many current RL approaches, is unlikely to yield sufficient reward signal to train an agent, or will be very sample-inefficient, as it requires the agent to stumble onto novel rewarding states by chance. In contrast, humans are capable of dealing with rewards that are sparse and lie far in the future. For example, to a child, the future adult life involving education, work, or marriage provides no useful reinforcement signal. Instead, children devote much of their time to play, generating objectives and posing challenges to themselves as a form of intrinsic motivation. Solving such self-proposed tasks encourages them to explore, experiment, and invent; sometimes, as in many games and fantasies, without any direct link to reality or to any source of extrinsic reward. This kind of intrinsic motivation might be a crucial feature for enabling learning in real-world environments (Schulz, 2012).

To address this discrepancy between naïve deep RL exploration strategies and human capabilities, we present a novel meta-learning method wherein part of the agent learns to self-propose Adversarially Motivated Intrinsic Goals (AMIGO). In AMIGO, the agent is decomposed into a goal-generating teacher and a goal-conditioned student policy. The teacher acts as a constructive adversary to the student: the teacher is incentivized to propose goals that are not too easy for the student to achieve, but not impossible either.
This results in a natural curriculum of increasingly harder intrinsic goals that challenge the agent and encourage learning about the dynamics of a given environment.

AMIGO can be viewed as an augmentation of any agent trained with policy-gradient-based methods. Under this view, the original policy network becomes the student policy, which only requires its input-processing component to be adapted to accept an additional goal-specification modality. The teacher policy can then be seen as a "bolt-on" to the original policy network, meaning that this method is (to the extent that the aforementioned goal-conditioning augmentation is possible) architecture-agnostic, and can be used with a variety of model architectures and training settings.

As advocated in recent work (Cobbe et al., 2019; Zhong et al., 2020; Risi & Togelius, 2019; Küttler et al., 2020), we evaluate AMIGO on procedurally-generated environments instead of trying to learn to perform a specific task. Procedurally-generated environments are challenging since agents have to deal with a parameterized family of tasks, resulting in large observation spaces where memorizing trajectories is infeasible. Instead, agents have to learn policies that generalize across different environment layouts and transition dynamics (Rajeswaran et al., 2017; Machado et al., 2018; Foley et al., 2018; Zhang et al., 2018). Concretely, we use MiniGrid (Chevalier-Boisvert et al., 2018), a suite of fast-to-run procedurally-generated environments with a symbolic/discrete observation space (expressed in terms of objects like walls, doors, keys, chests, and balls), which isolates the problem of exploration from that of visual perception. MiniGrid is a widely recognized and challenging benchmark for intrinsic motivation, used in many recent publications (e.g. Goyal et al., 2019; Bougie et al., 2019; Raileanu & Rocktäschel, 2020; Modhe et al., 2020).
We evaluate our method on six different tasks from the MiniGrid domain with varying degrees of difficulty, in which the agent needs to acquire a diverse range of skills in order to succeed. Furthermore, MiniGrid is complex enough that competitive baselines such as IMPALA, which achieves state-of-the-art results in other domains like Atari, fail on it. Raileanu & Rocktäschel (2020) found that MiniGrid presents a particular challenge even for existing state-of-the-art intrinsic motivation approaches. Here, AMIGO sets a new state of the art on some of the hardest MiniGrid environments, being the only method capable of obtaining extrinsic reward on some of them.

In summary, we make the following contributions: (i) we propose Adversarially Motivated Intrinsic GOals, an approach for learning a teacher that generates increasingly harder goals; (ii) we show, through 114 experiments on 6 challenging exploration tasks in procedurally-generated environments, that agents trained with AMIGO gradually learn to interact with the environment and solve tasks which are too difficult for state-of-the-art methods; and (iii) we perform an extensive qualitative analysis and ablation study.
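The "constructively adversarial" incentive described above (goals that are not too easy, but not impossible) can be sketched as a simple threshold-based teacher reward. The function below is an illustrative assumption, not the paper's exact objective: the threshold `t_star` and the reward magnitudes `alpha` and `beta` are hypothetical names and values chosen for clarity.

```python
def teacher_reward(student_reached_goal: bool, steps_taken: int,
                   t_star: int = 10, alpha: float = 1.0,
                   beta: float = 1.0) -> float:
    """Hypothetical 'constructively adversarial' teacher reward sketch.

    The teacher is rewarded when the student reaches the proposed goal
    only after more than t_star steps (challenging yet achievable), and
    penalized when the goal is reached quickly (too easy) or never
    reached (too hard). Names and constants are illustrative.
    """
    if student_reached_goal and steps_taken > t_star:
        return +alpha   # goal was hard enough: student needed many steps
    return -beta        # goal was too easy (reached fast) or unreachable
```

Under such an objective, raising `t_star` over the course of training would be one way to push the teacher toward ever harder goals, yielding the natural curriculum described above.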



Figure 1: Training with AMIGO consists of combining two modules: a goal-generating teacher and a goal-conditioned student policy, whereby the teacher provides intrinsic goals to supplement the extrinsic goals from the environment. In our experimental set-up, the teacher is a dimensionality-preserving convolutional network which, at the beginning of an episode, outputs a location in absolute (x, y) coordinates. These are provided as a one-hot indicator in an extra channel of the student's convolutional neural network, which in turn outputs the agent's actions.
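The goal-conditioning described in the caption above can be sketched as appending a one-hot plane to the student's symbolic observation. This is a minimal illustration under assumed conventions: the `(H, W, C)` tensor layout, the function name, and the indexing order are assumptions, not the paper's exact implementation.

```python
import numpy as np

def add_goal_channel(obs: np.ndarray, goal_xy: tuple) -> np.ndarray:
    """Append a one-hot goal channel to an (H, W, C) symbolic observation.

    The teacher's absolute (x, y) goal location is encoded as a single 1
    in an otherwise all-zero H x W plane, concatenated as an extra input
    channel for the student's convolutional network (cf. Figure 1).
    Illustrative sketch; the exact tensor layout is an assumption.
    """
    h, w, _ = obs.shape
    goal_plane = np.zeros((h, w, 1), dtype=obs.dtype)
    x, y = goal_xy
    goal_plane[y, x, 0] = 1          # row = y, column = x
    return np.concatenate([obs, goal_plane], axis=-1)
```

Because the goal enters only as an extra channel, the rest of the student network can remain unchanged, which is what makes the "bolt-on" view described in the introduction possible.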

