ASYMMETRIC SELF-PLAY FOR AUTOMATIC GOAL DISCOVERY IN ROBOTIC MANIPULATION

Abstract

We train a single, goal-conditioned policy that can solve many robotic manipulation tasks, including tasks with previously unseen goals and objects. We rely on asymmetric self-play for goal discovery, where two agents, Alice and Bob, play a game. Alice is asked to propose challenging goals and Bob aims to solve them. We show that this method can discover highly diverse and complex goals without any human priors. Bob can be trained with only sparse rewards, because the interaction between Alice and Bob results in a natural curriculum and Bob can learn from Alice's trajectory when relabeled as a goal-conditioned demonstration. Finally, our method scales, resulting in a single policy that can generalize to many unseen tasks such as setting a table, stacking blocks, and solving simple puzzles.
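For concreteness, one round of the Alice/Bob game can be sketched as follows. This is a minimal illustration under assumed interfaces (`env`, `alice`, `bob`, `env.reset_to`, and `env.goal_reached` are placeholders), not the implementation used in this paper; it only captures the reward structure and the relabeling step described above.

```python
# Illustrative sketch of one asymmetric self-play round.
# The environment/policy interfaces below are assumptions for exposition,
# not the interfaces used in this work.

def self_play_round(env, alice, bob, alice_horizon=100, bob_horizon=200):
    # Alice plays from a random initial state; her final object
    # configuration becomes the goal proposed to Bob.
    state = env.reset()
    init_state = state
    alice_traj = []
    for _ in range(alice_horizon):
        action = alice.act(state)
        alice_traj.append((state, action))
        state = env.step(action)
    goal = state  # Alice's final configuration is the proposed goal.

    # Bob starts from the same initial state and must reach Alice's goal.
    # His reward is sparse: success only if the goal configuration is matched.
    state = env.reset_to(init_state)
    bob_success = False
    for _ in range(bob_horizon):
        state = env.step(bob.act(state, goal))
        if env.goal_reached(state, goal):
            bob_success = True
            break

    # Alice is rewarded when Bob fails, so she is pushed to propose goals
    # just beyond Bob's current ability -- a natural curriculum.
    alice_reward = 0.0 if bob_success else 1.0
    bob_reward = 1.0 if bob_success else 0.0

    # When Bob fails, Alice's own trajectory can be relabeled as a
    # goal-conditioned demonstration of (state, goal, action) tuples and
    # used as imitation data for Bob.
    demo = None if bob_success else [(s, goal, a) for s, a in alice_traj]
    return alice_reward, bob_reward, demo
```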

1. INTRODUCTION

We are motivated to train a single goal-conditioned policy (Kaelbling, 1993) that can solve any robotic manipulation task that a human may request in a given environment. In this work, we make progress towards this goal by solving a robotic manipulation problem in a table-top setting, where the robot's task is to change the initial configuration of a variable number of objects on a table to match a given goal configuration. This problem is simple in its formulation but likely to challenge a wide variety of cognitive abilities of a robot as objects become diverse and goals become complex.

Motivated by the recent success of deep reinforcement learning for robotics (Levine et al., 2016; Gu et al., 2017; Hwangbo et al., 2019; OpenAI et al., 2019a), we tackle this problem using deep reinforcement learning on a very large training distribution. An open question in this approach is how we can build a training distribution rich enough to achieve generalization to many unseen manipulation tasks. This involves defining both an environment's initial state distribution and a goal distribution. The initial state distribution determines how we sample a set of objects and their configuration at the beginning of an episode, and the goal distribution defines how we sample target states given an initial state (a rough illustrative sketch of this factorization is given after Figure 1). In this work, we focus on a scalable way to define a rich goal distribution.

The research community has started to explore automated ways of defining goal distributions. For example, previous works have explored learning a generative model of goal distributions (Florensa et al., 2018; Nair et al., 2018b; Racaniere et al., 2020) and collecting teleoperated robot trajectories



Figure 1: (a) We train a policy that controls a robot arm operating in a table-top setting. (b) Randomly placed ShapeNet (Chang et al., 2015) objects constitute an initial state distribution for training. (c) We use multiple manually designed holdout tasks to evaluate the learned policy.
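As a complement to the discussion above, the following sketch illustrates how a training episode could be specified once the two distributions are separated: the initial state distribution samples a set of objects and their placement, and the goal distribution samples a target configuration conditioned on that initial state. The sampler and helper names are illustrative assumptions; in this work the goal proposer is learned through self-play rather than hand-designed.

```python
import random

def sample_episode_spec(object_library, max_objects, propose_goal):
    """Illustrative factorization of the training distribution into an
    initial state distribution and a goal distribution (names assumed)."""
    # Initial state distribution: which objects appear and where they start.
    num_objects = random.randint(1, max_objects)
    objects = random.sample(object_library, num_objects)
    initial_state = {
        # Placeholder (x, y) positions on the table surface.
        obj: (random.uniform(-0.3, 0.3), random.uniform(-0.2, 0.2))
        for obj in objects
    }

    # Goal distribution: a target configuration conditioned on the initial
    # state. Here `propose_goal` stands in for the goal proposer; in this
    # work it is Alice's self-play policy rather than a hand-designed sampler.
    goal_state = propose_goal(initial_state)
    return initial_state, goal_state
```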

