ASYMMETRIC SELF-PLAY FOR AUTOMATIC GOAL DISCOVERY IN ROBOTIC MANIPULATION

Abstract

We train a single, goal-conditioned policy that can solve many robotic manipulation tasks, including tasks with previously unseen goals and objects. We rely on asymmetric self-play for goal discovery, where two agents, Alice and Bob, play a game. Alice is asked to propose challenging goals and Bob aims to solve them. We show that this method can discover highly diverse and complex goals without any human priors. Bob can be trained with only sparse rewards, because the interaction between Alice and Bob results in a natural curriculum and Bob can learn from Alice's trajectory when relabeled as a goal-conditioned demonstration. Finally, our method scales, resulting in a single policy that can generalize to many unseen tasks such as setting a table, stacking blocks, and solving simple puzzles.

1. INTRODUCTION

We are motivated to train a single goal-conditioned policy (Kaelbling, 1993) that can solve any robotic manipulation task that a human may request in a given environment. In this work, we make progress towards this goal by solving a robotic manipulation problem in a table-top setting, where the robot's task is to change the initial configuration of a variable number of objects on a table to match a given goal configuration. This problem is simple in its formulation but likely to challenge a wide variety of a robot's cognitive abilities as objects become diverse and goals become complex.

Motivated by the recent success of deep reinforcement learning for robotics (Levine et al., 2016; Gu et al., 2017; Hwangbo et al., 2019; OpenAI et al., 2019a), we tackle this problem using deep reinforcement learning on a very large training distribution. An open question in this approach is how to build a training distribution rich enough to achieve generalization to many unseen manipulation tasks. This involves defining both an environment's initial state distribution and a goal distribution. The initial state distribution determines how we sample a set of objects and their configuration at the beginning of an episode, and the goal distribution defines how we sample target states given an initial state. In this work, we focus on a scalable way to define a rich goal distribution.

The research community has started to explore automated ways of defining goal distributions. For example, previous works have explored learning a generative model of goal distributions (Florensa et al., 2018; Nair et al., 2018b; Racaniere et al., 2020) and collecting teleoperated robot trajectories to identify goals (Lynch et al., 2020; Gupta et al., 2020). In this paper, we extend an alternative approach called asymmetric self-play (Sukhbaatar et al., 2018b;a) for automated goal generation. Asymmetric self-play trains two RL agents named Alice and Bob.
Alice learns to propose goals that Bob is likely to fail at, and Bob, a goal-conditioned policy, learns to solve the proposed goals. Alice proposes a goal by manipulating objects, and Bob has to solve the goal starting from the same initial state as Alice. By embodying both agents in the same robotic hardware, this setup ensures that every proposed goal comes with at least one solution: Alice's trajectory.

There are two main reasons why we consider asymmetric self-play a promising goal generation and learning method. First, any proposed goal is achievable, meaning that there exists at least one trajectory that Bob can follow to achieve it. Because of this property, we can exploit Alice's trajectory to provide additional learning signal to Bob via behavioral cloning. This additional signal alleviates the overhead of heuristically designing a curriculum or shaping rewards for learning. Second, the approach does not require labor-intensive data collection.

In this paper, we show that asymmetric self-play can be used to train a goal-conditioned policy for complex object manipulation tasks, and that the learned policy zero-shot generalizes to many manually designed holdout tasks, which consist of previously unseen goals, previously unseen objects, or both. To the best of our knowledge, this is the first work to demonstrate zero-shot generalization to many previously unseen tasks by training purely with asymmetric self-play.¹
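The Alice/Bob interaction, Alice's failure-based reward, and the relabeling of Alice's trajectory as a demonstration can be sketched as follows. This is a minimal, illustrative sketch, not the paper's implementation: the toy 1-D environment and its `reset`/`set_state`/`step`/`is_achieved` interface are hypothetical, and both policies are stubbed as plain functions.

```python
import numpy as np

class Toy1DEnv:
    """Minimal stand-in environment: the 'object' is a point on a line
    that actions push left or right. Purely illustrative; this API is
    an assumption, not the paper's actual environment."""
    horizon = 8

    def reset(self):
        self.pos = 0.0
        return self.pos

    def set_state(self, s):
        self.pos = s
        return self.pos

    def step(self, action):
        self.pos += float(np.clip(action, -1.0, 1.0))
        return self.pos

    def is_achieved(self, s, goal, tol=0.5):
        return abs(s - goal) < tol

def self_play_episode(env, alice_act, bob_act, demo_buffer):
    """One round of asymmetric self-play between Alice and Bob."""
    s0 = env.reset()

    # Alice's turn: she manipulates the environment for a fixed horizon;
    # her final state becomes the goal she proposes to Bob.
    alice_traj, s = [], s0
    for _ in range(env.horizon):
        a = alice_act(s)
        s_next = env.step(a)
        alice_traj.append((s, a))
        s = s_next
    goal = s

    # Bob's turn: reset to the same initial state, condition on the goal,
    # and learn from a sparse success signal only.
    s, solved = env.set_state(s0), False
    for _ in range(env.horizon):
        s = env.step(bob_act(s, goal))
        if env.is_achieved(s, goal):
            solved = True
            break

    # Alice is rewarded only when Bob fails, which pushes her toward goals
    # at the frontier of Bob's ability (a natural curriculum). A failed
    # goal is relabeled: Alice's trajectory becomes a goal-conditioned
    # demonstration that Bob can imitate via behavioral cloning.
    if not solved:
        demo_buffer.extend((s_t, goal, a_t) for s_t, a_t in alice_traj)
    return 0.0 if solved else 1.0
```

Note the key property the paper relies on: because Alice physically produced the goal state, every relabeled tuple `(s_t, goal, a_t)` is a valid step of at least one solution trajectory for Bob.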

2. PROBLEM FORMULATION

Our training environment for robotic manipulation consists of a robot arm with a gripper attached and a wide range of objects placed on a table surface (Figure 1a, 1b). The goal-conditioned policy learns to control the robot to rearrange randomly placed objects (the initial state) into a specified goal configuration (Figure 1c). We aim to train a policy on a single training distribution and to evaluate its performance over a suite of holdout tasks which are independently designed and not explicitly present during training (Figure 2a). In this work, we construct the training distribution via asymmetric self-play (Figure 2b) to achieve generalization to many unseen holdout tasks (Figure 1c).

Mathematical formulation. Formally, we model the interaction between an environment and a goal-conditioned policy as a goal-augmented Markov decision process M = ⟨S, A, P, R, G⟩, where S is the state space, A is the action space, P : S × A × S → ℝ denotes the transition probability, G ⊆ S specifies the goal space, and R : S × G → ℝ is a goal-specific reward function. A goal-augmented trajectory is a sequence {(s₀, g, a₀, r₀), ..., (sₜ, g, aₜ, rₜ)}, where the goal is provided to the policy as part of the observation at every step. We say a goal is achieved if sₜ is sufficiently close to g (Appendix A.2). With a slightly overloaded notation, we define the goal distribution G(g|s₀) as the probability of a goal state g ∈ G conditioned on an initial state s₀ ∈ S.
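As a concrete illustration of the "sufficiently close" check above, here is a minimal sketch for a multi-object scene. The tolerance value and the array layout are assumptions for illustration; the paper's actual success criteria are the ones in its Appendix A.2.

```python
import numpy as np

# Hypothetical per-object position tolerance (meters); the paper's real
# "sufficiently close" criteria are specified in its Appendix A.2.
POS_TOL = 0.04

def goal_achieved(state_pos: np.ndarray, goal_pos: np.ndarray,
                  tol: float = POS_TOL) -> bool:
    """state_pos, goal_pos: (n_objects, 3) arrays of object positions.

    The goal g is achieved when every object in the current state s_t
    lies within `tol` of its target position in g."""
    dists = np.linalg.norm(state_pos - goal_pos, axis=-1)
    return bool(np.all(dists < tol))
```

This per-object thresholding is what makes the reward R sparse: Bob receives a success signal only once all objects simultaneously match the goal configuration.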



¹ Asymmetric self-play was proposed in Sukhbaatar et al. (2018b;a), but only as a way to supplement training, while the majority of training was conducted on target tasks; zero-shot generalization to unseen tasks was not evaluated.



Figure 1: (a) We train a policy that controls a robot arm operating in a table-top setting. (b) Randomly placed ShapeNet (Chang et al., 2015) objects constitute an initial state distribution for training. (c) We use multiple manually designed holdout tasks to evaluate the learned policy.

Figure 2: (a) We train a goal-conditioned policy on a single training distribution and evaluate its performance on many unseen holdout tasks. (b) To construct a training distribution, we sample an initial state from a predefined distribution and run a goal-setting policy (Alice) to generate a goal. In one episode, Alice is asked to generate 5 goals, and Bob attempts to solve them in sequence until it fails.

