GOAL-AUXILIARY ACTOR-CRITIC FOR 6D ROBOTIC GRASPING WITH POINT CLOUDS

Anonymous

Abstract

6D robotic grasping beyond top-down bin-picking scenarios is a challenging task. Previous solutions based on 6D grasp synthesis with robot motion planning usually operate in an open-loop setting without considering perception feedback or the dynamics and contacts of objects, which makes them sensitive to grasp synthesis errors. In this work, we propose a novel method for learning closed-loop control policies for 6D robotic grasping using point clouds from an egocentric camera. We combine imitation learning and reinforcement learning in order to grasp unseen objects and handle the continuous 6D action space, where expert demonstrations are obtained from a joint motion and grasp planner. We introduce a goal-auxiliary actor-critic algorithm, which uses grasping goal prediction as an auxiliary task to facilitate policy learning. The supervision on grasping goals can be obtained from the expert planner for known objects or from hindsight goals for unknown objects. Overall, our learned closed-loop policy achieves over 90% success rates on grasping various ShapeNet objects and YCB objects in simulation. The policy also transfers well to the real world, with only one failure among ten grasps of different unseen objects in the presence of perception noise.¹

1. INTRODUCTION

Robotic grasping of arbitrary objects is a challenging task. A robot needs to deal with objects it has never seen before and generate a motion trajectory to grasp an object. Due to the complexity of the problem, the majority of works in the literature focus on bin-picking tasks, where top-down grasping is sufficient to pick up an object. Both grasp detection approaches (Redmon & Angelova, 2015; Pinto & Gupta, 2016; Mahler et al., 2017) and reinforcement learning-based methods (Kalashnikov et al., 2018; Quillen et al., 2018) have been introduced to tackle the top-down grasping problem. However, it is difficult for these methods to grasp objects in environments where 6D grasping is necessary, i.e., 3D translation and 3D rotation of the robot gripper, such as for a cereal box on a tabletop or in a cabinet. While 6D grasp synthesis has been studied using 3D models of objects (Miller & Allen, 2004; Eppner et al., 2019) and partial observations (ten Pas et al., 2017; Yan et al., 2018; Mousavian et al., 2019), these methods only generate 6D grasp poses of the robot gripper for an object, instead of generating a trajectory of the gripper pose to reach and grasp the object. As a result, a motion planner is needed to plan the grasping motion according to the grasp poses. Usually, the planned trajectory is executed in an open-loop fashion since re-planning is expensive, and perception feedback during grasping as well as the dynamics and contacts of the object are often ignored, which makes the grasping sensitive to grasp synthesis errors.

In this work, to overcome the limitations of the paradigm of 6D grasp synthesis followed by robot motion planning, we introduce a novel method for learning closed-loop 6D grasping policies from partially observed point clouds of objects. Our policy directly outputs the control action of the robot gripper, which is the relative 6D pose transformation of the gripper.
For the state representation, we adopt an egocentric view with a wrist camera mounted on the robot gripper, which avoids self-occlusion of the robot arm during grasping compared to using an external static camera. Additionally, we aggregate point clouds of the object from previous time steps to resolve ambiguities in the current view and encode the observation history. Our point cloud representation provides richer 3D information for 6D grasping and generalizes better to different objects than RGB images.

We propose to combine Imitation Learning (IL) and Reinforcement Learning (RL) in learning the 6D grasping policy. Since RL requires exploration of the state space, the chance of lifting an object is very rare. Moreover, the target object can easily fall down with a bad contact, and the ego-view camera may lose the object during grasping. Different from previous works (Song et al., 2020; Young et al., 2020) that specifically design a robot gripper and collect human demonstrations using it, we obtain demonstrations using a joint motion and grasp planner (Wang et al., 2020) in simulation. Consequently, we can efficiently obtain a large number of 6D grasping trajectories on ShapeNet objects (Chang et al., 2015) with the planner. We then learn the grasping policy based on the Deep Deterministic Policy Gradient (DDPG) algorithm (Lillicrap et al., 2015), an actor-critic RL algorithm that can utilize off-policy data from demonstrations. More importantly, we introduce a goal prediction auxiliary task to improve policy learning, where the actor and the critic in DDPG are also trained to predict the final 6D grasping pose of the robot gripper. The supervision on goal prediction comes from the expert planner for objects with known 3D shape and pose. For unknown objects without 3D models available, we can still obtain grasping goals from successful grasping rollouts of the policy, i.e., hindsight goals. This property enables our learned policy to be finetuned on unknown objects, which is critical for continual learning in the real world. Figure 1 illustrates our setting for learning 6D grasping policies.

We conduct thorough analyses and evaluations of our method for 6D grasping. We show that our learned policy can successfully grasp unseen ShapeNet objects and unseen YCB objects (Calli et al., 2015) in simulation, and that finetuning the policy with hindsight goals on unknown YCB objects improves the grasping success rate. In the real world, we utilize a recent unseen object instance segmentation method (Xiang et al., 2020) to segment the point cloud of the target object, and then apply GA-DDPG for grasping. The learned policy successfully grasps YCB objects in the real world, failing only once among ten grasps of different unseen objects in the presence of perception noise.
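The point cloud aggregation described above can be illustrated with a short sketch. This is a minimal illustration, not the paper's implementation: it assumes each per-step cloud is expressed in that step's camera frame and that camera-to-world poses are available (e.g., from forward kinematics of the wrist-mounted camera), and it simply transforms all past clouds into the latest camera frame before concatenating them.

```python
import numpy as np

def aggregate_point_clouds(clouds, poses):
    """Merge per-step egocentric point clouds into the current camera frame.

    clouds: list of (N_i, 3) arrays, each in the camera frame of its own step.
    poses:  list of (4, 4) camera-to-world transforms, one per step.
    Returns a (sum N_i, 3) array expressed in the latest camera frame.
    """
    world_to_current = np.linalg.inv(poses[-1])
    merged = []
    for cloud, pose in zip(clouds, poses):
        T = world_to_current @ pose  # step's camera frame -> current camera frame
        homog = np.hstack([cloud, np.ones((len(cloud), 1))])
        merged.append((homog @ T.T)[:, :3])
    return np.concatenate(merged, axis=0)
```

In practice the aggregated cloud would also be downsampled to a fixed number of points before being fed to a point cloud encoder such as PointNet.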
Overall, our contributions are threefold: 1) we propose to use point clouds as the state representation and demonstrations from a joint motion and grasp planner to learn closed-loop 6D grasping policies; 2) we introduce the Goal-Auxiliary DDPG (GA-DDPG) algorithm for joint imitation and reinforcement learning using goals; 3) we propose to use hindsight goals for finetuning a pre-trained policy on unknown objects without goal supervision.
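To make the goal-auxiliary idea concrete, the following sketch shows one way an actor objective could combine an imitation term on the 6-DoF action with an auxiliary term on the predicted final grasp pose. The loss forms, the flattened pose representation, and the weight `aux_weight` are illustrative assumptions here, not the paper's exact formulation, which additionally involves the DDPG critic.

```python
import numpy as np

def goal_auxiliary_actor_loss(pred_action, expert_action,
                              pred_goal, expert_goal, aux_weight=0.1):
    """Illustrative actor loss for a goal-auxiliary setup.

    pred_action, expert_action: flattened 6-DoF relative gripper actions.
    pred_goal, expert_goal:     flattened final grasp poses (the auxiliary task).
    Returns behavior cloning loss + weighted auxiliary goal prediction loss.
    """
    bc_loss = np.mean((pred_action - expert_action) ** 2)    # imitate the planner
    goal_loss = np.mean((pred_goal - expert_goal) ** 2)      # predict the grasp goal
    return bc_loss + aux_weight * goal_loss
```

The auxiliary head shares the point cloud encoder with the policy, so gradients from goal prediction shape the same representation the actor and critic use.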

2. RELATED WORK

Combining Imitation Learning and Reinforcement Learning. For a high-dimensional continuous state-action space with sparse rewards and complex dynamics, as in most real-world robotic settings, model-free RL provides a data-driven approach to solve the task (Kalashnikov et al., 2018; Quillen et al., 2018), but it requires a large number of interactions even with full-state information. Therefore, many works have proposed to combine imitation learning (Osa et al., 2018) with RL.



¹ Videos and code are available at https://sites.google.com/view/gaddpg



Figure 1: Our method learns the 6D grasping policy with a goal-auxiliary task using an egocentric camera (goals denoted as green forks). We combine imitation learning with a planner (green trajectory) and reinforcement learning for known objects. When finetuning the policy with unknown objects, we use hindsight goals from successful episodes (red trajectory) as supervision. The policy learned in simulation can be successfully applied to the real world for grasping unseen objects.
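The hindsight-goal supervision mentioned above can be sketched as a simple relabeling step. This is an illustrative assumption about the mechanics, not the paper's implementation: after a rollout on an unknown object, if the grasp succeeded, the final achieved gripper pose is used as the goal label for every step of that episode.

```python
import numpy as np

def hindsight_goal_labels(gripper_poses, success):
    """Relabel a rollout with its own outcome.

    gripper_poses: (T, 4, 4) gripper poses along the episode.
    success:       whether the object was lifted at the end.
    Returns (T, 4, 4) goal labels (the final achieved pose broadcast to
    every step), or None for failed episodes, which give no supervision.
    """
    if not success:
        return None
    final_pose = gripper_poses[-1]
    return np.repeat(final_pose[None], len(gripper_poses), axis=0)
```

Because these labels come from the policy's own successful executions, no 3D model or expert planner is needed, which is what allows finetuning on unknown objects.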

For example, Rajeswaran et al. (2017) augment the policy gradient update with demonstration data to circumvent reward shaping and the exploration challenge. Zhu et al. (2018) use inverse RL to learn dexterous manipulation tasks from a few human demonstrations in simulation. The closest related works to ours are Vecerik et al. (2017) and Nair et al. (2018a), which utilize demonstration data with off-policy RL. Despite the focus on different tasks, the main difference is that our demonstrations are

