EQUIVARIANT DESCRIPTOR FIELDS: SE(3)-EQUIVARIANT ENERGY-BASED MODELS FOR END-TO-END VISUAL ROBOTIC MANIPULATION LEARNING

Abstract

End-to-end learning for visual robotic manipulation is known to suffer from sample inefficiency, requiring large numbers of demonstrations. Spatial roto-translation equivariance, i.e., SE(3)-equivariance, can be exploited to improve the sample efficiency of learning robotic manipulation. In this paper, we present SE(3)-equivariant models for visual robotic manipulation from point clouds that can be trained fully end-to-end. By utilizing the representation theory of the Lie group SE(3), we construct novel SE(3)-equivariant energy-based models that allow highly sample-efficient end-to-end learning. We show that our models can learn from scratch without prior knowledge and yet are highly sample efficient (5∼10 demonstrations are enough). Furthermore, we show that our models can generalize to tasks with (i) previously unseen target object poses, (ii) previously unseen target object instances of the category, and (iii) previously unseen visual distractors. We experiment with 6-DoF robotic manipulation tasks to validate our models' sample efficiency and generalizability.

1. INTRODUCTION

Learning robotic manipulation from scratch often involves learning from mistakes, making real-world applications highly impractical (Kalashnikov et al., 2018; Levine et al., 2016; Lee & Choi, 2022). Learning from demonstration (LfD) methods (Ravichandar et al., 2020; Argall et al., 2009) are advantageous because they do not involve trial and error, but expert demonstrations are often rare and expensive to collect. Therefore, auxiliary pipelines such as pose estimation (Zeng et al., 2017; Deng et al., 2020), segmentation (Simeonov et al., 2021), or pre-trained object representations (Florence et al., 2018; Kulkarni et al., 2019) are commonly used to improve data efficiency. However, collecting sufficient data for training such pipelines is often burdensome or infeasible in practice.
Recently, roto-translation equivariance has been explored for sample-efficient robotic manipulation learning. Transporter Networks (Zeng et al., 2020) achieve high sample efficiency in end-to-end visual robotic manipulation learning by exploiting SE(2)-equivariance (planar roto-translation equivariance). However, the efficiency of Transporter Networks is limited to planar tasks due to the lack of full SE(3)-equivariance (spatial roto-translation equivariance). In contrast, Neural Descriptor Fields (NDFs) (Simeonov et al., 2021) can achieve few-shot-level sample efficiency in learning highly spatial tasks by exploiting SE(3)-equivariance. Moreover, trained NDFs can generalize to previously unseen object instances (in the same category) in unseen poses. However, unlike Transporter Networks, NDFs cannot be trained end-to-end from demonstrations. The neural networks of NDFs need to be pre-trained with auxiliary self-supervised learning tasks and cannot be fine-tuned for the demonstrated tasks. Furthermore, NDFs can only be used with well-segmented point cloud inputs and fixed placement targets.
These limitations make it difficult to apply NDFs when 1) no public dataset is available for pre-training on the specific target object category, 2) well-segmented object point clouds cannot be expected, or 3) the placement target is not fixed. 1
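To make the SE(3)-equivariance property concrete: a descriptor field f is SE(3)-equivariant if transforming the input point cloud by a rigid transformation g = (R, t) transforms the output by the same g, i.e., f(g·x) = g·f(x). The following is a minimal illustrative sketch (not the paper's model) that verifies this identity for a toy descriptor, the point-cloud centroid, under a random rigid transformation; the function and variable names are hypothetical.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def centroid_descriptor(points: np.ndarray) -> np.ndarray:
    # Toy descriptor (hypothetical): the centroid of a point cloud.
    # It is SE(3)-equivariant: moving the cloud by (R, t) moves the
    # centroid by exactly the same (R, t).
    return points.mean(axis=0)

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))               # toy point cloud
R = Rotation.random(random_state=0).as_matrix()  # random rotation matrix
t = rng.normal(size=3)                           # random translation

transformed = points @ R.T + t                   # g . x : transform the cloud

lhs = centroid_descriptor(transformed)           # f(g . x)
rhs = R @ centroid_descriptor(points) + t        # g . f(x)
assert np.allclose(lhs, rhs)                     # equivariance identity holds
```

The paper's energy-based models are built so that this identity holds by construction for learned descriptors, which is what allows demonstrations seen in one pose to transfer to arbitrary unseen poses.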

1 Code availability: https://github.com/tomato1mule/edf

