EQUIVARIANT DESCRIPTOR FIELDS: SE(3)-EQUIVARIANT ENERGY-BASED MODELS FOR END-TO-END VISUAL ROBOTIC MANIPULATION LEARNING

Abstract

End-to-end learning for visual robotic manipulation is known to suffer from sample inefficiency, requiring large numbers of demonstrations. Spatial roto-translation equivariance, or SE(3)-equivariance, can be exploited to improve the sample efficiency of learning robotic manipulation. In this paper, we present SE(3)-equivariant models for visual robotic manipulation from point clouds that can be trained fully end-to-end. By utilizing the representation theory of Lie groups, we construct novel SE(3)-equivariant energy-based models that allow highly sample-efficient end-to-end learning. We show that our models can learn from scratch without prior knowledge and yet are highly sample efficient (5∼10 demonstrations are enough). Furthermore, we show that our models can generalize to tasks with (i) previously unseen target object poses, (ii) previously unseen target object instances of the category, and (iii) previously unseen visual distractors. We experiment with 6-DoF robotic manipulation tasks to validate our models' sample efficiency and generalizability.

1. INTRODUCTION

Learning robotic manipulation from scratch often involves learning from mistakes, making real-world applications highly impractical (Kalashnikov et al., 2018; Levine et al., 2016; Lee & Choi, 2022). Learning from demonstration (LfD) methods (Ravichandar et al., 2020; Argall et al., 2009) are advantageous because they do not involve trial and error, but expert demonstrations are often rare and expensive to collect. Therefore, auxiliary pipelines such as pose estimation (Zeng et al., 2017; Deng et al., 2020), segmentation (Simeonov et al., 2021), or pre-trained object representations (Florence et al., 2018; Kulkarni et al., 2019) are commonly used to improve data efficiency. However, collecting sufficient data to train such pipelines is often burdensome, and such data may be unavailable in practice.

Recently, roto-translation equivariance has been explored for sample-efficient robotic manipulation learning. Transporter Networks (Zeng et al., 2020) achieve high sample efficiency in end-to-end visual robotic manipulation learning by exploiting SE(2)-equivariance (planar roto-translation equivariance). However, the efficiency of Transporter Networks is limited to planar tasks because they lack full SE(3)-equivariance (spatial roto-translation equivariance). In contrast, Neural Descriptor Fields (NDFs) (Simeonov et al., 2021) can achieve few-shot-level sample efficiency in learning highly spatial tasks by exploiting SE(3)-equivariance. Moreover, trained NDFs can generalize to previously unseen object instances (in the same category) in unseen poses. However, unlike Transporter Networks, NDFs cannot be trained end-to-end from demonstrations. The neural networks of NDFs need to be pre-trained with auxiliary self-supervised learning tasks and cannot be fine-tuned for the demonstrated tasks. Furthermore, NDFs can only be used with well-segmented point cloud inputs and fixed placement targets. These limitations make it difficult to apply NDFs when (1) no public dataset is available for pre-training on the specific target object category, (2) well-segmented object point clouds cannot be expected, or (3) the placement target is not fixed.

To overcome such limitations, we present Equivariant Descriptor Fields (EDFs), the first end-to-end trainable and SE(3)-equivariant visual robotic manipulation models. EDFs can be trained fully end-to-end to solve highly spatial tasks from a few (5∼10) demonstrations without requiring any pre-training, object keypoint annotation, or segmentation. Like NDFs, EDFs can generalize to previously unseen target object instances in unseen poses. Furthermore, EDFs can generalize to unseen distracting objects and unseen placement poses (see Figure 1). Our contributions are as follows:

1. To enable end-to-end training, we reformulate the energy minimization problem of NDFs into a probabilistic learning framework with energy-based models on the SE(3)-manifold.
2. We generalize the invariant descriptors of NDFs into representation-theoretic equivariant descriptors. Using equivariant descriptors significantly improves generalizability owing to their orientational sensitivity.
3. We propose a novel energy function and end-to-end trainable query point models to achieve SE(3)-equivariance with respect to both the target object and the placement target.
4. EDFs do not resort to non-local mechanisms to achieve SE(3)-equivariance. This specific design enables our method to work well without object segmentation pipelines.
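To make the distinction between invariant and equivariant descriptors in contribution 2 concrete, the following minimal Python sketch compares two hand-written toy features of a point cloud under a rigid-body transformation x -> R x + t. It is an illustration only and is not the EDF architecture: the function names (`invariant_descriptor`, `equivariant_descriptor`), the toy point cloud, and the chosen transformation are all assumptions introduced here. The scalar feature is unchanged by the transformation, whereas the vector feature rotates together with the input and therefore retains orientation information.

```python
import math

def rot_z(theta):
    """3x3 rotation matrix about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_x(theta):
    """3x3 rotation matrix about the x-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def matmul(A, B):
    """Product of two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(R, v):
    """Matrix-vector product R v."""
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

def transform(R, t, points):
    """Apply g = (R, t) in SE(3) to every point: x -> R x + t."""
    return [[a + b for a, b in zip(apply(R, p), t)] for p in points]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def centroid(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(3)]

def invariant_descriptor(points):
    """Toy scalar feature: mean distance of the points to their centroid."""
    c = centroid(points)
    return sum(norm([p[i] - c[i] for i in range(3)]) for p in points) / len(points)

def equivariant_descriptor(points):
    """Toy vector feature: distance-weighted mean offset from the centroid."""
    c = centroid(points)
    out = [0.0, 0.0, 0.0]
    for p in points:
        offset = [p[i] - c[i] for i in range(3)]
        d = norm(offset)
        for i in range(3):
            out[i] += d * offset[i] / len(points)
    return out

# Hypothetical asymmetric point cloud and an arbitrary rigid-body transformation.
points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0],
          [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
R = matmul(rot_z(0.7), rot_x(-1.2))
t = [0.5, -2.0, 1.0]
moved = transform(R, t, points)

# Invariance: the scalar feature does not change under (R, t).
print(invariant_descriptor(points), invariant_descriptor(moved))
# Equivariance: the vector feature of the moved cloud equals R applied
# to the vector feature of the original cloud.
print(equivariant_descriptor(moved), apply(R, equivariant_descriptor(points)))
```

Both printed pairs should agree up to floating-point error. The translation t drops out of both features because they depend only on offsets from the centroid.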

2. BACKGROUND AND RELATED WORKS

Equivariant Robotic Manipulation. Equivariant models have emerged as a promising approach for robotic manipulation learning, with growing evidence indicating they can significantly improve both sample efficiency and generalizability (Wang & Walters, 2022; Wang et al., 2022). Transporter Networks and their variants (Zeng et al., 2020; Seita et al., 2021) are end-to-end models for visual robotic manipulation tasks that exploit planar roto-translation equivariance, or SE(2)-equivariance, for sample efficiency. Equivariant Transporter Networks (ETNs) (Huang et al., 2022) exploit the representation theory of discrete rotation groups to further improve the sample efficiency. However, the efficiency of SE(2)-equivariant models is limited to planar tasks and cannot be extended to highly spatial tasks. Neural Descriptor Fields (NDFs) (Simeonov et al., 2021) overcome this limitation by leveraging spatial roto-translation equivariance, or SE(3)-equivariance.

Energy-Based Models. Energy-based models (EBMs) are probabilistic models that are derived from energy functions. EBMs are widely used for image and video generation (Zhu & Mumford, 1998; Xie et al., 2016; 2017; Du & Mordatch, 2019), 3D geometry generation (Xie et al., 2018b; 2021a), internal learning (Zheng et al., 2021), and control (Xu et al., 2022; Florence et al., 2022). Because the integral in the denominator of EBMs is intractable, Markov chain Monte Carlo (MCMC) methods are commonly used to estimate the gradient of the log-denominator when maximizing the log-likelihood (Hinton, 2002; Carreira-Perpinan & Hinton, 2005). The Metropolis-Hastings (MH) algorithm (Hastings, 1970) and Langevin dynamics (Langevin, 1908; Welling & Teh, 2011) are widely used MCMC methods for EBMs on Euclidean spaces. However, typical Langevin dynamics cannot be used on non-Euclidean manifolds such as the SE(3) manifold.
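For readers unfamiliar with Langevin dynamics on Euclidean spaces, the following minimal Python sketch draws approximate samples from a one-dimensional EBM with density proportional to exp(-E(x)). It is a sketch under stated assumptions and not the sampler used in this work: the quadratic energy, the step size, the chain length, and all function names are hypothetical choices made for illustration, and the update rule is the plain unadjusted Euclidean one, x <- x - step * dE/dx + sqrt(2 * step) * noise, which does not carry over directly to SE(3).

```python
import math
import random

MU, SIGMA = 1.5, 0.5  # assumed parameters of the toy energy

def energy(x):
    """Toy energy E(x); exp(-E) is an unnormalized Gaussian N(MU, SIGMA^2)."""
    return 0.5 * ((x - MU) / SIGMA) ** 2

def grad_energy(x):
    """Analytic gradient dE/dx of the toy energy."""
    return (x - MU) / SIGMA ** 2

def langevin_sample(grad, x0, step, n_steps, rng):
    """Run one unadjusted Langevin chain and return its final state."""
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Gradient descent on the energy plus injected Gaussian noise.
        x = x - step * grad(x) + math.sqrt(2.0 * step) * noise
    return x

rng = random.Random(0)  # fixed seed for reproducibility
samples = [langevin_sample(grad_energy, x0=0.0, step=0.01, n_steps=300, rng=rng)
           for _ in range(2000)]

mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"sample mean = {mean:.3f}, sample std = {std:.3f}")
```

With a small step size, the sample mean and standard deviation should be close to MU and SIGMA. The finite step introduces a small bias, which a Metropolis-Hastings accept/reject step can remove.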
The Langevin dynamics on the SE(3) group and general Lie groups are studied by Brockett (1997); Chirikjian



Figure 1: Given a few (5∼10) demonstrations of a mug pick-and-place task, EDFs can be trained fully end-to-end without requiring any pre-training, object segmentation, or pose estimation pipelines. In addition, we show that EDFs can generalize to A) unseen poses, B) unseen instances of the target object category, and C) the presence of unseen visual distractors.

Availability

https://github.com/tomato1mule/edf 

