MANISKILL2: A UNIFIED BENCHMARK FOR GENERALIZABLE MANIPULATION SKILLS

Abstract

Generalizable manipulation skills, which can be composed to tackle longhorizon and complex daily chores, are one of the cornerstones of Embodied AI. However, existing benchmarks, mostly composed of a suite of simulatable environments, are insufficient to push cutting-edge research works because they lack object-level topological and geometric variations, are not based on fully dynamic simulation, or are short of native support for multiple types of manipulation tasks. To this end, we present ManiSkill2, the next generation of the SAPIEN ManiSkill benchmark, to address critical pain points often encountered by researchers when using benchmarks for generalizable manipulation skills. ManiSkill2 includes 20 manipulation task families with 2000+ object models and 4M+ demonstration frames, which cover stationary/mobile-base, single/dualarm, and rigid/soft-body manipulation tasks with 2D/3D-input data simulated by fully dynamic engines. It defines a unified interface and evaluation protocol to support a wide range of algorithms (e.g., classic sense-plan-act, RL, IL), visual observations (point cloud, RGBD), and controllers (e.g., action type and parameterization). Moreover, it empowers fast visual input learning algorithms so that a CNN-based policy can collect samples at about 2000 FPS with 1 GPU and 16 processes on a regular workstation. It implements a render server infrastructure to allow sharing rendering resources across all environments, thereby significantly reducing memory usage. We open-source all codes of our benchmark (simulator, environments, and baselines) and host an online challenge open to interdisciplinary researchers.



Figure 1 : ManiSkill2 provides a unified, fast, and accessible system that encompasses well-curated manipulation tasks (e.g., stationary/mobile-base, single/dual-arm, rigid/soft-body).

1. INTRODUCTION

Mastering human-like manipulation skills is a fundamental but challenging problem in Embodied AI, which is at the nexus of vision, learning, and robotics. Remarkably, once humans have learnt to manipulate a category of objects, they are able to manipulate unseen objects (e.g., with different appearances and geometries) of the same category in unseen configurations (e.g., initial poses). We refer such abilities to interact with a great variety of even unseen objects in different configurations as generalizable manipulation skills. Generalizable manipulation skills are one of the cornerstones of Embodied AI, which can be composed to tackle long-horizon and complex daily chores (Ahn et al., 2022; Gu et al., 2022) . To foster further interdisciplinary and reproducible research on generalizable manipulation skills, it is crucial to build a versatile and public benchmark that focuses on object-level topological and geometric variations as well as practical manipulation challenges. However, most prior benchmarks are insufficient to support and evaluate progress in learning generalizable manipulation skills. In this work, we present ManiSkill2, the next generation of SAPIEN ManiSkill Benchmark (Mu et al., 2021) , which extends upon fully simulated dynamics, a large variety of articulated objects, and large-scale demonstrations from the previous version. Moreover, we introduce significant improvements and novel functionalities, as shown below. 1) A Unified Benchmark for Generic and Generalizable Manipulation Skills: There does not exist a standard benchmark to measure different algorithms for generic and generalizable manipulation skills. It is largely due to well-known challenges to build realistically simulated environments with diverse assets. Many benchmarks bypass critical challenges as a trade-off, by either adopting abstract grasp (Ehsani et al., 2021; Szot et al., 2021; Srivastava et al., 2022) or including few object-level variations (Zhu et al., 2020; Yu et al., 2020; James et al., 2020) . Thus, researchers usually have to make extra efforts to customize environments due to limited functionalities, which in turn makes reproducible comparison difficult. For example, Shridhar et al. ( 2022) modified 18 tasks in James et al. ( 2020) to enable few variations in initial states. Besides, some benchmarks are biased towards a single type of manipulation, e.g., 4-DoF manipulation in Yu et al. (2020) . To address such pain points, ManiSkill2 includes a total of 20 verified and challenging manipulation task families of multiple types (stationary/mobile-base, single/dual-arm, rigid/softbody), with over 2000 objects and 4M demonstration frames, to support generic and generalizable manipulation skills. All the tasks are implemented in a unified OpenAI Gym (Brockman et al., 2016) interface with fully-simulated dynamic interaction, supporting multiple observation modes (point cloud, RGBD, privileged state) and multiple controllers. A unified protocol is defined to evaluate a wide range of algorithms (e.g., sense-plan-act, reinforcement and imtation learning) on both seen and unseen assets as well as configurations. In particular, we implement a cloud-based evaluation system to publicly and fairly compare different approaches. 2) Real-time Soft-body Environments: When operating in the real world, robots face not only rigid bodies, but many types of soft bodies, such as cloth, water, and soil. Many simulators have supported robotic manipulation with soft body simulation. For example, MuJoCo (Todorov et al., 2012) and Bullet Coumans & Bai (2016-2021) use the finite element method (FEM) to enable the simulation of rope, cloth, and elastic objects. However, FEM-based methods cannot handle large deformation and topological changes, such as scooping flour or cutting dough. Other environments, like SoftGym (Lin et al., 2020) and ThreeDWorld (Gan et al., 2020) , are based on Nvidia Flex, which can simulate large deformations, but cannot realistically simulate elasto-plastic material, e.g., clay. PlasticineLab (Huang et al., 2021) deploys the continuum-mechanics-based material point method (MPM), but it lacks the ability to couple with rigid robots, and its simulation and rendering performance have much room for improvement. We have implemented a custom GPU MPM simulator from scratch using Nvidia's Warp (Macklin, 2022) JIT framework and native CUDA for high efficiency and customizability. We further extend Warp's functionality to support more efficient host-device communication. Moreover, we have supported a 2-way dynamics coupling interface that enables any rigid-body simulation framework to interact with the soft bodies, allowing robots and assets in ManiSkill2 to interact with soft-body simulation seamlessly. To our knowledge, ManiSkill2 is the first embodied AI environment to support 2-way coupled rigid-MPM simulation, and also the first to support real-time simulation and rendering of MPM material. 3) Multi-controller Support and Conversion of Demonstration Action Spaces: Controllers transform policies' action outputs into motor commands that actuate the robot, which define the action space of a task. Martín-Martín et al. (2019); Zhu et al. (2020) show that the choice of action



† and * indicate equal contribution (in alphabetical order). See Appendix H for contributions. Project website: https://maniskill2.github.io/ Codes: https://github.com/haosulab/ManiSkill2 Challenge website: https://sapien.ucsd.edu/challenges/maniskill/2022/

