SELF-SUPERVISED CATEGORY-LEVEL ARTICULATED OBJECT POSE ESTIMATION WITH PART-LEVEL SE(3) EQUIVARIANCE

Abstract

Category-level articulated object pose estimation aims to estimate a hierarchy of articulation-aware object poses of an unseen articulated object from a known category. To reduce the heavy annotation burden of supervised learning methods, we present a novel self-supervised strategy that solves this problem without any human labels. Our key idea is to factorize canonical shapes and articulated object poses from input articulated shapes through part-level equivariant shape analysis. Specifically, we first introduce the concept of part-level SE(3) equivariance and devise a network to learn features with this property. Then, through a carefully designed fine-grained pose-shape disentanglement strategy, canonical spaces that support pose estimation can be induced automatically. We can then predict articulated object poses as per-part rigid transformations describing how parts transform from their canonical part spaces to the camera space. Extensive experiments demonstrate the effectiveness of our method on both complete and partial point clouds from synthetic and real articulated object datasets. The project page with code and more information can be found at: equi-articulated-pose.github.io.

1. INTRODUCTION

Articulated object pose estimation is a crucial and fundamental computer vision problem with a wide range of applications in robotics, human-object interaction, and augmented reality Katz & Brock (2008). Given a collection of unsegmented articulated objects in various articulation states with different object poses, our goal is to design a network that can acquire a category-level articulated object pose understanding in a self-supervised manner, without any human labels such as pose annotations, segmentation labels, or reference frames for pose definition. The self-supervised category-level articulated object pose estimation problem is highly ill-posed, since it requires knowledge of object structure and per-part poses, which are usually entangled with part shapes. Very few previous works try to solve this problem or even similar ones. The most related attempt is the work of Li et al. (2021). It tackles the unsupervised category-level pose estimation problem, but only for rigid objects. It leverages SE(3) equivariant shape analysis to disentangle the global object pose and shape information so that a category-aligned canonical object space can emerge. In this way, category-level object poses can be learned automatically by predicting a transformation from the canonical space to the camera space. Going beyond rigid objects, estimating
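The notion of an articulated pose as a set of per-part rigid transformations from canonical part spaces to the camera space can be illustrated numerically: given canonical/camera correspondences for each part, that part's SE(3) pose is recoverable by a per-part least-squares (Kabsch) alignment. The sketch below is a toy illustration of this pose parameterization under synthetic correspondences, not the paper's self-supervised network; all names and shapes are assumptions for the example.

```python
import numpy as np

def random_rotation(rng):
    # Random SO(3) matrix via QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0
    return q

def kabsch(canonical, observed):
    """Least-squares rigid transform (R, t) with observed ~= canonical @ R.T + t."""
    mu_c = canonical.mean(axis=0)
    mu_o = observed.mean(axis=0)
    h = (canonical - mu_c).T @ (observed - mu_o)
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))        # reflection guard
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = mu_o - r @ mu_c
    return r, t

rng = np.random.default_rng(0)
# Toy articulated object: two parts, each a point set in its canonical part space.
parts_canonical = [rng.standard_normal((50, 3)) for _ in range(2)]

# Ground-truth articulated pose: one rigid transform per part.
poses = [(random_rotation(rng), rng.standard_normal(3)) for _ in parts_canonical]
parts_camera = [pts @ r.T + t for pts, (r, t) in zip(parts_canonical, poses)]

# Recover each part's pose from its canonical/camera correspondences.
for pts_c, pts_o, (r_gt, t_gt) in zip(parts_canonical, parts_camera, poses):
    r, t = kabsch(pts_c, pts_o)
    assert np.allclose(r, r_gt, atol=1e-6)
    assert np.allclose(t, t_gt, atol=1e-6)
```

In the paper's setting the canonical part spaces are not given but must be induced by the network; the sketch only shows why, once such spaces exist, the articulated pose reduces to a per-part SE(3) estimate.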



; Mu et al. (2021); Labbé et al. (2021); Jiang et al. (2022); Goyal et al. (2022); Li et al. (2020b). Unlike 6D pose estimation for rigid objects Tremblay et al. (2018); Xiang et al. (2017); Sundermeyer et al. (2018); Wang et al. (2019a), articulated object pose estimation requires a hierarchical pose understanding at both the object level and the part level Li et al. (2020a). This problem has long been studied at the instance level, where an exact CAD model is required to understand the pose of a specific instance. Recently, there has been a trend toward estimating category-level object poses so that the algorithm can generalize to novel instances. Despite such merits, supervised category-level approaches typically assume rich annotations that are extremely expensive to acquire Li et al. (2020a); Chi & Song (2021); Liu et al. (2022a). To get rid of such restrictions, we instead tackle this problem in a self-supervised setting.

