SELF-SUPERVISED CATEGORY-LEVEL ARTICULATED OBJECT POSE ESTIMATION WITH PART-LEVEL SE(3) EQUIVARIANCE

Abstract

Category-level articulated object pose estimation aims to estimate a hierarchy of articulation-aware object poses of an unseen articulated object from a known category. To reduce the heavy annotations needed by supervised learning methods, we present a novel self-supervised strategy that solves this problem without any human labels. Our key idea is to factorize canonical shapes and articulated object poses from input articulated shapes through part-level equivariant shape analysis. Specifically, we first introduce the concept of part-level SE(3) equivariance and devise a network to learn features with this property. Then, through a carefully designed fine-grained pose-shape disentanglement strategy, we expect canonical spaces that support pose estimation to emerge automatically. We can thus predict articulated object poses as per-part rigid transformations describing how parts transform from their canonical part spaces to the camera space. Extensive experiments demonstrate the effectiveness of our method on both complete and partial point clouds from synthetic and real articulated object datasets. The project page with code and more information can be found at: equi-articulated-pose.github.io.

1. INTRODUCTION

Articulated object pose estimation is a crucial and fundamental computer vision problem with a wide range of applications in robotics, human-object interaction, and augmented reality Katz & Brock (2008). Given a collection of unsegmented articulated objects in various articulation states with different object poses, our goal is to design a network that can acquire a category-level articulated object pose understanding in a self-supervised manner, without any human labels such as pose annotations, segmentation labels, or reference frames for pose definition. The self-supervised category-level articulated object pose estimation problem is highly ill-posed, since it requires knowledge of object structure and per-part poses, which are usually entangled with part shapes. Very few previous works try to solve this problem or even similar ones. The most related attempt is the work of Li et al. (2021). It tackles the unsupervised category-level pose estimation problem, but only for rigid objects. It leverages SE(3) equivariant shape analysis to disentangle the global object pose and shape information so that a category-aligned canonical object space can emerge. This way, category-level object poses can be learned automatically by predicting a transformation from the canonical space to the camera space. Going beyond rigid objects, estimating articulated object poses demands more than global pose and shape disentanglement. It requires a more fine-grained disentanglement of part shape, object structure (such as part adjacency relationships), joint states, part poses, and so on. To achieve such fine-grained disentanglement, we propose to leverage part-level SE(3) equivariant shape analysis. In particular, we introduce the concept of part-level SE(3) equivariant features to equip equivariance with a spatial support.
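As a toy illustration of the part-level equivariance concept just introduced (not the paper's learned features, which come from a pose-aware equivariant convolution), consider a hand-crafted feature: each point's coordinates relative to its part's centroid. Such a feature rotates along with its own part but is unaffected by the motion of other parts; all names below are illustrative.

```python
import numpy as np

def part_feature(points, labels):
    """Toy part-level equivariant feature: coordinates relative to each
    point's part centroid. Illustrative only -- not the paper's method."""
    feats = np.empty_like(points)
    for p in np.unique(labels):
        mask = labels == p
        feats[mask] = points[mask] - points[mask].mean(axis=0)
    return feats

rng = np.random.default_rng(0)
pts = rng.normal(size=(8, 3))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Rigidly move part 0 only: rotate 90 degrees about z, then translate.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
moved = pts.copy()
moved[:4] = pts[:4] @ R.T + np.array([0.5, 0.0, 0.0])

f_before, f_after = part_feature(pts, labels), part_feature(moved, labels)
# Part 0's features transform with their parent part (equivariance) ...
assert np.allclose(f_after[:4], f_before[:4] @ R.T)
# ... while part 1's features are untouched by part 0's motion.
assert np.allclose(f_after[4:], f_before[4:])
```

An object-level equivariant feature would not satisfy the second property, since it mixes information across parts.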
The part-level SE(3) equivariant feature of a local region should change only as its parent part transforms and should not be influenced by the transformations of other parts. This is in contrast to an object-level SE(3) equivariant feature of a local region, which is influenced both by the region's parent part and by other parts. To densely extract part-level SE(3) equivariant features from an articulated shape, we propose a novel pose-aware equivariant point convolution operator. Based on such features, we are able to achieve a fine-grained disentanglement that learns three types of information from input shapes: 1) Canonical part shapes, which are invariant to input pose or articulation changes and are category-aligned to provide a consistent reference frame for part poses; 2) Object structure, which is also invariant to input pose or articulation changes and contains structural information about part adjacency relationships, part transformation order, and joint parameters such as pivot points; 3) Articulated object pose, which is composed of a series of estimated transformations: per-part rigid transformations that assemble canonical part shapes into a canonical object shape, per-part articulated transformations that articulate the canonical object shape to match the input articulation state, and a base part rigid transformation that transforms the articulated canonical object to the camera space. To enable such disentanglement, we guide the network learning through a self-supervised part-by-part shape reconstruction task that combines the disentangled information to recover the input shapes. With the above self-supervised disentanglement strategy, our method demonstrates, for the first time, the possibility of estimating articulated object poses in a self-supervised way.
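The chain of transformations in item 3) can be sketched as a composition of homogeneous matrices: assemble a part into the canonical object, articulate it, then move the whole object into camera space. The function and variable names below are ours for illustration, not the paper's notation.

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack a 3x3 rotation and a translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def part_to_camera(T_base, T_art, T_assemble, pts):
    """Map (N, 3) canonical-part-space points into camera space:
    assemble -> articulate -> transform the whole object to the camera."""
    T = T_base @ T_art @ T_assemble
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    return (homo @ T.T)[:, :3]

# Example: a part at the canonical origin, assembled one unit along x,
# then articulated by a 90-degree joint rotation about z.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
T_assemble = to_homogeneous(np.eye(3), np.array([1.0, 0.0, 0.0]))
T_art = to_homogeneous(Rz, np.zeros(3))
out = part_to_camera(np.eye(4), T_art, T_assemble, np.zeros((1, 3)))
# out is approximately [[0., 1., 0.]]
```

Note that the order matters: articulation acts on the assembled canonical object, and the base transform acts last, consistent with the description above.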
Extensive experiments prove its effectiveness on both complete and partial point clouds from various categories covering both synthetic and real datasets. On the Part-Mobility Dataset Wang et al. (2019b), our method, without the need for any human annotations, already outperforms an iterative pose estimation strategy equipped with ground-truth segmentation masks in both the complete and partial settings by a large margin, e.g., reducing the rotation estimation error by around 30 degrees on complete shapes and by 40 degrees on partial shapes. Besides, our method can perform on par with or even better than supervised methods like NPCS Li et al. (2020a). For instance, we achieve an average rotation estimation error of 7.9° on complete shapes, comparable to NPCS's 5.8° error. We can even outperform NPCS on some specific categories such as partial Eyeglasses. Finally, we prove the effectiveness of our part-level SE(3) equivariance design and the fine-grained disentanglement strategy in the ablation study. Our main contributions are summarized as follows:
• To our best knowledge, we are the first to tackle the self-supervised articulated object pose estimation problem.
• We design a pose-aware equivariant point convolution operator to learn part-level SE(3) equivariant features.
• We propose a self-supervised framework to achieve the disentanglement of canonical shape, object structure, and articulated object poses.
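The rotation errors quoted above are angular distances in degrees. The paper's exact evaluation protocol is not reproduced here, but a standard way to compute such an error is the geodesic distance between the predicted and ground-truth rotation matrices:

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (in degrees) between two 3x3 rotation matrices.
    A common metric for rotation errors; may differ from the paper's
    exact protocol."""
    cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    # Clip guards against numerical drift outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# A 90-degree rotation about z against the identity:
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(rotation_error_deg(Rz, np.eye(3)))  # → 90.0
```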

2. RELATED WORKS

Unsupervised Part Decomposition for 3D Objects. Decomposing an observed 3D object shape into parts in an unsupervised manner is a recent interest in shape representation learning. Previous works typically adopt a generative shape reconstruction task to self-supervise the shape decomposition. They often represent parts via learnable primitive shapes Tulsiani et al. (2017); Kawana et al. (2020); Yang & Chen (2021); Paschalidou et al. (2021); Deng et al. (2020); Zhu et al. (2020); Chen et al. (2020) or non-primitive-based implicit field representations Chen et al. (2019); Kawana et al. (2021). Shape alignment is a common assumption of such methods for achieving consistent decomposition across different shapes.

Articulated Object Pose Estimation. Pose estimation for articulated objects aims to acquire a fine-grained understanding of target articulated objects at both the object level and the part level. The prior work Li et al. (2020a) proposes to estimate object orientations, joint parameters, and per-part poses; later works include Mu et al. (2021); Labbé et al. (2021); Jiang et al. (2022); Goyal et al. (2022); Li et al. (2020b). Different from 6D pose estimation for rigid objects Tremblay et al. (2018); Xiang et al. (2017); Sundermeyer et al. (2018); Wang et al. (2019a), articulated object pose estimation requires a hierarchical pose understanding on both the object level and the part level Li et al. (2020a). This problem has long been studied at the instance level, where an exact CAD model is required to understand the pose of a specific instance. Recently, there is a trend toward estimating category-level object pose so that algorithms can generalize to novel instances. Despite such merits, supervised category-level approaches assume rich annotations that are extremely expensive to acquire Li et al. (2020a); Chi & Song (2021); Liu et al. (2022a). To get rid of such restrictions, we tackle this problem under a self-supervised setting instead.

