SELF-SUPERVISED GEOMETRIC CORRESPONDENCE FOR CATEGORY-LEVEL 6D OBJECT POSE ESTIMATION IN THE WILD

Abstract

While 6D object pose estimation has wide applications across computer vision and robotics, it remains far from solved due to the lack of annotations. The problem becomes even more challenging when moving to category-level 6D pose estimation, which requires generalization to unseen instances. Current approaches are restricted by relying on annotations collected from humans or generated in simulation. In this paper, we overcome this barrier by introducing a self-supervised learning approach trained directly on large-scale real-world object videos for category-level 6D pose estimation in the wild. Our framework reconstructs the canonical 3D shape of an object category and learns dense correspondences between input images and the canonical shape via surface embedding. For training, we propose novel geometric cycle-consistency losses that construct cycles across 2D-3D spaces, across different instances, and across different time steps. The learned correspondence can be applied to 6D pose estimation and other downstream tasks such as keypoint transfer. Surprisingly, our method, without any human annotations or simulators, achieves on-par or even better performance than previous supervised or semi-supervised methods on in-the-wild images. Code and videos are available at https://kywind.github.io/self-pose.

1. INTRODUCTION

Object 6D pose estimation is a long-standing problem in computer vision and robotics. In instance-level 6D pose estimation, a model is trained to estimate the 6D pose of a single instance given its 3D shape template (He et al., 2020; Xiang et al., 2017; Oberweger et al., 2018). To generalize to unseen objects and remove the requirement of 3D CAD templates, approaches for category-level 6D pose estimation have been proposed (Wang et al., 2019b). However, learning a generalizable model requires a large amount of data and supervision. A common solution in most approaches (Wang et al., 2019b; Tian et al., 2020; Chen et al., 2020a; 2021; Lin et al., 2021) is to leverage both real-world (Wang et al., 2019b) and simulation labels (Wang et al., 2019b; Chang et al., 2015) for training. While real-world labels are limited given the high cost of 3D annotation, we can generate as many annotations as we want in simulation for free. However, it is very hard for a simulator to model the large diversity of in-the-wild objects, which introduces a large sim-to-real gap when transferring a model trained on synthetic data. Although real-world labels are hard to obtain, large-scale object data is much easier to collect (Fu & Wang, 2022). In this paper, we propose a self-supervised learning approach that trains directly on large-scale unlabeled object-centric videos for category-level 6D pose estimation. Our method does not require any 6D pose annotations from simulation or human labor, which allows the trained model to generalize to in-the-wild data. Given a 3D object shape prior for each category, our model learns dense 2D-3D correspondences between input image pixels and 3D points on the categorical shape prior, namely geometric correspondence. The object 6D pose can then be solved from the correspondence pairs and the depth map using a pose fitting algorithm.
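To illustrate the final pose-fitting step: once 2D-3D correspondence pairs are established and the 2D pixels are back-projected to 3D with the depth map, a similarity transform (rotation, translation, and scale, since category-level pose conventionally includes scale) can be fit in closed form with the Umeyama algorithm. The following is a minimal NumPy sketch of such a solver; the function name and interface are our own illustration, not the paper's implementation:

```python
import numpy as np

def umeyama_pose(src, dst):
    """Fit scale s, rotation R, translation t such that dst ~= s * R @ src + t,
    via the Umeyama closed-form solution.
    src: (N, 3) canonical 3D points matched by the correspondence field.
    dst: (N, 3) observed points (pixels back-projected using the depth map).
    Hypothetical helper for illustration only."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)              # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # handle reflection case
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)       # variance of source points
    scale = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - scale * R @ mu_s
    return scale, R, t
```

In practice such a solver is typically wrapped in RANSAC to reject outlier correspondences before the final fit.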
We propose a novel Categorical Surface Embedding (CSE) representation, a feature field defined over the surface of the categorical canonical object mesh. Every vertex of the canonical mesh is encoded into a feature embedding to form the CSE. Given an input image, we use an image encoder to map pixel features into the same embedding space. By computing the similarity between the 3D vertex embeddings and the pixel embeddings, we obtain the 2D-3D geometric correspondence. With this correspondence, we lift the 2D image texture onto the 3D mesh, and project the textured mesh back to a 2D RGB object image, segmentation mask, and depth map using differentiable rendering. We train with reconstruction losses that compare these renderings to the 2D ground truths.

However, the reconstruction tasks alone do not provide enough constraints for learning the high-dimensional correspondence. To facilitate optimization during training, we propose novel losses that establish cycle-consistency across 2D and 3D space. Within a single instance, our network estimates the 2D-3D dense correspondence ϕ using CSE, and the global rigid transformation π for projecting 3D shapes to 2D using another encoder. Given a 2D pixel, we first find its corresponding 3D vertex with ϕ and then project it back to 2D with π; the projected location should be consistent with the starting pixel location. A similar loss can be designed by starting the cycle from a 3D vertex. This provides an instance cycle-consistency loss for training. Beyond a single instance, we also design cycles that go across different object instances within the same category. Given instances A and B, this cycle includes a forward pass across the 3D space and a backward pass across the 2D space: (i) Forward pass: starting from a 2D pixel in instance A, we find its corresponding 3D vertex using ϕ.
This 3D vertex directly corresponds to a 3D point location on instance B, since the mesh is defined in canonical space. The located 3D point is then projected back to 2D using the transformation π of instance B. (ii) Backward pass: we leverage self-supervised pre-trained DINO features (Caron et al., 2021) to find the 2D correspondence between instances A and B. The 2D pixel located in instance B during the forward pass is mapped back to its corresponding location in instance A using this 2D correspondence, which provides a cross-instance cycle-consistency loss. The same formulation extends naturally to videos, where A and B are the same instance at different time steps, which additionally provides a cross-time cycle-consistency loss.

We conduct our experiments on the in-the-wild dataset Wild6D (Fu & Wang, 2022). Surprisingly, our self-supervised approach, trained directly on unlabeled data, performs on par with or even better than state-of-the-art approaches that leverage both 3D annotations and simulation. We visualize the 6D pose estimation and the geometric correspondence map in Figure 1. Besides Wild6D, we also train and evaluate our model on the REAL275 (Wang et al., 2019b) dataset and show competitive results against fully supervised approaches. Finally, we evaluate the CSE representation on keypoint transfer tasks and achieve state-of-the-art results.

We highlight our main contributions as follows:
• To the best of our knowledge, this is the first work that enables self-supervised training for 6D pose estimation in the wild.
• We propose a framework that learns a novel Categorical Surface Embedding representation for 6D pose estimation.
• We propose novel cycle-consistency losses for training the Categorical Surface Embedding.
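The cycle-consistency losses described above can be sketched in a few lines. The sketch below is a simplified toy version under our own assumptions, not the paper's implementation: ϕ is realized as a softmax over pixel-vertex embedding similarities, π is an arbitrary 3D-to-2D projection callable, and the DINO-based 2D correspondence is reduced to nearest-neighbor feature matching; all names, shapes, and the temperature value are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_correspondence(pix_feat, vert_feat, tau=0.07):
    """Soft 2D->3D correspondence phi: similarity between pixel embeddings
    (P, C) and CSE vertex embeddings (V, C), normalized over vertices."""
    return softmax(pix_feat @ vert_feat.T / tau, axis=1)      # (P, V)

def instance_cycle_loss(phi, verts, pix_xy, project):
    """2D -> 3D -> 2D cycle within one instance: map each pixel to its
    expected canonical 3D point via phi, project back with pi ('project'),
    and penalize deviation from the starting pixel location."""
    reproj = project(phi @ verts)                             # (P, 2)
    return ((reproj - pix_xy) ** 2).sum(axis=1).mean()

def cross_instance_cycle_loss(phi_a, verts, pix_xy_a, project_b,
                              dino_a, xy_a, dino_b, xy_b):
    """Forward: pixels of A -> canonical 3D points (phi_a) -> projected into
    image B with pi_B. Backward: each projected location is matched back to A
    through 2D (DINO-style) feature correspondence; a consistent cycle should
    close on the starting pixel."""
    proj_b = project_b(phi_a @ verts)                         # (P, 2) in B
    # snap each projected location to the nearest pixel of B's feature grid
    d = ((proj_b[:, None, :] - xy_b[None]) ** 2).sum(-1)
    feat_b = dino_b[d.argmin(axis=1)]                         # (P, C)
    # backward pass: nearest-neighbor feature matching from B back to A
    back_a = xy_a[(feat_b @ dino_a.T).argmax(axis=1)]         # (P, 2)
    return ((back_a - pix_xy_a) ** 2).sum(axis=1).mean()
```

The cross-time variant reuses `cross_instance_cycle_loss` with A and B taken as two frames of the same video; in a real training loop these operations would be differentiable tensor ops (e.g. soft rather than hard nearest-neighbor matching) so gradients can flow to the embeddings.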



† Work done while an intern at UC San Diego.



Figure 1: Examples of self-supervised category-level 6D pose estimation in the wild. We propose a novel Categorical Surface Embedding representation to learn categorical 2D-3D geometric correspondences. For each example, we visualize the object 6D pose and its correspondence map.

