NEMO: NEURAL MESH MODELS OF CONTRASTIVE FEATURES FOR ROBUST 3D POSE ESTIMATION

Abstract

3D pose estimation is a challenging but important task in computer vision. In this work, we show that standard deep learning approaches to 3D pose estimation are not robust when objects are partially occluded or viewed from a previously unseen pose. Inspired by the robustness of generative vision models to partial occlusion, we propose to integrate deep neural networks with 3D generative representations of objects into a unified neural architecture that we term NeMo. In particular, NeMo learns a generative model of neural feature activations at each vertex on a dense 3D mesh. Using differentiable rendering, we estimate the 3D object pose by minimizing the reconstruction error between NeMo and the feature representation of the target image. To avoid local optima in the reconstruction loss, we train the feature extractor to maximize the distance between the individual feature representations on the mesh using contrastive learning. Our extensive experiments on PASCAL3D+, occluded-PASCAL3D+, and ObjectNet3D show that NeMo is much more robust to partial occlusion and unseen poses than standard deep networks, while retaining competitive performance on regular data. Interestingly, our experiments also show that NeMo performs reasonably well even when the mesh representation only crudely approximates the true object geometry with a cuboid, revealing that detailed 3D geometry is not needed for accurate 3D pose estimation.

1. INTRODUCTION

Object pose estimation is a fundamentally important task in computer vision with a multitude of real-world applications, e.g. in self-driving cars or partially autonomous surgical systems. Advances in the architecture design of deep convolutional neural networks (DCNNs) Tulsiani & Malik (2015); Su et al. (2015); Mousavian et al. (2017); Zhou et al. (2018) have increased the performance of computer vision systems at 3D pose estimation enormously. However, our experiments show that current 3D pose estimation approaches are not robust to partial occlusion or to objects viewed from a previously unseen pose. This lack of robustness can have serious consequences in real-world applications and therefore needs to be addressed by the research community. In general, recent works follow either of two approaches to object pose estimation. Keypoint-based approaches detect a sparse set of keypoints and subsequently align a 3D object representation to the detection result. However, due to the sparsity of the keypoints, these approaches are highly vulnerable when the keypoint detection result is affected by adverse viewing conditions, such as partial occlusion. Rendering-based approaches, on the other hand, utilize a generative model that is built on a dense 3D mesh representation of an object. They estimate the object pose by reconstructing the input image in a render-and-compare manner (Figure 1). While rendering-based approaches can be more robust to partial occlusion Egger et al. (2018), their core limitation is that they model objects in terms of image intensities. Therefore, they pay too much attention to object details that are not relevant for the 3D pose estimation task. This makes them difficult to optimize Blanz & Vetter (2003); Schönborn et al. (2017), and also requires a detailed mesh representation for every shape variant of an object class (e.g. they need several types of sedan meshes instead of one prototypical sedan mesh).
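To make the render-and-compare principle concrete, the following is a minimal toy sketch in NumPy. Everything in it is a hypothetical simplification: the "mesh" is a random point set, "rendering" is an orthographic projection, the pose is a single azimuth angle, and the gradient is approximated by finite differences rather than obtained through a differentiable renderer.

```python
import numpy as np

# Toy render-and-compare pose estimation: render the object representation
# under a candidate pose, compare it with the target, and descend the
# reconstruction loss w.r.t. the pose parameter.

rng = np.random.default_rng(0)
verts = rng.normal(size=(20, 3))  # vertices of a prototypical "mesh"

def rotate_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def render(theta):
    # Orthographic "rendering": rotate the mesh and keep the xy-coordinates.
    # A real pipeline would rasterize per-vertex features into a 2D map.
    return (verts @ rotate_z(theta).T)[:, :2]

def recon_loss(theta, target):
    # Reconstruction error between rendered and target representations.
    return np.sum((render(theta) - target) ** 2)

theta_true = 0.7
target = render(theta_true)  # stands in for the observed image's feature map

# Gradient descent on the pose parameter (finite-difference gradient).
theta, lr, eps = 0.0, 0.01, 1e-5
for _ in range(300):
    grad = (recon_loss(theta + eps, target) - recon_loss(theta - eps, target)) / (2 * eps)
    theta -= lr * grad

# theta now approximates theta_true
```

The loss here is smooth in the pose parameter, so plain gradient descent suffices; with pixel-intensity comparisons the loss landscape is far less benign, which is exactly the difficulty discussed above.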
In this work, we introduce NeMo, a rendering-based approach to 3D pose estimation that is highly robust to partial occlusion, while also being able to generalize to previously unseen views. Our key idea is to learn a generative model of an object category in terms of neural feature activations, instead of image intensities (Figure 1). In particular, NeMo is composed of a prototypical mesh representation of the object category and feature representations at each vertex of the mesh. The feature representations are learned to be invariant to instance-specific details (such as shape and color variations) that are not relevant for the 3D pose estimation task. Specifically, we use contrastive learning He et al. (2020); Wu et al. (2018); Bai et al. (2020) to ensure that the extracted features of an object are distinct from each other (e.g. the features of the front tire of a car are different from those of the back tire), while also being distinct from non-object features in the background. Furthermore, we train a generative model of the feature activations at every vertex of the mesh representation. During inference, NeMo estimates the object pose by reconstructing a target feature map using render-and-compare and gradient-based optimization w.r.t. the 3D object pose parameters. We evaluate NeMo at 3D pose estimation on the PASCAL3D+ Xiang et al. (2014) and ObjectNet3D Xiang et al. (2016) datasets. Both datasets contain a variety of rigid objects and their corresponding 3D CAD models. Our experimental results show that NeMo outperforms popular approaches such as StarMap Zhou et al. (2018) at 3D pose estimation by a wide margin under partial occlusion, and performs comparably when the objects are not occluded. Moreover, NeMo is exceptionally robust when objects are seen from a viewpoint that is not present in the training data. Interestingly, we also find that the mesh representation in NeMo can simply approximate the true object geometry with a cuboid and still perform very well.

Our main contributions are:

1. We propose a 3D neural mesh model of objects that is generative in terms of contrastive neural network features. This representation combines a prototypical geometric representation of the object category with a generative model of neural network features that are invariant to irrelevant object details.

2. We demonstrate that standard deep learning approaches to 3D pose estimation are highly sensitive to out-of-distribution data, including partial occlusions and unseen poses. In contrast, NeMo performs 3D pose estimation with exceptional robustness.

3. In contrast to other rendering-based approaches that require instance-specific mesh representations of the target objects, we show that NeMo achieves highly competitive 3D pose estimation performance even with a very crude prototypical approximation of the object geometry using a cuboid.

Figure 1: Traditional render-and-compare approaches render RGB images and make pixel-level comparisons. These are difficult to optimize due to the many local optima in the pixel-wise reconstruction loss. In contrast, NeMo is a neural mesh model that renders feature maps and compares them with feature maps obtained via a CNN backbone. The invariance of the neural features to nuisance variables, such as shape and color variations, enables robust 3D pose estimation with simple gradient-descent optimization of the neural reconstruction loss.

2. RELATED WORK

Category-Level Object Pose Estimation. Category-level object pose estimation has been well explored by the research community. A classical approach, as proposed by Tulsiani & Malik (2015) and Mousavian et al. (2017), was to formulate object pose estimation as a classification problem. Another common category-level object pose estimation approach involves a two-step process Szeto & Corso (2017); Pavlakos et al. (2017): first, semantic keypoints are detected independently, and subsequently a Perspective-n-Point problem is solved to find the optimal 3D pose of an object mesh.
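The contrastive objective over vertex features can be sketched as follows. This is a hypothetical simplification of He et al. (2020)-style losses: the names, shapes, and exact formulation here are illustrative, not the paper's.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def vertex_contrastive_loss(vertex_feats, background_feats, tau=0.07):
    """Push features at different vertices apart, and all vertex features
    away from background features, via a log-sum-exp over similarities."""
    v = normalize(vertex_feats)        # (N, D): one feature per mesh vertex
    b = normalize(background_feats)    # (M, D): features from the background
    sim_vv = v @ v.T / tau             # vertex-to-vertex similarities
    sim_vb = v @ b.T / tau             # vertex-to-background similarities
    np.fill_diagonal(sim_vv, -np.inf)  # a vertex is not its own negative
    negatives = np.concatenate([sim_vv, sim_vb], axis=1)
    # Minimizing this drives each vertex feature apart from all negatives.
    return np.mean(np.log(np.sum(np.exp(negatives), axis=1)))

# Distinct (orthogonal) vertex features yield a lower loss than collapsed ones,
# which is the property the feature extractor is trained for.
distinct = np.eye(4)
collapsed = np.ones((4, 4))
background = np.full((2, 4), 0.5)
```

Training the backbone to minimize such a loss is what makes the per-vertex feature representations mutually distinct (front tire vs. back tire) and distinct from the background, so that the render-and-compare reconstruction loss has few local optima.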

Code availability: https://github.com/Angtian/NeMo.

