NEMO: NEURAL MESH MODELS OF CONTRASTIVE FEATURES FOR ROBUST 3D POSE ESTIMATION

Abstract

3D pose estimation is a challenging but important task in computer vision. In this work, we show that standard deep learning approaches to 3D pose estimation are not robust when objects are partially occluded or viewed from a previously unseen pose. Inspired by the robustness of generative vision models to partial occlusion, we propose to integrate deep neural networks with 3D generative representations of objects into a unified neural architecture that we term NeMo. In particular, NeMo learns a generative model of neural feature activations at each vertex on a dense 3D mesh. Using differentiable rendering we estimate the 3D object pose by minimizing the reconstruction error between NeMo and the feature representation of the target image. To avoid local optima in the reconstruction loss, we train the feature extractor to maximize the distance between the individual feature representations on the mesh using contrastive learning. Our extensive experiments on PASCAL3D+, occluded-PASCAL3D+ and ObjectNet3D show that NeMo is much more robust to partial occlusion and unseen pose compared to standard deep networks, while retaining competitive performance on regular data. Interestingly, our experiments also show that NeMo performs reasonably well even when the mesh representation only crudely approximates the true object geometry with a cuboid, hence revealing that the detailed 3D geometry is not needed for accurate 3D pose estimation.
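To make the contrastive training objective in the abstract concrete, the sketch below shows an InfoNCE-style loss that pulls each image feature toward its corresponding vertex feature while pushing it away from all other vertex features. This is an illustrative toy, not NeMo's actual implementation; the function name, the softmax/cross-entropy formulation, and the temperature value are assumptions.

```python
import numpy as np

def contrastive_vertex_loss(img_feats, vertex_feats, temperature=0.07):
    """InfoNCE-style contrastive loss (illustrative, not NeMo's code).

    img_feats, vertex_feats: arrays of shape (n, d), where row i of
    img_feats is the extracted image feature at the pixel that vertex i
    projects to. Each image feature should be similar to its own vertex
    feature (the positive) and dissimilar to all others (the negatives).
    """
    # L2-normalise rows so dot products are cosine similarities
    a = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    b = vertex_feats / np.linalg.norm(vertex_feats, axis=1, keepdims=True)
    logits = a @ b.T / temperature                # (n, n) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives lie on the diagonal of the similarity matrix
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
# matching features on the diagonal -> low loss
low = contrastive_vertex_loss(feats, feats)
# unrelated features -> higher loss
high = contrastive_vertex_loss(feats, rng.normal(size=(8, 16)))
```

Maximizing the separation between vertex features in this way is what keeps the reconstruction loss free of spurious local optima: distinct vertices produce distinct feature responses, so a wrong pose cannot accidentally match.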

1. INTRODUCTION

Object pose estimation is a fundamentally important task in computer vision with a multitude of real-world applications, e.g. in self-driving cars or partially autonomous surgical systems. Advances in the architecture design of deep convolutional neural networks (DCNNs) (Tulsiani & Malik, 2015; Su et al., 2015; Mousavian et al., 2017; Zhou et al., 2018) have increased the performance of computer vision systems at 3D pose estimation enormously. However, our experiments show that current 3D pose estimation approaches are not robust to partial occlusion or to objects viewed from a previously unseen pose. This lack of robustness can have serious consequences in real-world applications and therefore needs to be addressed by the research community. In general, recent works follow one of two approaches to object pose estimation. Keypoint-based approaches detect a sparse set of keypoints and subsequently align a 3D object representation to the detection result. However, due to the sparsity of the keypoints, these approaches are highly vulnerable when the keypoint detection is affected by adverse viewing conditions, such as partial occlusion. Rendering-based approaches, on the other hand, utilize a generative model built on a dense 3D mesh representation of an object, and estimate the object pose by reconstructing the input image in a render-and-compare manner (Figure 1). While rendering-based approaches can be more robust to partial occlusion (Egger et al., 2018), their core limitation is that they model objects in terms of image intensities. They therefore pay too much attention to object details that are irrelevant for the 3D pose estimation task, which makes them difficult to optimize (Blanz & Vetter, 2003; Schönborn et al., 2017) and requires a detailed mesh representation for every shape variant of an object class (e.g. several types of sedan meshes instead of one prototypical sedan).
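The render-and-compare principle above can be sketched in a few lines: render the object representation under a candidate pose, measure the reconstruction error against the target, and descend on that error. The toy below replaces the differentiable feature renderer with a 2D rotation and uses a finite-difference gradient; all names and the one-parameter pose are assumptions for illustration, not the paper's method.

```python
import numpy as np

def render_features(vertices, theta):
    """Toy stand-in for a differentiable renderer: rotate 2D vertex
    positions by a single pose angle theta (illustrative only)."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return vertices @ R.T

def estimate_pose(vertices, target, n_steps=200, lr=0.1, eps=1e-4):
    """Render-and-compare: minimise the reconstruction error over the
    pose parameter with finite-difference gradient descent."""
    def loss(t):
        return np.mean((render_features(vertices, t) - target) ** 2)
    theta = 0.0
    for _ in range(n_steps):
        grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta

verts = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]])
true_theta = 0.6
target = render_features(verts, true_theta)   # "observed" features
est = estimate_pose(verts, target)
```

The reconstruction loss here is smooth in the pose parameter, so plain gradient descent recovers the true angle; in the full feature space, the contrastive training of the feature extractor plays the analogous role of keeping this loss landscape well behaved.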

Code available at https://github.com/Angtian/NeMo.

