NEMO: NEURAL MESH MODELS OF CONTRASTIVE FEATURES FOR ROBUST 3D POSE ESTIMATION

Abstract

3D pose estimation is a challenging but important task in computer vision. In this work, we show that standard deep learning approaches to 3D pose estimation are not robust when objects are partially occluded or viewed from a previously unseen pose. Inspired by the robustness of generative vision models to partial occlusion, we propose to integrate deep neural networks with 3D generative representations of objects into a unified neural architecture that we term NeMo. In particular, NeMo learns a generative model of neural feature activations at each vertex on a dense 3D mesh. Using differentiable rendering we estimate the 3D object pose by minimizing the reconstruction error between NeMo and the feature representation of the target image. To avoid local optima in the reconstruction loss, we train the feature extractor to maximize the distance between the individual feature representations on the mesh using contrastive learning. Our extensive experiments on PASCAL3D+, occluded-PASCAL3D+ and ObjectNet3D show that NeMo is much more robust to partial occlusion and unseen pose compared to standard deep networks, while retaining competitive performance on regular data. Interestingly, our experiments also show that NeMo performs reasonably well even when the mesh representation only crudely approximates the true object geometry with a cuboid, hence revealing that the detailed 3D geometry is not needed for accurate 3D pose estimation. The code is publicly available at https://github.com/Angtian/NeMo.

1. INTRODUCTION

Object pose estimation is a fundamentally important task in computer vision with a multitude of real-world applications, e.g. in self-driving cars or partially autonomous surgical systems. Advances in the architecture design of deep convolutional neural networks (DCNNs) Tulsiani & Malik (2015); Su et al. (2015); Mousavian et al. (2017); Zhou et al. (2018) increased the performance of computer vision systems at 3D pose estimation enormously. However, our experiments show that current 3D pose estimation approaches are not robust to partial occlusion or to objects viewed from a previously unseen pose. This lack of robustness can have serious consequences in real-world applications and therefore needs to be addressed by the research community. In general, recent works follow either of two approaches to object pose estimation: Keypoint-based approaches detect a sparse set of keypoints and subsequently align a 3D object representation to the detection result. However, due to the sparsity of the keypoints, these approaches are highly vulnerable when the keypoint detection is affected by adverse viewing conditions, such as partial occlusion. Rendering-based approaches, on the other hand, utilize a generative model that is built on a dense 3D mesh representation of an object. They estimate the object pose by reconstructing the input image in a render-and-compare manner (Figure 1). While rendering-based approaches can be more robust to partial occlusion Egger et al. (2018), their core limitation is that they model objects in terms of image intensities. Therefore, they pay too much attention to object details that are not relevant for the 3D pose estimation task. This makes them difficult to optimize Blanz & Vetter (2003); Schönborn et al. (2017), and it also requires a detailed mesh representation for every shape variant of an object class (e.g. they need several types of sedan meshes instead of one prototypical sedan mesh).
Figure 1: Traditional render-and-compare approaches render RGB images and make pixel-level comparisons. These are difficult to optimize due to the many local optima in the pixel-wise reconstruction loss. In contrast, NeMo is a Neural Mesh Model that renders feature maps and compares them with feature maps obtained from a CNN backbone. The invariance of the neural features to nuisance variables, such as shape and color variations, enables robust 3D pose estimation with simple gradient-descent optimization of the neural reconstruction loss.

In this work, we introduce NeMo, a rendering-based approach to 3D pose estimation that is highly robust to partial occlusion, while also being able to generalize to previously unseen views. Our key idea is to learn a generative model of an object category in terms of neural feature activations, instead of image intensities (Figure 1). In particular, NeMo is composed of a prototypical mesh representation of the object category and feature representations at each vertex of the mesh. The feature representations are learned to be invariant to instance-specific details (such as shape and color variations) that are not relevant for the 3D pose estimation task. Specifically, we use contrastive learning He et al. (2020); Wu et al. (2018); Bai et al. (2020) to ensure that the extracted features of an object are distinct from each other (e.g. the features of the front tire of a car are different from those of the back tire), while also being distinct from non-object features in the background. Furthermore, we train a generative model of the feature activations at every vertex of the mesh representation. During inference, NeMo estimates the object pose by reconstructing a target feature map using render-and-compare and gradient-based optimization w.r.t. the 3D object pose parameters. We evaluate NeMo at 3D pose estimation on the PASCAL3D+ Xiang et al. (2014) and the ObjectNet3D Xiang et al. (2016) datasets.
Both datasets contain a variety of rigid objects and their corresponding 3D CAD models. Our experimental results show that NeMo outperforms popular approaches such as StarMap Zhou et al. (2018) at 3D pose estimation by a wide margin under partial occlusion, and performs comparably when the objects are not occluded. Moreover, NeMo is exceptionally robust when objects are seen from a viewpoint that is not present in the training data. Interestingly, we also find that the mesh representation in NeMo can crudely approximate the true object geometry with a cuboid and still perform very well. Our main contributions are:
1. We propose a 3D neural mesh model of objects that is generative in terms of contrastive neural network features. This representation combines a prototypical geometric representation of the object category with a generative model of neural network features that are invariant to irrelevant object details.
2. We demonstrate that standard deep learning approaches to 3D pose estimation are highly sensitive to out-of-distribution data, including partial occlusions and unseen poses. In contrast, NeMo performs 3D pose estimation with exceptional robustness.
3. In contrast to other rendering-based approaches that require instance-specific mesh representations of the target objects, we show that NeMo achieves highly competitive 3D pose estimation performance even with a very crude prototypical approximation of the object geometry using a cuboid.

3. NEMO: A 3D GENERATIVE MODEL OF NEURAL FEATURES

We denote the feature representation of an input image I as Φ(I) = F^l ∈ R^(H×W×D), where F^l is the output of layer l of a DCNN Φ, with D being the number of channels in layer l. f_i^l ∈ R^D is a feature vector in F^l at position i on the 2D lattice P of the feature map. In the remainder of this section we omit the superscript l for notational simplicity, because l is fixed a priori in our model.
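As a toy illustration of this notation only (the sizes below are made up and not the ones used by NeMo):

```python
import numpy as np

# Toy stand-in for the backbone output Phi(I): a feature map F of
# shape H x W x D, where D is the number of channels of layer l.
H, W, D = 16, 16, 128                 # illustrative sizes only
F = np.random.randn(H, W, D)          # F = Phi(I)
i = (3, 7)                            # a position i on the 2D lattice P
f_i = F[i]                            # feature vector f_i in R^D
```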

3.1. NEURAL RENDERING OF FEATURE MAPS

Similar to other graphics-based generative models, such as 3D morphable models Blanz & Vetter (1999); Egger et al. (2018), our model builds on a 3D mesh representation that is composed of a set of 3D vertices Γ = {v_r ∈ R^3 | r = 1, . . . , R}. For now, we assume the object mesh to be given at training time, but we will relax this assumption in later sections. Different from standard graphics-based generative models, we do not store RGB values at each mesh vertex r, but instead store feature vectors Θ = {θ_r ∈ R^D | r = 1, . . . , R}. Using standard rendering techniques, we can use this 3D neural mesh model N = {Γ, Θ} to render feature maps: F(m) = Render(N, m) ∈ R^(H×W×D), where m are the camera parameters for projecting the neural mesh representation (Figure 2).
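A minimal sketch of such feature-level rendering, assuming a pinhole camera with focal length f and a simple per-vertex z-buffer instead of the differentiable rasterizer NeMo actually uses (the function and argument names are ours):

```python
import numpy as np

def render_feature_map(vertices, thetas, Rmat, t, f, H, W):
    """Project mesh vertices with a pinhole camera and scatter their
    feature vectors theta_r into an H x W x D feature map.  A z-buffer
    keeps the closest vertex per pixel.  Simplified, non-differentiable
    stand-in for the renderer Render(N, m)."""
    D = thetas.shape[1]
    fmap = np.zeros((H, W, D))
    depth = np.full((H, W), np.inf)
    cam = vertices @ Rmat.T + t          # world -> camera coordinates
    for v, theta in zip(cam, thetas):
        if v[2] <= 0:                    # vertex behind the camera
            continue
        u = int(round(f * v[0] / v[2] + W / 2))   # column
        w = int(round(f * v[1] / v[2] + H / 2))   # row
        if 0 <= w < H and 0 <= u < W and v[2] < depth[w, u]:
            depth[w, u] = v[2]           # z-buffer: keep closest vertex
            fmap[w, u] = theta
    return fmap
```

In NeMo the rendering step is differentiable, so the reconstruction loss can later be optimized w.r.t. the camera parameters m.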

3.2. NEURAL MESH MODELS

Neural Mesh Models are probabilistic generative models of neural feature activations. Hence, our goal is to learn a generative model p(F | N_y) of the real-valued feature activations F of an object class y by leveraging a 3D neural mesh representation N_y. Assuming that the 3D pose m of the object in the input image is known, we define the likelihood of the feature representation F as:

p(F | N_y, m, B) = ∏_{i∈FG} p(f_i | N_y, m) · ∏_{i'∈BG} p(f_{i'} | B). (2)

The foreground FG is the set of all positions on the 2D lattice P of the feature map F that are covered by the rendered neural mesh model. We compute FG by projecting the 3D vertices of the mesh model Γ_y into the image using the ground-truth camera pose m, which yields the 2D locations of the visible vertices FG = {s_t ∈ R^2 | t = 1, . . . , T}. We define the foreground feature likelihoods to be Gaussian:

p(f_i | N_y, m) = (1 / (σ_r √(2π))) exp( −(1 / (2σ_r²)) ‖f_i − θ_r‖² ).

Note that the correspondence between the feature vector f_i in the feature map F and the vector θ_r on the neural mesh model is given by the 2D projection of N_y with camera parameters m. Those features that are not covered by the neural mesh model, BG = P \ FG, i.e. that are located in the background, are modeled by a Gaussian likelihood

p(f_{i'} | B) = (1 / (σ √(2π))) exp( −(1 / (2σ²)) ‖f_{i'} − β‖² ),

with background model parameters B = {β, σ}.
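The likelihood above can be sketched as follows, mirroring the per-vector Gaussian terms of Equation 2 (the helper and its arguments are our own naming; the foreground-vertex correspondences come from the 2D projection of the mesh):

```python
import numpy as np

def feature_log_likelihood(F, fg, corr, thetas, beta,
                           sigma_r=1.0, sigma_b=1.0):
    """ln p(F | N_y, m, B): Gaussian foreground terms for lattice
    positions covered by the rendered mesh, Gaussian background terms
    for all remaining positions.  fg: list of (row, col) foreground
    positions; corr[k]: index r of the vertex rendered at fg[k].
    Note: we reproduce the paper's per-vector normalizer 1/(sigma*sqrt(2*pi))."""
    H, W, _ = F.shape
    fg_set = set(fg)
    logp = 0.0
    for (y, x), r in zip(fg, corr):       # foreground: vertex Gaussians
        d2 = np.sum((F[y, x] - thetas[r]) ** 2)
        logp += -np.log(sigma_r * np.sqrt(2 * np.pi)) - d2 / (2 * sigma_r ** 2)
    for y in range(H):                    # background: clutter Gaussian
        for x in range(W):
            if (y, x) in fg_set:
                continue
            d2 = np.sum((F[y, x] - beta) ** 2)
            logp += -np.log(sigma_b * np.sqrt(2 * np.pi)) - d2 / (2 * sigma_b ** 2)
    return logp
```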

3.3. TRAINING USING MAXIMUM LIKELIHOOD AND CONTRASTIVE LEARNING

During training we optimize two objectives: 1) the parameters of the generative model defined in Equation 2 should be optimized to achieve maximum likelihood on the training data; 2) the backbone used for feature extraction Φ should be optimized to make the individual feature vectors as distinct from each other as possible.

Maximum likelihood estimation of the generative model. We optimize the parameters of the generative model to minimize the negative log-likelihood of our model (Equation 2):

L_ML(F, N_y, m, B) = −ln p(F | N_y, m, B)
= −Σ_{i∈FG} [ ln(1/(σ_r √(2π))) − (1/(2σ_r²)) ‖f_i − θ_r‖² ] − Σ_{i'∈BG} [ ln(1/(σ √(2π))) − (1/(2σ²)) ‖f_{i'} − β‖² ].

If we constrain the variances such that σ² = σ_r² = 1 for all r, the maximum likelihood loss reduces to

L_ML(F, N_y, m, B) = C + (1/2) ( Σ_{i∈FG} ‖f_i − θ_r‖² + Σ_{i'∈BG} ‖f_{i'} − β‖² ),

where C is a constant scalar.

Contrastive learning of the feature extractor. The general idea of the contrastive loss is to train the feature extractor such that the individual feature vectors on the object are distinct from each other, as well as from the background:

L_Feature(F, FG) = −Σ_{i∈FG} Σ_{i'∈FG\{i}} ‖f_i − f_{i'}‖² (9)
L_Back(F, FG, BG) = −Σ_{i∈FG} Σ_{j∈BG} ‖f_i − f_j‖².

The contrastive feature loss L_Feature encourages the features on the object to be distinct from each other (e.g. the feature vectors at the front tire of a car are different from those at the back tire). The contrastive background loss L_Back encourages the features on the object to be distinct from the features in the background. The overall loss used to train NeMo is:

L(F, N_y, m, B) = L_ML(F, N_y, m, B) + L_Feature(F, FG) + L_Back(F, FG, BG).
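A sketch of the three training terms with unit variances, dropping the additive constant C (the function and argument names are ours; in practice the sums run over mini-batches of feature maps):

```python
import numpy as np

def nemo_training_losses(F, fg, corr, thetas, beta):
    """Sketch of L_ML (unit variances, constant dropped), L_Feature and
    L_Back.  fg: list of (row, col) foreground positions; corr[k]: index
    of the vertex rendered at fg[k]."""
    H, W, _ = F.shape
    fg_set = set(fg)
    bg = [(y, x) for y in range(H) for x in range(W) if (y, x) not in fg_set]
    # L_ML: squared reconstruction error of foreground/background features
    l_ml = 0.5 * (
        sum(np.sum((F[p] - thetas[r]) ** 2) for p, r in zip(fg, corr))
        + sum(np.sum((F[p] - beta) ** 2) for p in bg))
    # L_Feature: push object features apart from each other
    l_feat = -sum(np.sum((F[p] - F[q]) ** 2)
                  for p in fg for q in fg if p != q)
    # L_Back: push object features away from background features
    l_back = -sum(np.sum((F[p] - F[q]) ** 2) for p in fg for q in bg)
    return l_ml, l_feat, l_back
```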

3.4. ROBUST 3D POSE ESTIMATION WITH RENDER AND COMPARE

After training the feature extractor and the generative model of NeMo, we apply the model to estimate the camera pose parameters m. In particular, we aim to optimize the model likelihood from Equation 2 w.r.t. the camera parameters in a render-and-compare manner. Following related work on robust inference with generative models Kortylewski (2017); Egger et al. (2018), we optimize a robust model likelihood:

p(F | N_y, m, B, Z) = ∏_{i∈FG} [ p(f_i | N_y, m) p(z_i=1) ]^{z_i} [ p(f_i | B) p(z_i=0) ]^{1−z_i} · ∏_{i'∈BG} p(f_{i'} | B).

Here z_i ∈ {0, 1} is a binary variable, and p(z_i=1) and p(z_i=0) are the prior probabilities of the respective values. The variable z_i allows the background model p(f_i | B) to explain those locations in the feature map F that are in the foreground region FG, but which the foreground model p(f_i | N_y, m) cannot explain well. A primary purpose of this mechanism is to make the cost function robust to partial occlusion. Figure 2 illustrates the inference process. Given an initial camera pose estimate, we use the Neural Mesh Model to render a feature map and evaluate the reconstruction loss in the foreground region FG (foreground score map), as well as the reconstruction error when using the background model only (background score map). Pixel-wise comparison of the foreground and background scores yields the occlusion map Z = {z_i ∈ {0, 1} | ∀i ∈ P}. The map Z indicates whether each feature vector is explained by the foreground or the background model. A fundamental benefit of our Neural Mesh Models is that they are generative on the level of neural feature activations. This makes the overall reconstruction loss very smooth compared to related works that are generative on the pixel level. Therefore, NeMo can be optimized w.r.t. the pose parameters with standard stochastic gradient descent. We visualize the loss as a function of the individual pose parameters in Figure 2.
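The occlusion inference described above reduces to a per-pixel comparison of the two score maps; a minimal sketch (the array and function names are ours):

```python
import numpy as np

def robust_score(fg_score, bg_score):
    """Per-pixel comparison of foreground and background log-likelihood
    scores.  z_i = 1 where the neural mesh model explains the feature
    better than the background model; elsewhere the background model
    takes over (e.g. under occluders)."""
    Z = fg_score >= bg_score                       # occlusion map
    score = np.where(Z, fg_score, bg_score).sum()  # robust likelihood score
    return score, Z.astype(np.int32)
```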
Note that the losses are generally very smooth and contain one clear global optimum. This is in stark contrast to the optimization of classic generative models at the level of RGB pixels, which often requires complex hand-designed initialization and optimization procedures to avoid the many local optima of the reconstruction loss Blanz & Vetter (2003); Schönborn et al. (2017).

4. EXPERIMENTS

We first describe the experimental setup in Section 4.1. Subsequently, we study the performance of NeMo at 3D pose estimation in Section 4.2, including the effect of crudely approximating the object geometry with a single 3D cuboid (one cuboid represents all objects in a category) or with multiple 3D cuboids (one cuboid represents one object subtype in a category). We study generalization to unseen views in Section 4.3 and ablate the important modules of our model in Section 4.4.

4.1. EXPERIMENTAL SETUP

Evaluation. The task of 3D object pose estimation involves the prediction of three rotation parameters (azimuth, elevation, in-plane rotation) of an object relative to the camera. In our evaluation, the rotation error between the predicted and the ground-truth rotation matrix is measured as Δ(R_pred, R_gt) = ‖logm(R_pred^T R_gt)‖_F / √2. We report two commonly used evaluation metrics: the median of the rotation error and the percentage of predictions with an error below a given accuracy threshold. Specifically, we use the thresholds π/6 and π/18.

Training Setup. In the training process, we use the 3D meshes (see Section 4.2 for experiments without the mesh geometry), the locations and scales of objects, and the 3D poses. We use Blender Community (2018) to reduce the resolution of the meshes, because the meshes provided in PASCAL3D+ have a very high number of vertices. In order to balance performance and computational cost, in particular the cost of the rendering process, we limit the size of the feature map produced by the backbone to 1/8 of the input image. To achieve this, we use a ResNet50 with two additional upsampling layers as our backbone. We train a backbone for each category separately, and learn a Neural Mesh Model for each subtype in a category. We follow the hyperparameter settings from Bai et al. (2020) for the contrastive loss. We train NeMo for 800 epochs with a batch size of 108, which takes around 3 to 5 hours per category using 6 NVIDIA RTX Titan GPUs.

Baselines. We compare our model to StarMap Zhou et al. (2018) using their official implementation and training setup. Following common practice, we also evaluate a popular baseline that formulates pose estimation as a classification problem. In particular, we evaluate the performance of a deep neural network classifier that uses the same backbone as NeMo. We train a category-specific ResNet50 (Res50-Specific), which formulates the pose estimation in each category as an individual classification problem. Furthermore, we train a non-specific ResNet50 (Res50-General), which performs pose estimation for all categories in a single classification task. We report the results of both architectures using the implementation provided by Zhou et al. (2018).

Table 1: Pose estimation results on PASCAL3D+ and the occluded PASCAL3D+ dataset. Occlusion level L0 denotes the original images from PASCAL3D+, while occlusion levels L1 to L3 denote the occluded PASCAL3D+ images with increasing occlusion ratio. We evaluate both the baselines and NeMo using Accuracy (percentage, higher better) and Median Error (degrees, lower better). Note that NeMo is exceptionally robust to partial occlusion.

Inference via Feature-level Rendering. We implement the NeMo inference pipeline (see Section 3.4) using PyTorch3D Ravi et al. (2020). Specifically, we render the Neural Mesh Models into a feature map using the feature representations Θ stored at each mesh vertex. We estimate the object pose by minimizing the reconstruction loss as introduced in Equation 12. To initialize the pose optimization, we uniformly sample 144 different poses (12 azimuth angles, 4 elevation angles and 3 in-plane rotations). We then pick the initial pose with the minimum reconstruction loss as the starting point of the optimization (only the chosen pose is optimized; the other candidates are discarded). On average, inference takes about 8s per image on a single GPU. The whole inference process on PASCAL3D+ takes about 3 hours on an 8-GPU machine.

4.2. ROBUST 3D POSE ESTIMATION UNDER OCCLUSION

Baseline performances. Table 1 (for category-specific scores, see Table 6) shows the 3D pose estimation results on PASCAL3D+ under different levels of occlusion. In the low-accuracy setting (ACC_π/6), StarMap performs exceptionally well when the object is non-occluded (L0). However, with increasing levels of partial occlusion, the performance of StarMap degrades massively, falling even below the basic classification models Res50-General and Res50-Specific. These results highlight that today's most common deep networks for 3D pose estimation are not robust. Similar generalization patterns can be observed in the high-accuracy setting (ACC_π/18). However, the classification baselines do not perform as well as before, and hence are not well suited for fine-grained 3D pose estimation. Nevertheless, they outperform StarMap at the high occlusion levels (L2 & L3). NeMo. We evaluate NeMo in three different setups: NeMo uses a down-sampled object mesh as geometry representation, while NeMo-MultiCuboid and NeMo-SingleCuboid approximate the 3D object geometry crudely using 3D cuboids. We discuss the cuboid generation and results in detail below. Compared to the baselines, NeMo achieves competitive performance at estimating the 3D pose of non-occluded objects. Moreover, NeMo is much more robust than all baseline approaches. In particular, NeMo achieves the highest performance on every evaluation metric when the objects are partially occluded. Note that the training data for all models is exactly the same. To further investigate and understand the robustness of NeMo, we qualitatively analyze the pose estimation and occluder location predictions of NeMo in Figure 3. Each subfigure shows the input image, the pose estimation result, the occluder localization map and the loss as a function of the pose angles.
We visualize the loss landscape along each pose parameter (azimuth, elevation and in-plane rotation) by sampling the individual parameters with a fixed step size, while keeping all other parameters at their ground-truth values. We further split the binary occlusion map Z into three regions to highlight the occluder localization performance of NeMo. In particular, we split the region that is explained by the background model into a yellow and a red region. The red region is covered by the rendered mesh and highlights the locations within the projected region of the mesh that the neural mesh model cannot explain well. Hence, these mark the locations in the image that NeMo predicts to be occluded. From the qualitative illustrations, we observe that NeMo maintains high robustness even under extreme occlusion, when only a small part of the object is visible. Furthermore, we can clearly see that NeMo can approximately localize the occluders. This occluder localization property makes our model not just more robust but also much more human-interpretable compared to standard deep network approaches. NeMo without a detailed object mesh. We approximate the object geometry in NeMo by replacing the down-sampled mesh with 3D cuboids (see Figure 5). The vertices of the cuboid meshes are evenly distributed on all six sides of the cuboid. For generating the cuboids, we use three constraints: 1) the cuboid should cover all the vertices of the original mesh with minimum volume; 2) the distances between each pair of adjacent vertices should be similar; 3) the total number of vertices of each mesh should be around 1200. We generate two different types of models: NeMo-MultiCuboid uses a separate cuboid for each object mesh in an object category, while NeMo-SingleCuboid uses one cuboid for all instances of a category. We report the pose estimation results of NeMo with cuboid meshes in Table 1.
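A simple way to build such a cuboid, assuming an axis-aligned bounding box and a uniform grid on each face (our own sketch; constraint 1 in the paper allows arbitrarily oriented minimum-volume cuboids, which we simplify here):

```python
import numpy as np

def cuboid_mesh(vertices, n_target=1200):
    """Approximate an object mesh by its axis-aligned bounding cuboid,
    with roughly n_target vertices evenly distributed over the six
    faces (simplified version of the paper's three constraints)."""
    lo, hi = vertices.min(axis=0), vertices.max(axis=0)
    n_side = int(np.sqrt(n_target / 6))       # grid resolution per face
    u = np.linspace(0.0, 1.0, n_side)
    pts = []
    for axis in range(3):                     # two opposite faces per axis
        a, b = [k for k in range(3) if k != axis]
        for face in (lo[axis], hi[axis]):
            for ua in u:
                for ub in u:
                    p = np.empty(3)
                    p[axis] = face
                    p[a] = lo[a] + ua * (hi[a] - lo[a])
                    p[b] = lo[b] + ub * (hi[b] - lo[b])
                    pts.append(p)
    # edges and corners are shared between faces; drop duplicates
    return np.unique(np.round(np.array(pts), 6), axis=0)
```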
The results show that approximating the detailed mesh representations of a category with a single 3D cuboid gives surprisingly good results. In particular, NeMo-SingleCuboid even often outperforms our standard model. This shows that generative models of neural network feature activations need not retain the detailed object geometry, because the feature activations are invariant to detailed shape properties. Moreover, NeMo-MultiCuboid outperforms the SingleCuboid model significantly. This suggests that for some categories the sizes of different subtypes can be very different (e.g. an airplane can be a passenger jet or a fighter jet). Therefore, a single mesh may not be representative enough for some object categories. The MultiCuboid model even outperforms the model with detailed mesh geometry. This is very likely caused by difficulties during the down-sampling of the original meshes in PASCAL3D+, which might remove important parts of the object geometry. We also conduct experiments on the ObjectNet3D dataset, which are reported in Table 2. The results demonstrate that NeMo outperforms StarMap in 14 out of 18 categories. Note that due to the considerable number of occluded and truncated images in ObjectNet3D, this dataset is significantly harder than PASCAL3D+; however, NeMo still demonstrates reasonable accuracy.

4.3. GENERALIZATION TO UNSEEN VIEWS

To further investigate the robustness of NeMo to out-of-distribution data, we evaluate the performance of NeMo when objects are observed from previously unseen viewpoints. For this, we split the PASCAL3D+ dataset into two sets based on the ground-truth azimuth angle. In particular, we use the front and rear views for training. We evaluate all approaches on the full testing set and split the performance into seen (front and rear) and unseen (side) poses. The histogram on the left of Table 4 shows the distribution of ground-truth azimuth angles in the PASCAL3D+ test dataset. The seen-test-set contains 7305 images, while the unseen-test-set contains 3507 images. Table 4 shows that NeMo generalizes significantly better to novel viewpoints than the baselines. For some categories the accuracy of NeMo on the unseen-test-set is even comparable to that on the seen-test-set (Table 7). These results highlight the importance of building neural networks with 3D internal representations, which enables them to generalize exceptionally well to unseen 3D transformations.

4.4. ABLATION STUDY

In Table 4, we study the effect of each individual module of NeMo. Specifically, we remove the clutter feature, background score and occluder prediction during inference, and only use the foreground score to compute the pose loss. This reduces the robustness to occlusion significantly. Furthermore, we remove the contrastive loss and use neural features extracted with an ImageNet-pretrained ResNet50 with non-parametric upsampling. This leads to a massive decrease in performance, and hence highlights the importance of learning locally distinct feature representations. Table 5 (and Table 10) study the sensitivity of NeMo to the random pose initialization before the pose optimization. In this ablation, we evaluate NeMo-MultiCuboid with 144 down to 1 uniformly sampled initialization poses. Note that we do not run 144 optimization processes. Instead, we evaluate the reconstruction error for each initialization and start the optimization from the initialization with the lowest error. Hence, every experiment involves only one optimization run. The results demonstrate that NeMo benefits from the smooth loss landscape: with 6 initial samples NeMo already achieves a reasonable performance, while 72 initial poses almost yield the maximum performance. This ablation clearly highlights that, unlike standard render-and-compare approaches Blanz & Vetter (1999); Schönborn et al. (2017), NeMo does not require complex hand-designed initialization strategies.
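The initialization strategy can be sketched as follows: evaluate the reconstruction loss on a coarse pose grid and start gradient-based optimization only from the best sample (the grid ranges below are illustrative, and loss_fn stands in for the neural reconstruction loss):

```python
import numpy as np

def best_initialization(loss_fn, n_azim=12, n_elev=4, n_theta=3):
    """Evaluate loss_fn on a uniform grid of n_azim * n_elev * n_theta
    candidate poses (144 in the standard setting) and return the single
    best (azimuth, elevation, in-plane rotation) starting pose; only
    this pose is then optimized, the rest are discarded."""
    azims = np.linspace(0.0, 2 * np.pi, n_azim, endpoint=False)
    elevs = np.linspace(-np.pi / 6, np.pi / 3, n_elev)   # illustrative range
    thetas = np.linspace(-np.pi / 18, np.pi / 18, n_theta)
    grid = [(a, e, t) for a in azims for e in elevs for t in thetas]
    losses = [loss_fn(p) for p in grid]
    return grid[int(np.argmin(losses))]
```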

5. CONCLUSION

In this work, we considered the problem of robust 3D pose estimation with neural networks. We found that standard deep learning approaches do not give robust predictions when objects are partially occluded or viewed from an unseen pose. In an effort to resolve this fundamental limitation, we developed Neural Mesh Models (NeMo), a neural network architecture that integrates a prototypical mesh representation with a generative model of neural features. We combine NeMo with contrastive learning and show that this makes it possible to estimate the 3D pose with very high robustness to out-of-distribution data using simple gradient-based render-and-compare. Our experiments demonstrate the superiority of NeMo compared to related work on a range of challenging datasets.

A APPENDIX




Figure 2: Overview of pose estimation: For each image, we use the trained CNN backbone to extract a feature map F. Meanwhile, using the trained Neural Mesh Model and a randomly initialized object pose, we render a second feature map. By computing the similarity between the two feature maps at each location, we create a foreground score map, which indicates the object likelihood at each location. Similarly, we obtain a background score map from F and the trained clutter model β. Using these two maps, we perform occlusion inference to segment the image into a foreground and a background region. Then, we calculate the reconstruction loss and optimize the object pose by minimizing this loss. We also visualize the loss landscape along all three object pose parameters, and the final pose prediction.

Figure 3: Qualitative results of NeMo on PASCAL3D+ (L0) and occluded PASCAL3D+ (L1, L2 & L3) for different categories under different occlusion levels. For each example, we show four subfigures. Top-left: the input image. Top-right: a mesh superimposed on the input image in the predicted 3D pose. Bottom-left: the occluder localization result, where yellow is background, green is the non-occluded area of the object and red is the occluded area as predicted by NeMo. Bottom-right: the loss landscape for each individual camera parameter. The colored vertical lines mark the final prediction, and the ground-truth parameter is at the center of the x-axis.

Following Zhou et al. (2018), we assume the centers and scales of the objects are given in all experiments. Datasets. We evaluate NeMo on both the PASCAL3D+ dataset Xiang et al. (2014) and the occluded PASCAL3D+ dataset Wang et al. (2020). PASCAL3D+ contains 12 man-made object categories with 3D pose annotations and 3D meshes for each category. We follow Wang et al. (2020) and Bai et al. (2020) and split PASCAL3D+ into a training set with 11045 images and a validation set with 10812 images. The occluded PASCAL3D+ dataset is a benchmark to evaluate robustness under occlusion. This dataset simulates realistic man-made occlusion by artificially superimposing occluders collected from the MS-COCO dataset Lin et al. (2014) on objects in PASCAL3D+. The dataset contains all 12 object classes from PASCAL3D+ with three levels of occlusion, where L1: 20-40%, L2: 40-60% and L3: 60-80% of the object area is occluded. We further test NeMo on the ObjectNet3D dataset Xiang et al. (2016), which is also a category-level 3D pose estimation benchmark. ObjectNet3D contains 100 different categories with 3D meshes, with 17101 training samples and 19604 testing samples in total, including 3556 occluded or truncated testing samples. Following Zhou et al. (2018), we report pose estimation results on 18 categories. Note that, different from StarMap, we use all images during evaluation, including occluded or truncated samples.

Figure 5: Starting from the detailed mesh model, we can create all types of mesh models for NeMo. (a) We use the remesh method in Blender to down-sample the original mesh. The processed mesh contains 1722 vertices. (b) Following the rules in Section 4.2, we create subtype-specific cuboids (one cuboid for each subtype), which are used in the NeMo-MultiCuboid approach. The cuboid contains 1096 vertices. (c) We create the subtype-general cuboid by requiring the cuboid to cover the original meshes of all subtypes, and we use this cuboid to represent all objects in the category, which is reported as NeMo-SingleCuboid. This cuboid contains 1080 vertices.

Figure 6: Visualization of failure cases of NeMo on occluded PASCAL3D+. For each example, we show four subfigures. Top-left: the input image. Top-right: a mesh superimposed on the input image in the predicted 3D pose. Bottom-left: the occluder localization result, where yellow is background, green is the non-occluded area of the object and red is the occluded area as predicted by NeMo. Bottom-right: the loss landscape for each individual camera parameter. The colored vertical lines mark the final prediction, and the ground-truth parameter is at the center of the x-axis.

Lepetit et al. (2009). Zhou et al. (2018) further improved this approach by utilizing depth information. Recent work Wang et al. (2019); Chen et al. (2020) introduced render-and-compare for category-level pose estimation. However, both approaches used pixel-level image synthesis and required detailed mesh models during training. In contrast, NeMo performs render-and-compare on the level of contrastive features, which are invariant to intra-category nuisances, such as shape and color variations. This enables NeMo to achieve accurate 3D pose estimation results even with a crude prototypical category-level mesh representation. Pose Estimation under Partial Occlusion. Keypoint-based pose estimation methods are sensitive to outliers, which can be caused by partial occlusion Pavlakos et al. (2017); Sundermeyer et al. (2018). Some rendering-based approaches achieve satisfactory results on instance-level pose estimation under partial occlusion Song et al. (2020); Peng et al. (2019); Zakharov et al. (2019); Li et al. (2018). However, these approaches render RGB images or use instance-level constraints, e.g. pixel-level voting, to estimate the object pose. Therefore, these approaches are not suited for category-level pose estimation. To the best of our knowledge, NeMo is the first approach that performs category-level pose estimation robustly under partial occlusion. Contrastive Feature Learning. Contrastive learning is widely used in deep learning research.

Pose estimation results on ObjectNet3D, evaluated as the pose estimation accuracy (percentage, higher better) for errors under π/6. Both the baseline and NeMo are evaluated on all images of each given category, including occluded and truncated samples. Overall, NeMo has higher accuracy in 14 categories and lower accuracy in 4 categories.

Pose estimation results on PASCAL3D+ for objects in seen and unseen poses. The histogram on the left shows how we separate the PASCAL3D+ test dataset into subsets based on the azimuth pose of the object. We split the training dataset in the same way and trained all models only on the "seen" subset. We evaluate on both test sets (Seen & Unseen). Note the strong generalization performance of NeMo to unseen viewpoints.

Ablation study on PASCAL3D+ and occluded PASCAL3D+. All ablation experiments are conducted with the NeMo-MultiCuboid model. The performance is reported in terms of Accuracy (percentage, higher better) and Median Error (degree, lower better).

Sensitivity of NeMo-MultiCuboid under different numbers of pose initializations during inference (Init Samples) on PASCAL3D+.

Pose estimation results on PASCAL3D+ (L0) for all categories respectively. Results reported in Accuracy (percentage, higher better) and Median Error (degree, lower better). MedErr NeMo-MultiCuboid 11.8 13.4 14.8 10.2 2.6 2.8 10.1 8.8 14.0 7.0 5.0 8.1 8.2 ↓ MedErr NeMo-SingleCuboid 10.1 16.3 14.9 10.2 3.2 3.2 10.1 9.3 14.1 8.6 5.4 12.2 8.8

Pose estimation results on occluded PASCAL3D+ occlusion L1 for all categories respectively. Results reported in Accuracy (percentage, higher better) and Median Error (degree, lower better).

Pose estimation results on occluded PASCAL3D+ occlusion L2 for all categories respectively. Results reported in Accuracy (percentage, higher better) and Median Error (degree, lower better). MedErr NeMo-MultiCuboid 38.5 26.4 38.2 18.8 7.0 7.3 23.0 23.0 36.0 14.0 14.9 16.1 20.2 ↓ MedErr NeMo-SingleCuboid 39.9 30.6 38.8 19.5 8.3 7.8 21.3 24.8 29.5 14.2 16.9 18.5 20.9

Pose estimation results on occluded PASCAL3D+ occlusion L3 for all categories respectively. Results reported in Accuracy (percentage, higher better) and Median Error (degree, lower better). MedErr NeMo-MultiCuboid 69.8 49.6 63.0 28.2 19.4 14.9 35.4 39.9 60.0 23.7 38.1 27.2 36.1 ↓ MedErr NeMo-SingleCuboid 74.8 46.1 70.1 24.5 30.2 16.3 35.2 37.5 50.5 21.5 31.7 29.9 36.5

Full table for Table 5. This table shows category-specific results of NeMo-MultiCuboid pose estimation on PASCAL3D+ using different numbers of initialization poses during inference. Init Samples gives the total number of initialization poses, e.g. 144 means we uniformly sample 12 (azimuth) * 4 (elevation) * 3 (in-plane rotation) poses. Std. marks the standard setting used in the main experiments.

Experiment for NeMo-MultiCuboid when the subtype is not given during inference. In the w/o-subtype experiment we run inference on each image with the Neural Mesh Models of all subtypes and pick the predicted pose of the subtype with the minimum reconstruction loss. The results demonstrate that distinguishing subtypes is not necessary for pose estimation with NeMo.

Pose estimation results on PASCAL3D+ under unseen poses for the car category. The figure shows the distribution of azimuth angles in the PASCAL3D+ test set for the car category and our split.

ACKNOWLEDGMENTS

We gratefully acknowledge funding support from ONR N00014-18-1-2119, ONR N00014-20-1-2206, the Institute for Assured Autonomy at JHU with Grant IAA 80052272, and the Swiss National Science Foundation with Grant P2BSP2.181713. We also thank Weichao Qiu, Qing Liu, Yutong Bai and Jiteng Mu for suggestions on our paper.

