3D NEURAL EMBEDDING LIKELIHOOD FOR ROBUST SIM-TO-REAL TRANSFER IN INVERSE GRAPHICS

Anonymous

Abstract

A central challenge in 3D scene perception via inverse graphics is robustly modeling the gap between 3D graphics and real-world data. We propose a novel 3D Neural Embedding Likelihood (3DNEL) over RGB-D images to address this gap. 3DNEL uses neural embeddings to predict 2D-3D correspondences from RGB and combines this information with depth in a principled manner. 3DNEL is trained entirely from synthetic images and generalizes to real-world data. To showcase this capability, we develop a multi-stage inverse graphics pipeline that uses 3DNEL for 6D object pose estimation from real RGB-D images. Our method outperforms the previous state of the art in sim-to-real pose estimation on the YCB-Video dataset and improves robustness, with significantly fewer large-error predictions. Unlike existing bottom-up, discriminative approaches that are specialized for pose estimation, 3DNEL adopts a probabilistic generative formulation that jointly models multi-object scenes. This generative formulation enables easy extension of 3DNEL to additional tasks, such as object and camera tracking from video, using principled inference in the same probabilistic model without task-specific retraining.

1. INTRODUCTION

There is a widespread need for models that bridge the gap between 3D graphics and real RGB-D data. Accurate simulation environments exist in domains such as autonomous driving, augmented reality, and robotic manipulation, yet robust 3D scene perception remains a central bottleneck. "Inverse graphics" is an appealing approach to 3D scene understanding that treats scene perception as the inverse problem to 3D graphics. In practice, however, these methods have been outperformed by more bottom-up, discriminative approaches, especially those using deep learning (LeCun et al., 2015). A key challenge in inverse graphics is modeling the gap between rendered images and observed real-world images.

This paper addresses that gap with 3D Neural Embedding Likelihood (3DNEL), a likelihood model of RGB-D images that is trained entirely from synthetic data and generalizes to real-world data. 3DNEL uses learned neural embeddings to predict dense 2D-3D correspondences from RGB and combines this with 3D information from depth in a principled way.

To showcase 3DNEL's capabilities in sim-to-real transfer, we develop a multi-stage inverse graphics pipeline (MSIGP) that uses 3DNEL for 6D object pose estimation, and demonstrate that it can be applied to real RGB-D images without training on any real data. The 3DNEL MSIGP consists of (1) a coarse enumerative procedure that generates pose hypotheses and an initial estimate of the 3D scene, and (2) an iterative Markov chain Monte Carlo (MCMC) process that fine-tunes the 3D scene. We empirically evaluate 3DNEL MSIGP on the popular YCB-Video (YCB-V) dataset (Xiang et al., 2018). 3DNEL MSIGP outperforms the previous state-of-the-art (SOTA) SurfEMB (Haugaard & Buch, 2022) in sim-to-real 6D pose estimation, albeit at the cost of increased computation. It is also significantly more robust: we show over 50% reduction in high-error pose predictions compared to SurfEMB. Extensive ablation studies identify the sources of these performance improvements.

Existing approaches to 6D pose estimation are predominantly discriminative and bottom-up, and are specialized to that single task. In contrast, 3DNEL adopts a probabilistic generative formulation that extends beyond pose estimation. To demonstrate the value of this generative formulation, we present additional experiments on 3DNEL's easy extension to object and camera tracking from video, using principled inference in the same probabilistic model without task-specific retraining.
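The two-stage structure of the pipeline (coarse enumeration of pose hypotheses followed by MCMC refinement) can be illustrated with a toy sketch. Everything below is a simplified, hypothetical stand-in, not the paper's implementation: the pose is one-dimensional, a Gaussian surrogate replaces the 3DNEL image likelihood, and all function names are invented for illustration:

```python
import numpy as np

# Illustrative stand-in for an image likelihood: a Gaussian log-density
# around an (unknown to the inference code) ground-truth pose.
TRUE_POSE = 1.3

def log_likelihood(pose, noise=0.2):
    # Surrogate for 3DNEL's likelihood of the observed image given a pose.
    return -0.5 * ((pose - TRUE_POSE) / noise) ** 2

def coarse_enumeration(grid):
    # Stage 1: enumerate pose hypotheses on a coarse grid and keep the
    # highest-scoring one as the initial scene estimate.
    scores = np.array([log_likelihood(p) for p in grid])
    return float(grid[int(np.argmax(scores))])

def mcmc_refine(pose, n_steps=2000, step=0.05, seed=0):
    # Stage 2: Metropolis-Hastings refinement around the coarse estimate,
    # tracking the best (highest-likelihood) pose visited by the chain.
    rng = np.random.default_rng(seed)
    cur, cur_ll = pose, log_likelihood(pose)
    best, best_ll = cur, cur_ll
    for _ in range(n_steps):
        prop = cur + rng.normal(0.0, step)
        prop_ll = log_likelihood(prop)
        if np.log(rng.uniform()) < prop_ll - cur_ll:  # MH accept rule
            cur, cur_ll = prop, prop_ll
            if cur_ll > best_ll:
                best, best_ll = cur, cur_ll
    return best

grid = np.linspace(-3.0, 3.0, 13)   # coarse grid with spacing 0.5
coarse = coarse_enumeration(grid)   # within half a grid step of TRUE_POSE
refined = mcmc_refine(coarse)       # local refinement tightens the estimate
```

The real pipeline scores full 6D poses of multiple objects by rendering and comparing against RGB-D observations, but the division of labor is the same: cheap global enumeration provides initialization, and stochastic local search in the same probabilistic model does the fine-tuning.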

