3D NEURAL EMBEDDING LIKELIHOOD FOR ROBUST SIM-TO-REAL TRANSFER IN INVERSE GRAPHICS

Anonymous

Abstract

A central challenge in 3D scene perception via inverse graphics is robustly modeling the gap between 3D graphics and real-world data. We propose a novel 3D Neural Embedding Likelihood (3DNEL) over RGB-D images to address this gap. 3DNEL uses neural embeddings to predict 2D-3D correspondences from RGB and combines this with depth in a principled manner. 3DNEL is trained entirely from synthetic images and generalizes to real-world data. To showcase this capability, we develop a multi-stage inverse graphics pipeline that uses 3DNEL for 6D object pose estimation from real RGB-D images. Our method outperforms the previous state-of-the-art in sim-to-real pose estimation on the YCB-Video dataset, and improves robustness, with significantly fewer large-error predictions. Unlike existing bottom-up, discriminative approaches that are specialized for pose estimation, 3DNEL adopts a probabilistic generative formulation that jointly models multi-object scenes. This generative formulation enables easy extension of 3DNEL to additional tasks like object and camera tracking from video, using principled inference in the same probabilistic model without task-specific retraining.

1. INTRODUCTION

There is a widespread need for models that bridge the gap between 3D graphics and real RGB-D data. Accurate simulation environments exist in domains such as autonomous driving, augmented reality, and robotic manipulation, yet robust 3D scene perception remains a central bottleneck. "Inverse graphics" is an appealing approach to 3D scene understanding that treats scene perception as the inverse problem to 3D graphics. In practice, however, these methods have been outperformed by more bottom-up, discriminative approaches, especially those using deep learning (LeCun et al., 2015). A key challenge in inverse graphics is modeling the gap between rendered images and observed real-world images.

This paper addresses this gap with 3D Neural Embedding Likelihood (3DNEL), a likelihood model of RGB-D images that is trained entirely from synthetic data and generalizes to real-world data. 3DNEL uses learned neural embeddings to predict dense 2D-3D correspondences from RGB and combines this with 3D information from depth in a principled way. To showcase 3DNEL's capabilities in sim-to-real transfer, we develop a multi-stage inverse graphics pipeline (MSIGP) that uses 3DNEL for 6D object pose estimation. We demonstrate that 3DNEL can be applied to real RGB-D images without training on any real data. Our 3DNEL MSIGP consists of (1) a coarse enumerative procedure that generates pose hypotheses and an initial estimate of the 3D scene, and (2) an iterative Markov chain Monte Carlo (MCMC) process that refines the 3D scene.

We empirically evaluate 3DNEL MSIGP on the popular YCB-Video (YCB-V) dataset (Xiang et al., 2018). 3DNEL MSIGP outperforms the previous state-of-the-art (SOTA), SurfEMB (Haugaard & Buch, 2022), in sim-to-real 6D pose estimation, albeit at the cost of increased computation. It is also significantly more robust: we show over a 50% reduction in high-error pose predictions compared to SurfEMB.
Extensive ablation studies identify the sources of these performance improvements. Existing approaches to 6D pose estimation are predominantly discriminative and bottom-up, and are specialized to that specific task. In contrast, 3DNEL adopts a probabilistic generative formulation that extends beyond pose estimation. To demonstrate the value of this generative formulation, we present additional experiments showing that 3DNEL extends easily to object and camera tracking from video, using principled inference in the same probabilistic model without task-specific retraining.
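The two-stage structure of the MSIGP can be illustrated with a toy sketch. The following Python code assumes a hypothetical stand-in likelihood over 6-vector poses (the actual pipeline renders full 3D scenes and scores them with 3DNEL); it is a sketch of the coarse-enumeration-then-MCMC-refinement pattern, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_likelihood(pose, observed):
    # Toy stand-in for 3DNEL: the real pipeline renders the 3D scene
    # described by `pose` and scores it against the observed RGB-D image.
    return -50.0 * float(np.sum((pose - observed) ** 2))

def coarse_enumeration(observed, n_hypotheses=512):
    # Stage 1: score a batch of enumerated pose hypotheses, keep the best.
    hypotheses = rng.uniform(-1.0, 1.0, size=(n_hypotheses, 6))
    scores = [log_likelihood(h, observed) for h in hypotheses]
    return hypotheses[int(np.argmax(scores))]

def mcmc_refinement(pose, observed, n_steps=2000, step_size=0.05):
    # Stage 2: Metropolis-Hastings with Gaussian random-walk proposals;
    # tracking the best pose visited means refinement never degrades
    # the coarse estimate's score.
    current, current_ll = pose, log_likelihood(pose, observed)
    best, best_ll = current, current_ll
    for _ in range(n_steps):
        proposal = current + rng.normal(scale=step_size, size=6)
        proposal_ll = log_likelihood(proposal, observed)
        if np.log(rng.uniform()) < proposal_ll - current_ll:
            current, current_ll = proposal, proposal_ll
            if current_ll > best_ll:
                best, best_ll = current, current_ll
    return best

true_pose = rng.uniform(-1.0, 1.0, size=6)
coarse = coarse_enumeration(true_pose)
refined = mcmc_refinement(coarse, true_pose)
```

The coarse stage trades breadth for precision; the stochastic refinement stage then makes local moves that concentrate around high-likelihood scene descriptions.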

2. RELATED WORK

3D Inverse Graphics. Our method follows a long line of work in the "analysis-by-synthesis" paradigm that treats perception as the inverse problem to computer graphics (Kersten & Yuille, 1996; Yuille & Kersten, 2006; Lee & Mumford, 2003; Kersten et al., 2004; Mansinghka et al., 2013; Kulkarni et al., 2015). While elegant and conceptually appealing, robustly modeling the gap between 3D graphics and real-world data, especially using appearance information, remains a central challenge in 3D inverse graphics. Our key observation is that dense 2D-3D correspondences, widely used in many recent 6D pose estimation methods (Hodan et al., 2020; Li et al., 2019; He et al., 2020; Tremblay et al., 2018; Florence et al., 2018; Haugaard & Buch, 2022), provide a natural way to model appearance information in 3D inverse graphics, and can be combined with depth information into a unified probabilistic generative model that effectively bridges the sim-to-real gap.

Sim-to-Real Transfer. Most computer vision systems require annotated real-world data for training (Hodan et al., 2017; Brachmann et al., 2014; Xiang et al., 2018; Krizhevsky et al., 2017) to achieve strong performance on real-world data at test time; without it, the "sim-to-real gap" is too difficult to overcome. In practice, collecting and annotating real data is tedious and expensive. Recent advances in photorealistic rendering and physics-based simulation (Hodan et al., 2018; Tremblay et al., 2018; Denninger et al., 2019) have enabled strong performance from models trained entirely on synthetic data. SurfEMB (Haugaard & Buch, 2022) is one such model that obtains SOTA performance in 6D pose estimation, outperforming numerous approaches that train on real data. 3DNEL builds upon SurfEMB's learned dense 2D-3D correspondences and expands previous work (Gothoskar et al., 2021) on probabilistic modeling of real depth data.
It combines RGB and depth information into a unified probabilistic generative model, and improves nontrivially over SurfEMB in both accuracy and robustness, achieving a new SOTA in sim-to-real 6D pose estimation.

6D Object Pose Estimation. 6D object pose estimation aims to infer the rigid SE(3) transformation (position and orientation) of an object in the camera frame, given an image observation. Discriminatively trained deep learning approaches (Xiang et al., 2018; Li et al., 2018; Deng et al., 2019; He et al., 2020; Sundermeyer et al., 2018) have, in general, outperformed more traditional feature- or template-matching methods (Besl & McKay, 1992; Rusinkiewicz & Levoy, 2001; Lowe, 1999; Rothganger et al., 2006; Collet et al., 2011). With the advent of RGB-D cameras, there has been growing interest in leveraging both the appearance information from RGB and the geometric shape information from depth. Existing methods either use depth to post-process estimates from RGB (Xiang et al., 2018; Haugaard & Buch, 2022), or fuse learned features from both RGB and real, noisy depth (Wang et al., 2019; He et al., 2021). 3DNEL differs from prior work in two important ways: (1) it combines RGB and depth information in a principled probabilistic model, which enables superior sim-to-real transfer; and (2) it jointly models multi-object scenes using a probabilistic generative formulation, allowing easy extension to tasks beyond 6D pose estimation.

3.1. PRELIMINARIES

Likelihood for 3D Inverse Graphics. 3D inverse graphics formulates the perception problem as a search for the 3D scene description that, when rendered by a graphics engine, best reconstructs the input image. A central challenge in applying the 3D inverse graphics approach to real images is robustly modeling the "gap" between rendered and real images. In this paper we aim to develop a likelihood P(observed RGB-D image | 3D scene description) that combines shape and appearance information in a principled way to robustly assess how well an observed RGB-D image is explained by a 3D scene description.

A naive approach is to render the 3D scene description to an RGB-D image and define the likelihood as a noise model that directly compares the rendered and real RGB-D images. Recent work on 3DP3 (Gothoskar et al., 2021) demonstrates promising performance of this approach when applied to depth images. However, it is much more challenging to specify a sensible noise model operating directly on RGB images. Intuitively, the "gap" between rendered and real depth images is mainly due to small spatial displacements, whereas the "gap" between rendered and real RGB images arises from a variety of different factors. In addition, a principled combination of RGB and depth information in a unified likelihood remains an open problem.
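To make the naive depth-only approach concrete, the sketch below scores an observed depth image against a rendered one under a per-pixel noise model. The particular model (a Gaussian around the rendered depth mixed with a uniform outlier component over the sensor range) and all numeric values are illustrative assumptions, not the exact formulation of 3DP3 or 3DNEL:

```python
import numpy as np

def depth_log_likelihood(rendered, observed, sigma=0.01,
                         outlier_prob=0.05, depth_range=2.0):
    # Per-pixel robust noise model on depth images (in meters): each
    # observed pixel is explained either by Gaussian noise around the
    # rendered depth, or (with probability `outlier_prob`) by a uniform
    # outlier anywhere in the sensor's depth range.
    inlier = ((1.0 - outlier_prob)
              * np.exp(-0.5 * ((observed - rendered) / sigma) ** 2)
              / (np.sqrt(2.0 * np.pi) * sigma))
    outlier = outlier_prob / depth_range
    return float(np.sum(np.log(inlier + outlier)))

rng = np.random.default_rng(1)
rendered = np.full((48, 64), 1.0)                       # flat rendered depth
observed = rendered + 0.005 * rng.standard_normal((48, 64))
observed[0, 0] = 1.9                                    # one gross outlier pixel

clean = depth_log_likelihood(rendered, rendered)
noisy = depth_log_likelihood(rendered, observed)
```

The uniform outlier term bounds how much any single bad pixel can penalize a scene hypothesis, so `noisy` is lower than `clean` but not catastrophically so; a pure Gaussian model would let the one outlier pixel dominate the score.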

