SURFACE RECONSTRUCTION IN THE WILD BY DEFORMING SHAPE PRIORS FROM SYNTHETIC DATA

Abstract

We present a new method for category-specific 3D reconstruction from a single image. A limitation of current deep-learning models that reconstruct 3D shape from color images is that they do not generalize across datasets due to domain shift. In contrast, we show that one can learn to reconstruct objects across datasets through shape priors learned from synthetic 3D data, combined with a point cloud pose canonicalization method. Given a single depth image at test time, we first place the corresponding partial point cloud in a canonical pose. We then use a neural deformation field in the canonical coordinate frame to reconstruct the 3D surface of the object. Finally, we jointly optimize object pose and 3D shape to fit the partial depth observation. Our approach achieves state-of-the-art reconstruction performance across several real-world datasets, even when trained without the ground-truth camera poses that some state-of-the-art methods require. We further show that our method generalizes to different input modalities, from dense depth images to sparse and noisy LiDAR scans.
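To make the final stage of the pipeline concrete, the sketch below illustrates the test-time joint optimization of object pose and shape against a partial depth observation. It is a toy example, not the paper's implementation: an analytic sphere signed distance function (SDF) stands in for the learned neural deformation field, the pose is reduced to a translation, and all names and values are illustrative. The fitting principle is the same: adjust shape and pose parameters by gradient descent so that the SDF evaluates to zero on the observed point cloud.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "observed" point cloud: points on a sphere of radius 0.7
# centered at (0.1, 0, 0). In the actual method this would be the
# back-projected, canonicalized depth image.
true_center = np.array([0.1, 0.0, 0.0])
dirs = rng.normal(size=(200, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
points = true_center + 0.7 * dirs

# Parameters to optimize: radius r (stand-in for the latent shape code)
# and translation t (stand-in for the object pose).
r, t = 1.0, np.zeros(3)
lr = 0.1

for _ in range(500):
    diff = points - t                       # (N, 3)
    dist = np.linalg.norm(diff, axis=1)     # ||x_i - t||
    f = dist - r                            # SDF residual at each observed point
    # Analytic gradients of the mean squared SDF residual.
    grad_r = np.mean(-2.0 * f)
    grad_t = np.mean(-2.0 * f[:, None] * diff / dist[:, None], axis=0)
    r -= lr * grad_r
    t -= lr * grad_t
```

After optimization, `r` and `t` recover the radius and center of the observed sphere (r ≈ 0.7, t ≈ (0.1, 0, 0)); in the full method the same residual-driven descent would update a rotation, translation, and latent shape code through the neural field.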

1. INTRODUCTION

Reconstructing 3D object surfaces from images is a longstanding problem in computer vision, with applications in robotics (Bylow et al., 2013) and content creation (Huang et al., 2017). Every computational approach to 3D reconstruction must answer the question of which representation is best suited to the underlying 3D structure. An increasingly popular answer is to use neural fields (Park et al., 2019; Mescheder et al., 2019). Trained on 3D ground-truth data, neural fields represent the de facto gold standard in reconstruction quality. However, the reliance on 3D ground truth has, so far, limited these approaches to synthetic data. To remove this reliance, the community has shifted to dense (Mildenhall et al., 2020) or sparse (Zhang et al., 2021) multi-view supervision with known camera poses. Similarly, single-view 3D reconstruction methods have made considerable progress by adopting neural fields as their shape representation (Lin et al., 2020; Duggal & Pathak, 2022). While these single-view methods can be trained from unconstrained image collections, they have not matched the quality of models supervised with multi-view data or 3D ground truth.

In this work, we aim to answer the question: how can we achieve the reconstruction quality of 3D-supervised methods from single-view observations in the wild? With recent advances in generative modeling of synthetic 3D data (Gao et al., 2022), using 3D data for supervision has become practical once again. However, aligning image observations to canonical spaces remains challenging. One way to solve this alignment problem is to learn the camera pose from data (Ye et al., 2021). However, learning camera-pose prediction from color images is a complex problem, and existing methods do not generalize to new datasets due to domain shift. Another promising research direction is the use of equivariant neural networks.
For example, ConDor (Sajnani et al., 2022) and Equi-Pose (Li et al., 2021) use equivariant network layers to canonicalize complete and partial point clouds through a self-supervised reconstruction loss. Given an image taken with a calibrated camera, instead of using ground-truth camera poses during inference, as in other single-view 3D reconstruction methods (Lin et al., 2020; Duggal & Pathak, 2022), we suggest using a single depth image together with a pretrained canonicalization network to register the partial point cloud to the canonical coordinate space. However, we found that canonical reconstruction methods are extremely sensitive to deviations in the estimated pose.

