SURFACE RECONSTRUCTION IN THE WILD BY DEFORMING SHAPE PRIORS FROM SYNTHETIC DATA

Abstract

We present a new method for category-specific 3D reconstruction from a single image. A limitation of current deep learning models for color-image-based 3D reconstruction is that they do not generalize across datasets due to domain shift. In contrast, we show that one can learn to reconstruct objects across datasets through shape priors learned from synthetic 3D data and a point cloud pose canonicalization method. Given a single depth image at test time, we first place this partial point cloud in a canonical pose. Then, we use a neural deformation field in the canonical coordinate frame to reconstruct the 3D surface of the object. Finally, we jointly optimize object pose and 3D shape to fit the partial depth observation. Our approach achieves state-of-the-art reconstruction performance across several real-world datasets, even when trained without ground truth camera poses (which are required by some state-of-the-art methods). We further show that our method generalizes to different input modalities, from dense depth images to sparse and noisy LIDAR scans.

1. INTRODUCTION

Reconstructing 3D object surfaces from images is a longstanding problem in the computer vision community, with applications in robotics (Bylow et al., 2013) and content creation (Huang et al., 2017). Every computational approach to 3D reconstruction has to answer the question of which representation is best suited for the underlying 3D structure. An increasingly popular answer is to use neural fields (Park et al., 2019; Mescheder et al., 2019) for this task. These neural fields, trained on 3D ground truth data, represent the de-facto gold standard regarding reconstruction quality. However, the reliance on 3D ground truth has, for now, limited these approaches to synthetic data. To remove the reliance on 3D data, the community has shifted to dense (Mildenhall et al., 2020) or sparse (Zhang et al., 2021) multi-view supervision with known camera poses. Similarly, single-view 3D reconstruction methods have also made considerable progress by using neural fields as their shape representation (Lin et al., 2020; Duggal & Pathak, 2022). While these single-view methods can be trained from unconstrained image collections, they have not achieved the high quality of multi-view or 3D ground truth supervised models.

In this work, we aim to answer the question: How can we achieve the reconstruction quality of 3D supervised methods from single-view observations in the wild? With recent advances in generative modeling of synthetic 3D data (Gao et al., 2022), using 3D data for supervision has become practical once again. However, the problem of aligning image observations to canonical spaces remains challenging. One way to solve this alignment problem is to learn the camera pose from data (Ye et al., 2021). However, learning camera pose prediction from color images is a complex problem, and existing methods do not generalize to new datasets due to domain shifts. Another promising research direction is the use of equivariant neural networks.
For example, ConDor (Sajnani et al., 2022) and Equi-Pose (Li et al., 2021) use equivariant network layers to canonicalize complete and partial point clouds through a self-supervised reconstruction loss. Given an image taken from a calibrated camera, instead of relying on ground truth camera poses during inference, as other single-view 3D reconstruction methods do (Lin et al., 2020; Duggal & Pathak, 2022), we use a single depth image together with a pretrained canonicalization network to register the partial point cloud to the canonical coordinate space. However, we found that canonical reconstruction methods are extremely sensitive to deviations in the estimated canonical pose (Section 4.5). To recover from poor registration results, we jointly fine-tune 3D shape and pose using only the partial shape (Figure 2). We achieve 3D reconstruction results on synthetic data that are close to or better than the state of the art. Furthermore, we show that using depth images as input allows for generalization across various datasets, from dense depth in synthetic and natural images to sparse depth inputs from LIDAR scans.

Figure 1: We leverage synthetic 3D data to learn a shape prior. Using a pose registration algorithm, we canonicalize partial point clouds to the canonical coordinate frame to generate diverse 3D reconstructions.
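The joint fine-tuning of shape and pose can be illustrated with a toy version of the optimization: treat a sphere's radius as a one-parameter stand-in for the learned shape code and a translation as the residual pose error, then descend on the mean squared SDF residual of the partial observation. The analytic sphere SDF and all names below are illustrative assumptions, not the paper's actual network:

```python
import numpy as np

def fit_pose_and_shape(partial, steps=2000, lr=0.1):
    """Jointly refine pose (translation t) and shape (radius r) so the partial
    observation lands on the zero level set of a sphere SDF, a stand-in for
    the learned deformation field."""
    t = np.zeros(3)   # residual pose error to correct
    r = 0.5           # initial shape-code guess
    for _ in range(steps):
        p = partial - t                       # observation in the canonical frame
        d = np.linalg.norm(p, axis=1)
        res = d - r                           # per-point SDF residuals
        # analytic gradients of L = mean(res**2)
        grad_t = (-2.0 * res[:, None] * p / d[:, None]).mean(axis=0)
        grad_r = (-2.0 * res).mean()
        t -= lr * grad_t
        r -= lr * grad_r
    return t, r
```

Even with only a hemisphere of the surface observed, both the pose offset and the shape parameter are recoverable in this toy setting, which mirrors why a partial depth map suffices for the refinement step.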

2. RELATED WORK

3D object reconstruction based on a conditional input, such as images or depth, is an active research area (Chen et al., 2019; Mescheder et al., 2019; Mildenhall et al., 2020; Sitzmann et al., 2019; Tulsiani et al., 2017; Häni et al., 2020). The de-facto gold standard in terms of reconstruction quality uses 3D ground truth data (Park et al., 2019; Mescheder et al., 2019). However, these approaches are largely limited to synthetic data, such as ShapeNet (Chang et al., 2015). Reconstruction of real-world shapes has been performed by transferring the learned representation across domains (Duggal et al., 2022; Bechtold et al., 2021) or with the use of special depth sensors (Newcombe et al., 2011; Choe et al., 2021). However, collecting 3D ground truth data in the real world can be difficult. With the development of neural rendering and inverse graphics methods, the requirement for 3D ground truth has been relaxed in favor of dense multi-view supervision (Xu et al., 2019; Mildenhall et al., 2020; Goel et al., 2022; Zhang et al., 2021) or single-view methods that require ground truth camera poses (Lin et al., 2020; Duggal & Pathak, 2022). However, not all applications allow for the collection of multi-view images, and estimating camera poses from images remains challenging. With the advent of generative models for 3D shapes (Gao et al., 2022), using 3D supervision has become an interesting prospect once more. However, these 3D models all live in a canonical coordinate frame. Our work shows how such canonical 3D data can be leveraged for shape reconstruction in the wild.

2.1. 3D RECONSTRUCTION FROM SINGLE VIEWS

There have been extensive studies on 3D reconstruction from single-view images using various 3D representations, such as voxels (Yan et al., 2016; Tulsiani et al., 2017; Wu et al., 2017; 2018; Yang et al., 2018), points (Fan et al., 2017; Yang et al., 2019), primitives (Deng et al., 2020; Chen et al., 2020), or meshes (Kanazawa et al., 2018; Goel et al., 2022). Most of the methods above use explicit representations, which suffer from limited resolution or fixed topology. Neural rendering and neural fields provide an alternative representation that overcomes these limitations. Recent methods have shown how to learn Signed Distance Functions (SDFs) (Xu et al., 2019; Lin et al., 2020; Duggal & Pathak, 2022) or volumetric representations such as occupancy (Ye et al., 2021), which show great promise for learning category-specific 3D reconstructions from unstructured image collections. However, these methods usually require additional information, such as ground truth camera poses, which limits their applicability. In our work, we propose a method that does not require ground truth camera poses and leverages widely available synthetic data to learn a category-specific 3D prior model.
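Since several of the methods above represent shape as an SDF, it may help to see how a surface is queried from one. Below is a minimal sphere-tracing sketch on an analytic torus SDF; the analytic SDF is our stand-in for a trained neural field, and all names are illustrative:

```python
import numpy as np

def sdf_torus(p, R=0.6, r=0.25):
    """Analytic SDF of a torus in the xy-plane (stand-in for a neural field)."""
    q = np.array([np.hypot(p[0], p[1]) - R, p[2]])
    return np.linalg.norm(q) - r

def sphere_trace(origin, direction, sdf, max_steps=128, eps=1e-4, t_max=10.0):
    """March a ray through the field; the SDF value at each point is a safe
    step size, so we advance by it until we (nearly) touch the zero level set."""
    t = 0.0
    for _ in range(max_steps):
        d = sdf(origin + t * direction)
        if d < eps:
            return t          # ray-surface hit distance
        t += d
        if t > t_max:
            break
    return None               # ray missed the surface
```

The same loop works unchanged when `sdf` is a trained network's forward pass, which is what makes implicit surfaces convenient for both rendering and supervision.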

