VOXURF: VOXEL-BASED EFFICIENT AND ACCURATE NEURAL SURFACE RECONSTRUCTION

Abstract

Neural surface reconstruction aims to reconstruct accurate 3D surfaces from multi-view images. Previous methods based on neural volume rendering mostly train a fully implicit model with MLPs, which typically requires hours of training for a single scene. Recent efforts explore explicit volumetric representations to accelerate the optimization by memorizing significant information in learnable voxel grids. However, existing voxel-based methods often struggle to reconstruct fine-grained geometry, even when combined with an SDF-based volume rendering scheme. We reveal that this is because 1) the voxel grids tend to break the color-geometry dependency that facilitates fine-geometry learning, and 2) the under-constrained voxel grids lack spatial coherence and are vulnerable to local minima. In this work, we present Voxurf, a voxel-based surface reconstruction approach that is both efficient and accurate. Voxurf addresses the aforementioned issues via several key designs: 1) a two-stage training procedure that attains a coherent coarse shape and then recovers fine details; 2) a dual color network that maintains the color-geometry dependency; and 3) a hierarchical geometry feature that encourages information propagation across voxels. Extensive experiments show that Voxurf achieves high efficiency and high quality at the same time. On the DTU benchmark, Voxurf achieves higher reconstruction quality with a 20x training speedup compared to previous fully implicit methods.

1. INTRODUCTION

Neural surface reconstruction based on multi-view images has recently seen dramatic progress. Inspired by the success of Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) on Novel View Synthesis (NVS), recent works follow the neural volume rendering scheme and represent the 3D geometry with a signed distance function (SDF) or an occupancy field via a fully implicit model (Oechsle et al., 2021; Yariv et al., 2021; Wang et al., 2021). These approaches train a deep multilayer perceptron (MLP) that takes in hundreds of sampled points on each camera ray and outputs the corresponding color and geometry information. Pixel-wise supervision is then applied by measuring the difference between the accumulated color on each ray and the ground truth. Because a pure MLP-based framework struggles to learn all the geometric and color details, these methods require hours of training for a single scene, which substantially limits their real-world applications. Recent advances in NeRF accelerate training with the aid of an explicit volumetric representation (Sun et al., 2022a; Yu et al., 2022; Chen et al., 2022). These works directly store and optimize the geometry and color information in explicit voxel grids: the density of a queried point can be readily interpolated from its eight neighboring grid points, and the view-dependent color is either represented with spherical harmonic coefficients (Yu et al., 2022) or predicted by shallow MLPs that take learnable grid features as auxiliary inputs (Sun et al., 2022a). These approaches achieve competitive rendering performance at a much lower training cost (< 20 minutes). However, their 3D surface reconstruction results cannot faithfully represent the exact geometry, suffering from conspicuous noise and holes (Fig. 1 (a)). This stems from the inherent ambiguity of the density-based volume rendering scheme, and the explicit volumetric representation introduces additional challenges. In this work, we aim to take advantage of the explicit volumetric representation for efficient training and propose customized designs to achieve high-quality surface reconstruction. A straightforward idea for this purpose is to embed the SDF-based volume rendering scheme (Wang et al., 2021; Yariv et al., 2021) into an explicit volumetric representation framework (Sun et al., 2022a). However, we find that this naïve baseline model does not work well: it loses most of the geometry details and produces undesired noise (Fig. 1 (c)). We reveal several critical issues with this framework as follows.
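The voxel-grid mechanics described above, interpolating a queried value from the eight surrounding grid points and alpha-compositing the samples along each ray, can be sketched in a few lines of numpy. This is an illustrative sketch, not code from any of the cited systems; function names and the boundary-clamping policy are ours:

```python
import numpy as np

def trilinear_interp(grid, pt):
    """Interpolate a scalar voxel grid at a continuous point.

    grid: (X, Y, Z) array of per-voxel values (e.g. density).
    pt:   (3,) continuous coordinate in voxel units.
    The value is blended from the 8 surrounding grid corners.
    """
    i0 = np.floor(pt).astype(int)
    i0 = np.clip(i0, 0, np.array(grid.shape) - 2)  # keep the 2x2x2 cell in bounds
    fx, fy, fz = pt - i0                           # fractional offsets in the cell
    x, y, z = i0
    c00 = grid[x, y, z] * (1 - fx) + grid[x + 1, y, z] * fx
    c01 = grid[x, y, z + 1] * (1 - fx) + grid[x + 1, y, z + 1] * fx
    c10 = grid[x, y + 1, z] * (1 - fx) + grid[x + 1, y + 1, z] * fx
    c11 = grid[x, y + 1, z + 1] * (1 - fx) + grid[x + 1, y + 1, z + 1] * fx
    c0 = c00 * (1 - fy) + c10 * fy
    c1 = c01 * (1 - fy) + c11 * fy
    return c0 * (1 - fz) + c1 * fz

def render_ray(densities, colors, deltas):
    """Density-based volume rendering: alpha-composite samples along one ray.

    densities: (N,) non-negative sample densities.
    colors:    (N, 3) sample colors.
    deltas:    (N,) distances between consecutive samples.
    """
    alpha = 1.0 - np.exp(-densities * deltas)      # per-sample opacity
    # transmittance: probability the ray reaches each sample unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]
    weights = alpha * trans
    return (weights[:, None] * colors).sum(axis=0)
```

Pixel-wise supervision then compares this accumulated color against the ground-truth pixel; both the grid values and any auxiliary features are optimized directly by gradient descent.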
First, in fully implicit models, the color network takes surface normals as inputs, effectively building a color-geometry dependency that facilitates fine-geometry learning. In the baseline model, however, the color network tends to depend more on the additional, under-constrained voxel feature grid input, which breaks this dependency. Second, due to the high degree of freedom in optimizing a voxel grid, it is hard to maintain a globally coherent shape without additional constraints: optimizing each voxel point individually hinders information sharing across the grid, which hurts surface smoothness and introduces local minima. We unveil these effects and motivate our architecture design via an empirical study in Sec. 4.

To tackle these challenges, we introduce Voxurf, an efficient pipeline for accurate voxel-based surface reconstruction: 1) we leverage a two-stage training process that attains a coherent coarse shape and then recovers fine details; 2) we design a dual color network that represents a complex color field via a voxel grid while preserving the color-geometry dependency through two sub-networks that work in synergy; 3) we propose a hierarchical geometry feature based on the SDF voxel grid to encourage information sharing over a larger region for stable optimization; 4) we introduce several effective regularization terms to boost smoothness and reduce noise.

We conduct experiments on the DTU (Jensen et al., 2014) and BlendedMVS (Yao et al., 2020) datasets for quantitative and qualitative evaluations. As illustrated in Fig. 1, our method is superior in preserving high-frequency details in both geometry reconstruction and image rendering compared to previous approaches. In summary, our contributions are highlighted below:

1. Our approach enables an around 20x training speedup compared to SOTA methods, reducing the training time from over 5 hours to 15 minutes on a single Nvidia A100 GPU.

2. Our approach achieves higher surface reconstruction fidelity and novel view synthesis quality, and is superior in representing fine details for both surface recovery and image rendering compared to previous methods.
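The hierarchical geometry feature is described here only at a high level. As one plausible illustration (the exact aggregation used by Voxurf is not specified in this section and may differ), a multi-scale feature for a voxel could combine its own SDF value with averages of the SDF grid over progressively larger neighborhoods, so that gradients at one voxel are informed by a wider region:

```python
import numpy as np

def hierarchical_sdf_feature(sdf_grid, idx, radii=(1, 2, 4)):
    """Hypothetical sketch of a multi-scale geometry feature for one voxel.

    For each radius r, average the SDF values over the (2r+1)^3 neighborhood
    of voxel index `idx` (clipped at the grid boundary); concatenating the
    centre value with these averages mixes local and larger-scale geometry.
    """
    idx = np.asarray(idx)
    feats = [sdf_grid[tuple(idx)]]                      # finest level: the voxel itself
    for r in radii:
        lo = np.maximum(idx - r, 0)
        hi = np.minimum(idx + r + 1, np.array(sdf_grid.shape))
        region = sdf_grid[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
        feats.append(region.mean())                     # coarser level: neighborhood mean
    return np.array(feats)
```

Because each coarse entry touches many voxels, supervising a network on such a feature propagates information across the grid rather than updating each voxel in isolation, which is the stabilizing effect the design above is after.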



Figure 1: Comparisons among different methods for surface reconstruction and novel view synthesis. (a) DVGO (v2) (Sun et al., 2022a;b) benefits from the fastest convergence but suffers from a poor surface; (b) NeuS (Wang et al., 2021) produces decent surfaces after a long training time, while high-frequency details are lost in both the geometry and the image; (c) the straightforward combination of DVGO and NeuS produces continuous but noisy surfaces; (d) our method achieves an around 20x speedup over NeuS and recovers high-quality surfaces and images with fine details. All training times are measured on a single Nvidia A100 GPU.
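The SDF-based volume rendering scheme that the naive combination in Fig. 1 (c) borrows from NeuS converts consecutive SDF samples along a ray into opacities through a logistic CDF, so opacity concentrates where the ray crosses the zero level set. A minimal sketch of that conversion (the sharpness parameter `s` and the clipping constant are illustrative choices, not values from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neus_alpha(sdf, s=64.0):
    """NeuS-style opacity from consecutive SDF samples along a ray.

    Phi_s is the logistic CDF with sharpness s; opacity is the clipped
    relative decrease of Phi_s(sdf) between neighboring samples, which
    peaks where the SDF changes sign (the surface crossing).
    """
    phi = sigmoid(s * sdf)                               # Phi_s at each sample
    alpha = (phi[:-1] - phi[1:]) / np.clip(phi[:-1], 1e-6, None)
    return np.clip(alpha, 0.0, 1.0)
```

In the baseline, `sdf` would come from trilinear interpolation of the SDF voxel grid; the resulting alphas feed the same compositing as density-based rendering, which is exactly what makes the combination straightforward to implement yet, as shown in Fig. 1 (c), insufficient on its own.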

Experimental results on the DTU (Jensen et al., 2014) and BlendedMVS (Yao et al., 2020) datasets demonstrate, both quantitatively and qualitatively, that Voxurf achieves a lower Chamfer Distance on the DTU benchmark than NeuS (Wang et al., 2021), a competitive fully implicit method, with around a 20x speedup. It also achieves remarkable results on the auxiliary task of NVS.
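The Chamfer Distance used for the comparison above measures bidirectional nearest-neighbor distance between the reconstructed and reference point sets. A brute-force sketch of the symmetric L2 variant (the official DTU protocol additionally applies visibility masks and distance thresholds, which this sketch omits):

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3).

    For each set, average the distance from every point to its nearest
    neighbor in the other set, then sum the two directions. O(N*M) memory,
    so real evaluations use a KD-tree instead of a dense distance matrix.
    """
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```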

