HIVE: HIERARCHICAL VOLUME ENCODING FOR NEURAL IMPLICIT SURFACE RECONSTRUCTION

Abstract

Neural implicit surface reconstruction has become a new trend in recovering detailed 3D shapes from images. In previous methods, however, the 3D scene is encoded only by MLPs, which lack an explicit 3D structure. To better represent 3D shapes, we introduce a volume encoding that explicitly encodes the spatial information. We further design hierarchical volumes to encode the scene structure at multiple scales. The high-resolution volumes capture high-frequency geometry details, since spatially varying features can be learned for different 3D points, while the low-resolution volumes enforce spatial consistency to keep the shape smooth, since adjacent locations share the same low-resolution feature. In addition, we adopt a sparse structure to reduce the memory consumption of the high-resolution volumes, and two regularization terms to enhance the smoothness of the results. This hierarchical volume encoding can be appended to any implicit surface reconstruction method as a plug-and-play module, and yields a smooth and clean reconstruction with more details. Superior performance is demonstrated on the DTU, EPFL, and BlendedMVS datasets, with significant improvements on the standard metrics. The code of our method will be made public.

1. INTRODUCTION

Surface reconstruction, or image-based modeling (Tan, 2021), from multi-view images is a heavily studied classic task in computer vision. Given multiple images of an object from different views, together with the corresponding camera parameters, this task aims to recover the accurate 3D surface of the target object. Traditional methods (Fuhrmann et al., 2014) usually solve a depth map for each input image and then fuse (Kazhdan et al., 2006) those depth maps to build a complete surface model. With the rise of deep networks, many methods exploit neural networks to directly learn the mapping from 2D images to 3D surfaces (Murez et al., 2020; Sun et al., 2021). These learning-based methods skip the intermediate depth map estimation and are highly efficient even on unseen objects and scenes, but they usually recover only coarse-scale geometry and miss most of the high-frequency surface details. Recently, many methods (Yariv et al., 2020; Wang et al., 2021; Yariv et al., 2021; Darmon et al., 2022) represent the 3D surface with implicit functions, such as signed distance fields (SDF) (Wang et al., 2021), and then leverage the neural radiance field (NeRF) (Mildenhall et al., 2020) to render the implicit geometry into color images. Thus, the difference between the rendered images and the input images can be used to optimize the neural radiance field as well as the implicit geometry, and strong results have been demonstrated. However, in these methods, the 3D shape is implicitly encoded in multi-layer perceptrons (MLPs). Although MLPs are compact and memory efficient, they do not have an explicit 3D structure, which may cause difficulties in optimizing the target 3D shape, as observed in mesh and point-cloud processing (Chibane et al., 2020; Peng et al., 2020). Furthermore, it is also known (Sun et al., 2021) that compact MLPs can hardly encode all the geometry details, so the recovered surface is prone to be over-smooth.
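To make the rendering step concrete, the following is a minimal NumPy sketch of how an implicit SDF can be volume-rendered into a pixel color so that an image loss can supervise the geometry. The logistic-CDF-based alpha follows the style of NeuS (Wang et al., 2021); all function names, the toy sphere scene, and the sampling range are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def render_ray(sdf_fn, color_fn, origin, direction, n_samples=64, s=64.0):
    """Volume-render one ray from an implicit SDF (NeuS-style sketch).

    Alpha at each segment is derived from the decrease of the logistic CDF
    of the SDF along the ray, so the rendered color depends (differentiably,
    in a real implementation) on the underlying geometry.
    """
    t = np.linspace(0.5, 2.5, n_samples)             # sample depths along the ray
    pts = origin[None, :] + t[:, None] * direction[None, :]
    sdf = sdf_fn(pts)                                # signed distances, shape (n,)
    cdf = sigmoid(s * sdf)                           # logistic CDF of the SDF
    alpha = np.clip((cdf[:-1] - cdf[1:]) / (cdf[:-1] + 1e-6), 0.0, 1.0)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]  # transmittance
    weights = trans * alpha                          # per-segment contribution
    colors = color_fn(pts[:-1])                      # shape (n-1, 3)
    return (weights[:, None] * colors).sum(axis=0)

# Toy scene: a unit sphere at the origin with constant red color.
sphere_sdf = lambda p: np.linalg.norm(p, axis=-1) - 1.0
red = lambda p: np.tile([1.0, 0.0, 0.0], (len(p), 1))
c = render_ray(sphere_sdf, red, np.array([0.0, 0.0, -2.0]),
               np.array([0.0, 0.0, 1.0]))            # nearly pure red
```

In this toy setup the rendering weights concentrate where the SDF crosses zero, so the returned color is close to the surface color; in the actual methods, `sdf_fn` and `color_fn` are MLPs optimized through this rendering process.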
To solve this problem, some methods employ feature volumes or feature hash tables to help the MLPs encode the 3D space (Fridovich-Keil et al., 2022; Sun et al., 2022a; Chen et al., 2022; Müller et al., 2022), which can directly encode the geometry of each 3D position faithfully and unambiguously. However, existing methods have some problems. In the volume encoding methods (Fridovich-Keil et al., 2022; Sun et al., 2022a; Chen et al., 2022), there is usually only one high-resolution volume, in which case each voxel is updated in isolation. Due to the high degree of freedom in optimizing a single voxel, it is hard to maintain a globally coherent shape, as shown in Figure 2 (c). In the hash-table encoding method (Müller et al., 2022), hash collisions can cause geometry defects, as shown in Figure 2 (a) (b).

To address these issues, we introduce a hierarchical volume encoding of the 3D space. This hierarchical structure has three advantages. First, the high-resolution feature volumes help to recover high-frequency geometry details at the corresponding locations. Second, each voxel in the low-resolution volumes covers a large region of space, which enforces spatial consistency and keeps the shape smooth. Third, the hierarchical structure allows us to use low-dimensional features in the high-resolution volumes, which helps to reduce memory consumption. To further reduce memory consumption, we sparsify the high-resolution volumes using a preliminary surface reconstruction computed from the low-resolution volumes, keeping only the voxels near that preliminary surface. Finally, we design two smoothness terms to facilitate the optimization and keep the reconstructed shape clean.

In the experiments, we demonstrate that simply adding the proposed hierarchical volume encoding significantly improves the performance of different methods. Specifically, on the DTU (Jensen et al., 2014) dataset, the error of NeuS (Wang et al., 2021) is reduced by 25% from 0.84 mm to 0.63 mm, the error of VolSDF (Yariv et al., 2021) is reduced by 23% from 0.86 mm to 0.66 mm, and the error of NeuralWarp (Darmon et al., 2022) is reduced by 10% from 0.68 mm to 0.61 mm. Moreover, the error of NeuralWarp is reduced by 31% on the EPFL (Strecha et al., 2008) dataset with the "full" metric. Figure 1 shows an example from the DTU dataset. The color-coded normal maps clearly demonstrate that our method can significantly boost NeuS (Wang et al., 2021) to recover more geometry details while keeping the surface smooth and clean.

The main contributions of this work are summarized as follows:
• We propose a hierarchical volume encoding, which can significantly boost the performance of neural implicit surface reconstruction as a plug-and-play module.
• The hierarchical volume encoding is further improved by employing a sparse structure that reduces memory consumption, and by enforcing two regularization terms that keep the reconstructed surface smooth and clean.
• We demonstrate superior performance on three datasets.

Figure 1: Visualization of normal maps to highlight our advantages in recovering shape details.

2. RELATED WORK

Traditional multi-view surface reconstruction Traditional methods typically follow a long pipeline of structure-from-motion (i.e., camera calibration) (Schönberger & Frahm, 2016), depth map estimation (Barnes et al., 2009), and multi-depth fusion (Kazhdan et al., 2006) to reconstruct surface models from images. Many advanced geometric methods (Jiang et al., 2013; Lhuillier & Quan, 2005; Furukawa & Ponce, 2009; Galliani et al., 2015; Schönberger et al., 2016) have been developed to enhance these different steps over the last two decades.
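The core query of a hierarchical (multi-resolution) volume encoding can be sketched as follows: each 3D point trilinearly interpolates a feature from every resolution level, and the per-level features are concatenated before being fed to an MLP. This is a minimal NumPy sketch under assumed settings; the resolutions, channel counts, and function names are illustrative, not the paper's actual configuration, and the sparse high-resolution structure is omitted.

```python
import numpy as np

def trilinear_lookup(volume, xyz):
    """Trilinearly interpolate a dense feature volume of shape (R, R, R, C)
    at continuous coordinates xyz in [0, 1]^3 (shape (N, 3))."""
    R = volume.shape[0]
    g = np.clip(xyz, 0.0, 1.0) * (R - 1)             # continuous grid coordinates
    lo = np.floor(g).astype(int)                     # lower corner indices
    hi = np.minimum(lo + 1, R - 1)                   # upper corner indices
    w = g - lo                                       # interpolation weights, (N, 3)
    out = 0.0
    for dx in (0, 1):                                # blend the 8 corner features
        for dy in (0, 1):
            for dz in (0, 1):
                ix = hi[:, 0] if dx else lo[:, 0]
                iy = hi[:, 1] if dy else lo[:, 1]
                iz = hi[:, 2] if dz else lo[:, 2]
                wx = w[:, 0] if dx else 1 - w[:, 0]
                wy = w[:, 1] if dy else 1 - w[:, 1]
                wz = w[:, 2] if dz else 1 - w[:, 2]
                out = out + (wx * wy * wz)[:, None] * volume[ix, iy, iz]
    return out

def hierarchical_encoding(volumes, xyz):
    """Concatenate interpolated features from all resolution levels."""
    return np.concatenate([trilinear_lookup(v, xyz) for v in volumes], axis=-1)

# Three hypothetical levels: coarse 8^3, medium 32^3, fine 128^3,
# with 8, 4, and 2 feature channels respectively (low-dimensional at high res).
rng = np.random.default_rng(0)
vols = [rng.standard_normal((r, r, r, c)) for r, c in [(8, 8), (32, 4), (128, 2)]]
feat = hierarchical_encoding(vols, rng.random((5, 3)))   # (5, 14) feature vectors
```

Because nearby points fall inside the same coarse voxel, their coarse-level features are nearly identical (enforcing smoothness), while the fine levels can vary between those same points (capturing detail), which is the trade-off the hierarchical design exploits.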

