HIVE: HIERARCHICAL VOLUME ENCODING FOR NEURAL IMPLICIT SURFACE RECONSTRUCTION

Abstract

Neural implicit surface reconstruction has become a popular approach for recovering detailed 3D shapes from images. In previous methods, however, the 3D scene is encoded only by MLPs, which lack an explicit 3D structure. To better represent 3D shapes, we introduce a volume encoding that explicitly encodes spatial information. We further design hierarchical volumes to encode the scene structure at multiple scales. The high-resolution volumes capture high-frequency geometric details, since spatially varying features can be learned at different 3D points, while the low-resolution volumes enforce spatial consistency to keep the shape smooth, since adjacent locations share the same low-resolution feature. In addition, we adopt a sparse structure to reduce memory consumption at the high-resolution volumes, and two regularization terms to enhance the smoothness of the results. This hierarchical volume encoding can be appended to any implicit surface reconstruction method as a plug-and-play module, yielding smoother, cleaner reconstructions with more details. Superior performance is demonstrated on the DTU, EPFL, and BlendedMVS datasets, with significant improvements on the standard metrics. The code of our method will be made public.
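The hierarchical volume encoding described above can be sketched as a multi-resolution feature lookup: each 3D query point is trilinearly interpolated in every feature volume, and the per-level features are concatenated before being fed to the MLP. The following minimal NumPy illustration is a simplified sketch under assumed resolutions and channel counts (it omits the sparse structure and regularization terms); all function names are hypothetical, not from the paper:

```python
import numpy as np

def trilinear_interp(volume, p):
    """Trilinearly interpolate a dense feature volume at a continuous point.

    volume: (R, R, R, C) feature grid; p: query point in [0, 1]^3.
    """
    R = volume.shape[0]
    x = p * (R - 1)                              # voxel coordinates
    i0 = np.clip(np.floor(x).astype(int), 0, R - 2)
    w = x - i0                                   # fractional offsets per axis
    feat = np.zeros(volume.shape[-1])
    # Weighted sum over the 8 corners of the enclosing voxel.
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                weight = ((w[0] if dx else 1 - w[0])
                          * (w[1] if dy else 1 - w[1])
                          * (w[2] if dz else 1 - w[2]))
                feat += weight * volume[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return feat

def hierarchical_encoding(volumes, p):
    """Concatenate interpolated features from all resolution levels."""
    return np.concatenate([trilinear_interp(v, p) for v in volumes])

# Example: three levels with (hypothetical) resolutions 4, 16, 64
# and 2 feature channels each.
rng = np.random.default_rng(0)
volumes = [rng.standard_normal((R, R, R, 2)) for R in (4, 16, 64)]
feat = hierarchical_encoding(volumes, np.array([0.3, 0.7, 0.1]))
# feat has 3 levels x 2 channels = 6 dimensions; an SDF MLP would
# consume this vector (typically together with the point itself).
```

Nearby points fall into the same coarse voxel and thus receive nearly identical low-resolution features, which is how the coarse levels encourage smoothness, while the fine levels vary rapidly enough to represent high-frequency detail.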

1. INTRODUCTION

Surface reconstruction, or image-based modeling (Tan, 2021), from multi-view images is a heavily studied classic task in computer vision. Given multiple images of an object from different views, together with the corresponding camera parameters, the task aims to recover an accurate 3D surface of the target object. Traditional methods (Fuhrmann et al., 2014) usually estimate a depth map for each input image and then fuse (Kazhdan et al., 2006) those depth maps to build a complete surface model. With the rise of deep networks, many methods exploit neural networks to directly learn the mapping from 2D images to 3D surfaces (Murez et al., 2020; Sun et al., 2021). These learning-based methods skip the intermediate depth-map estimation and are highly efficient even on unseen objects and scenes, but they usually recover only coarse-scale geometry and miss most of the high-frequency surface details. Recently, many methods (Yariv et al., 2020; Wang et al., 2021; Yariv et al., 2021; Darmon et al., 2022)



Figure 1: Visualization of normal maps to highlight our advantages in recovering shape details.

