HIVE: HIERARCHICAL VOLUME ENCODING FOR NEURAL IMPLICIT SURFACE RECONSTRUCTION

Abstract

Neural implicit surface reconstruction has become a new trend in reconstructing detailed 3D shapes from images. In previous methods, however, the 3D scene is encoded only by MLPs, which have no explicit 3D structure. To better represent 3D shapes, we introduce a volume encoding that explicitly encodes spatial information. We further design hierarchical volumes to encode the scene structure at multiple scales. The high-resolution volumes capture high-frequency geometric details, since spatially varying features can be learned for different 3D points, while the low-resolution volumes enforce spatial consistency and keep the shape smooth, since adjacent locations share the same low-resolution feature. In addition, we adopt a sparse structure to reduce the memory consumption of the high-resolution volumes, and two regularization terms to enhance the smoothness of the results. This hierarchical volume encoding can be appended to any implicit surface reconstruction method as a plug-and-play module, and generates smooth, clean reconstructions with more details. Superior performance is demonstrated on the DTU, EPFL, and BlendedMVS datasets, with significant improvements on the standard metrics. The code of our method will be made public.

1. INTRODUCTION

Figure 1: Visualization of normal maps to highlight our advantages in recovering shape details. Panels: Image, NeuS, NeuS+Ours.

Surface reconstruction from multi-view images, or image-based modeling (Tan, 2021), is a heavily studied classic task in computer vision. Given multiple images of an object from different views together with the corresponding camera parameters, the task is to recover an accurate 3D surface of the target object. Traditional methods (Fuhrmann et al., 2014) usually solve a depth map for each input image and then fuse (Kazhdan et al., 2006) those depth maps to build a complete surface model. With the rise of deep networks, many methods exploit neural networks to directly learn the mapping from 2D images to 3D surfaces (Murez et al., 2020; Sun et al., 2021). These learning-based methods skip the intermediate depth map estimation and are highly efficient even on unseen objects and scenes, but they usually recover only coarse-scale geometry and miss most high-frequency surface details. Recently, many methods (Yariv et al., 2020; Wang et al., 2021; Yariv et al., 2021; Darmon et al., 2022) have achieved highly accurate reconstruction results based on neural implicit surfaces. These methods represent the 3D shape with an implicit function, such as occupancy (Oechsle et al., 2021) or signed distance fields (SDF) (Wang et al., 2021), and then leverage the neural radiance field (NeRF) (Mildenhall et al., 2020) to render the implicit geometry into color images. The difference between the rendered images and the input images can thus be used to optimize the neural radiance field as well as the implicit geometry. Strong results have been demonstrated. However, in these methods, the 3D shape is implicitly encoded in multi-layer perceptrons (MLPs).
Although MLPs are compact and memory efficient, they do not have an explicit 3D structure, which may cause difficulties in optimizing the target 3D shape, as is observed in mesh and point-cloud processing (Chibane et al., 2020; Peng et al., 2020). Furthermore, it is also known (Sun et al., 2021) that compact MLPs struggle to encode all the geometric details, so the recovered surface is prone to being over-smooth. To solve this problem, some methods employ feature volumes or feature hash tables to help MLPs encode the 3D space (Fridovich-Keil et al., 2022; Sun et al., 2022a; Chen et al., 2022; Müller et al., 2022), which can directly encode the geometry at each 3D position faithfully and unambiguously. However, existing methods have some problems. The volume encoding methods (Fridovich-Keil et al., 2022; Sun et al., 2022a; Chen et al., 2022) usually use only one high-resolution volume, in which case each voxel is updated in isolation. Because of the high degree of freedom in optimizing each voxel, it is hard to maintain a globally coherent shape, as shown in Figure 2 (c). The hash-table encoding methods (Müller et al., 2022) suffer from hash collisions, which can cause geometric defects, as shown in Figure 2 (a) (b). To address these issues, we introduce a hierarchical volume encoding of the 3D space. This hierarchical structure has three advantages. First, the high-resolution feature volume helps to reason about high-frequency geometric details at the corresponding locations. Second, each voxel in the low-resolution volumes covers a large region of space, which enforces spatial consistency and keeps the shape smooth. Third, the hierarchical structure allows us to use low-dimensional features in the high-resolution volume, which helps to reduce memory consumption.
To further reduce memory consumption, we sparsify the high-resolution volumes using the preliminary surface reconstruction computed from the low-resolution volumes, keeping only voxels near the preliminary surface. Finally, we design two smoothness terms that help the optimization produce a clean reconstructed shape. In the experiments, we demonstrate that simply adding the proposed hierarchical volume encoding significantly improves the performance of different methods. Specifically, on the DTU (Jensen et al., 2014) dataset, the error of NeuS (Wang et al., 2021) is reduced by 25% from 0.84 mm to 0.63 mm, the error of VolSDF (Yariv et al., 2021) is reduced by 23% from 0.86 mm to 0.66 mm, and the error of NeuralWarp (Darmon et al., 2022) is reduced by 10% from 0.68 mm to 0.61 mm. Moreover, the error of NeuralWarp is reduced by 31% on the EPFL (Strecha et al., 2008) dataset under the "full" metric. Figure 1 shows an example from the DTU dataset. The color-coded normal map clearly demonstrates that our method significantly boosts NeuS (Wang et al., 2021), recovering more geometric details while keeping the surface smooth and clean.

The main contributions of this work are summarized as follows:
• We propose a hierarchical volume encoding, which can significantly boost the performance of neural implicit surface reconstruction as a plug-and-play module.
• The hierarchical volume encoding is further improved by employing a sparse structure, which reduces the memory consumption, and by enforcing two regularization terms that keep the reconstructed surface smooth and clean.
• We demonstrate superior performance on three datasets.

2. RELATED WORK

Traditional multi-view surface reconstruction. Traditional methods typically follow a long pipeline of structure-from-motion (i.e., camera calibration) (Schönberger & Frahm, 2016), depth map estimation (Barnes et al., 2009), and multi-depth fusion (Kazhdan et al., 2006) to reconstruct surface models from images. Many advanced geometric methods (Jiang et al., 2013; Lhuillier & Quan, 2005; Furukawa & Ponce, 2009; Galliani et al., 2015; Schönberger et al., 2016) have been developed to enhance these steps over the last two decades. With the rise of deep networks, almost all steps in the conventional pipeline have been revolutionized, including feature extraction and matching (DeTone et al., 2018; Sarlin et al., 2020), structure-from-motion (Ummenhofer et al., 2017; Tang & Tan, 2019), and depth map estimation (Yao et al., 2018; Gu et al., 2020; Tang et al., 2022). To skip the intermediate depth map estimation, some methods take multiple images with known camera poses and directly regress a volumetric prediction end-to-end, such as an occupancy volume (Ji et al., 2017) or a TSDF volume (Murez et al., 2020; Sun et al., 2021). While these methods simplify the overall pipeline and generalize to unseen objects and scenes, they often learn to recover only coarse-scale geometry and miss many surface details.

Implicit surface reconstruction. A neural radiance field (NeRF) (Mildenhall et al., 2020) encodes a 3D scene implicitly in a neural network. The network is optimized to match the rendered images to the input images, so that it can generate high-quality novel view synthesis results. However, since no constraint is imposed on the 3D geometry, the surface extracted from the implicit network usually has defects caused by the tuning of the density threshold (Oechsle et al., 2021). To address this problem, some methods first represent the 3D shape with an implicit geometry network, such as occupancy (Niemeyer et al., 2020; Oechsle et al., 2021) or signed distance fields (Yariv et al., 2020; Wang et al., 2021; Yariv et al., 2021; Darmon et al., 2022; Sun et al., 2022b; Long et al., 2022; Fu et al., 2022; Yu et al., 2022), and then convert the output of the geometry network into a density function, with which the radiance network can render color images. In this way, the radiance network and the geometry network can be optimized together by matching the rendered and input images. In these methods, the 3D geometry is encoded directly in the MLPs without any explicit 3D spatial information. To assist the MLPs, some methods propose to encode the 3D space with a single geometric volume (Fridovich-Keil et al., 2022; Sun et al., 2022a; Chen et al., 2022; Martel et al., 2021) or with hash tables (Müller et al., 2022). However, a single volume may produce noise because of the high degree of freedom in optimizing each voxel, and hash tables may produce hash collisions that lead to defects, as displayed in Figure 2. In this paper, we introduce a hierarchical volume encoding that explicitly encodes the 3D spatial information, so our method can reconstruct more surface details while keeping the shape globally coherent. A similar structure is proposed by Takikawa et al. (2021) for the SDF encoding task; it sums the features from an octree with a large number of feature channels, whereas we concatenate the features from multiple volumes with few feature channels, so our method consumes less memory.

3.1. OVERVIEW

The overall framework of our method is illustrated in Figure 3. Given multiple images $\{I_i\}_{i=1}^{N}$ of an object and the corresponding camera poses, the task is to reconstruct the 3D surface of this object. The 3D shape is represented by an implicit SDF network, with which another implicit color network renders images using neural volume rendering. To enhance the representation of spatial information, we add a hierarchical volume encoding as the input of the SDF network, which embeds the 3D space at multiple scales. To reduce memory consumption, we also introduce a sparse structure for the high-resolution volumes. Finally, the two implicit networks and the volume encoding are optimized by minimizing the difference between the rendered images and the input images together with two regularization terms.

3.2. IMPLICIT SURFACE RECONSTRUCTION BASED ON VOLUME RENDERING

Neural implicit volume rendering was first introduced by Mildenhall et al. (2020) for novel view rendering and then used for surface reconstruction (Wang et al., 2021; Yariv et al., 2021; Darmon et al., 2022). Our method can be added to any of these methods for better reconstruction. Here we take NeuS (Wang et al., 2021) as an example, which uses two multi-layer perceptrons (MLPs) as two functions representing an object: the SDF network $sdf : \mathbb{R}^3 \to \mathbb{R}$, which maps a spatial position $\mathbf{x} \in \mathbb{R}^3$ to its signed distance to the object surface, and the color network $c : \mathbb{R}^3 \times \mathbb{S}^2 \to \mathbb{R}^3$, which maps a spatial point $\mathbf{x}$ and a viewing direction $\mathbf{v} \in \mathbb{S}^2$ to a color value. The surface $\mathcal{S}$ is then represented by the zero-level set of the SDF:

$$\mathcal{S} = \{\mathbf{x} \in \mathbb{R}^3 \mid sdf(\mathbf{x}) = 0\}.$$

To optimize the SDF network, images are rendered from the color network together with a weighting function computed from the SDF network. The color of the ray with origin $\mathbf{o}$ and direction $\mathbf{v}$ is calculated as

$$C(\mathbf{o}, \mathbf{v}) = \int_0^{+\infty} w(t)\, c(\mathbf{p}(t), \mathbf{v})\, dt,$$

where $\mathbf{p}(t) = \mathbf{o} + t\mathbf{v}$ is the point at distance $t$ along the ray and $w$ is the weighting function computed from the density given by the SDF network. To make the weighting function unbiased and occlusion-aware, it is calculated as

$$w(t) = T(t)\,\rho(t), \quad \text{where} \quad T(t) = \exp\Big(-\int_0^t \rho(u)\, du\Big), \tag{4}$$

and

$$\rho(t) = \max\Bigg(\frac{-\frac{d\Phi_s}{dt}\big(sdf(\mathbf{p}(t))\big)}{\Phi_s\big(sdf(\mathbf{p}(t))\big)},\ 0\Bigg), \qquad \Phi_s(x) = (1 + e^{-sx})^{-1}. \tag{5}$$
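In practice the rendering integral is evaluated on discrete samples along each ray. The following is a minimal sketch of that step, assuming the discretization described in the NeuS paper: the opacity of the interval between samples $i$ and $i+1$ is $\alpha_i = \max\big((\Phi_s(sdf_i) - \Phi_s(sdf_{i+1})) / \Phi_s(sdf_i),\ 0\big)$ and the weight is $T_i \alpha_i$ with $T_i = \prod_{j<i}(1 - \alpha_j)$. The function names, the sharpness value `s = 64`, and the toy ray are illustrative assumptions and do not come from this paper.

```python
import math


def sigmoid(x, s):
    """Phi_s(x) = 1 / (1 + exp(-s * x)), written to avoid overflow."""
    z = s * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def neus_weights(sdf_samples, s=64.0):
    """Discrete rendering weights w_i = T_i * alpha_i for one ray."""
    weights = []
    transmittance = 1.0  # T_i: product of (1 - alpha_j) over earlier intervals
    for sdf_a, sdf_b in zip(sdf_samples[:-1], sdf_samples[1:]):
        phi_a, phi_b = sigmoid(sdf_a, s), sigmoid(sdf_b, s)
        alpha = max((phi_a - phi_b) / phi_a, 0.0) if phi_a > 0.0 else 0.0
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def render_color(weights, colors):
    """Weighted sum of per-interval RGB colors (the discrete form of C(o, v))."""
    return tuple(sum(w * c[k] for w, c in zip(weights, colors)) for k in range(3))


# Toy ray that hits a plane at depth t = 1.0, so sdf(p(t)) = 1.0 - t.
depths = [0.05 * i for i in range(41)]                     # t in [0, 2]
sdf_samples = [1.0 - t for t in depths]
weights = neus_weights(sdf_samples)

# Depth estimate from the interval midpoints; it should sit at the surface.
midpoints = [0.5 * (a + b) for a, b in zip(depths[:-1], depths[1:])]
expected_depth = sum(w * m for w, m in zip(weights, midpoints))
color = render_color(weights, [(1.0, 0.5, 0.0)] * len(weights))
```

In this toy case the weights concentrate on the two intervals adjacent to the zero crossing and sum to approximately one, which is the behavior the unbiased, occlusion-aware weighting is designed to give.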

3.3. HIERARCHICAL VOLUME ENCODING

Previous methods usually encode all the information about a 3D object in the MLPs. To assist the MLPs, we build a 3D feature volume as their input, which explicitly encodes 3D spatial information. This feature volume naturally encodes knowledge about the 3D space of the object and is optimized jointly with the MLPs through the rendering loss. To enhance the representation ability of the encoding, we also employ a hierarchical mechanism in which 3D feature volumes of different scales are adopted, as shown in Figure 4. In the experiments, we find that a combination of multi-scale volumes works better than a single large volume with wide features. This is reasonable because in a low-resolution volume one voxel represents a large region of space, so the whole region shares the same code. The shared code smooths this region and prevents the 3D shape from breaking apart, which is important in surface reconstruction. In the basic setting, we employ eight feature volumes for the 3D space encoding, whose resolutions increase from $2 \times 2 \times 2 \times 4$ to $256 \times 256 \times 256 \times 4$, where the last dimension is the number of feature channels. When a point $\mathbf{x}$ is rendered, the features at the corresponding position are trilinearly interpolated in each of the eight volumes and concatenated to form $F(\mathbf{x})$, which serves as the input of the MLPs. Therefore, the surface becomes $\mathcal{S} = \{\mathbf{x} \in \mathbb{R}^3 \mid sdf(F(\mathbf{x})) = 0\}$. The higher the resolution of the volume, the more details can be recovered, but a volume larger than $256 \times 256 \times 256$ would consume much more memory.
To solve this problem, a multi-stage optimization scheme is adopted with high-resolution sparse volumes.
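The encoding $F(\mathbf{x})$ described in this section can be sketched as follows. This is an illustrative, pure-Python version rather than the authors' implementation: the vertex-aligned grid convention (a volume of resolution `res` has `res` feature vectors per axis spanning $[0, 1]$), the small random initialization, and the names `make_volume`, `trilinear`, and `encode` are all assumptions. Only four small levels are built here to keep the example light, whereas the basic setting in the text uses eight levels up to 256.

```python
import math
import random


def make_volume(res, channels, rng):
    """Dense feature volume indexed as vol[i][j][k] -> list of `channels` floats."""
    return [[[[rng.uniform(-1e-2, 1e-2) for _ in range(channels)]
              for _ in range(res)] for _ in range(res)] for _ in range(res)]


def trilinear(vol, x):
    """Trilinearly interpolate `vol` at a point x in [0, 1]^3."""
    res = len(vol)
    channels = len(vol[0][0][0])
    # Continuous grid coordinates, lower corner, and fractional offsets.
    grid = [min(max(c, 0.0), 1.0) * (res - 1) for c in x]
    lo = [min(int(math.floor(g)), res - 2) for g in grid]
    frac = [g - l for g, l in zip(grid, lo)]
    out = [0.0] * channels
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                weight = ((frac[0] if dx else 1.0 - frac[0])
                          * (frac[1] if dy else 1.0 - frac[1])
                          * (frac[2] if dz else 1.0 - frac[2]))
                corner = vol[lo[0] + dx][lo[1] + dy][lo[2] + dz]
                for ch in range(channels):
                    out[ch] += weight * corner[ch]
    return out


def encode(volumes, x):
    """F(x): concatenation of the interpolated features from every level."""
    feature = []
    for vol in volumes:
        feature.extend(trilinear(vol, x))
    return feature


rng = random.Random(0)
resolutions = [2, 4, 8, 16]        # the basic setting in the text goes up to 256
volumes = [make_volume(r, 4, rng) for r in resolutions]
feature = encode(volumes, (0.3, 0.5, 0.9))   # 4 levels x 4 channels = 16 values
```

Because the levels are concatenated rather than summed, each level contributes its own few channels to the MLP input, which is what allows the high-resolution levels to stay low-dimensional.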

3.4. SPARSE HIGH RESOLUTION VOLUME

In general, many voxels in a dense volume are unoccupied and invalid. We can therefore save a lot of memory by using the shape reconstructed from the coarse volumes to cull voxels far from the surface in the high-resolution volumes. To do this, we propose a multi-stage optimization framework. In the first stage, we use the basic structure described above, where the resolution of the largest volume is 256. After the first stage, we obtain a coarse surface reconstruction $S_c$. In the second stage, we use $S_c$ to cull unnecessary voxels and obtain the valid voxels $V_h^{valid}$ in the high-resolution volume. Specifically, we first dilate the surface $S_c$ to ensure that all valid voxels near the surface are included. We then only need to optimize the embeddings of the $n$ valid voxels in $V_h^{valid}$ instead of all voxels. We use a simple data structure, an embedding table $T_e$, to store these embeddings. Given a floating-point three-dimensional position $\mathbf{x}$, we obtain the eight surrounding integer coordinates by rounding, fetch the embeddings of these integer points, and fuse them by trilinear interpolation to form the encoding of $\mathbf{x}$. To efficiently extract the corresponding embeddings from $T_e$, we construct an index table $T_i$ that stores indexes into $T_e$. The values in $T_i$ are all initialized to -1, which corresponds to the last embedding in $T_e$. The length of $T_e$ is only $n + 1$ ($n \ll N^3$, where $N$ is the resolution of the volume), so the sparse structure greatly reduces the memory consumption of the high-resolution volumes. More details are given in the appendix.
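The culling step can be illustrated with a small sketch. The text does not specify how the coarse surface is voxelized or how far it is dilated, so the cubic (Chebyshev) dilation, the `radius` parameter, and the toy planar surface below are assumptions made purely for illustration.

```python
from itertools import product


def dilate(surface_voxels, radius, resolution):
    """Return every voxel within `radius` (Chebyshev distance) of a surface voxel."""
    offsets = list(product(range(-radius, radius + 1), repeat=3))
    valid = set()
    for (i, j, k) in surface_voxels:
        for (di, dj, dk) in offsets:
            voxel = (i + di, j + dj, k + dk)
            if all(0 <= c < resolution for c in voxel):
                valid.add(voxel)
    return valid


# Toy coarse surface: the plane z = 16 inside a 32^3 high-resolution grid.
N = 32
surface = {(i, j, 16) for i in range(N) for j in range(N)}
valid = dilate(surface, radius=2, resolution=N)

n = len(valid)      # embeddings that still need to be optimized
dense = N ** 3      # embeddings a dense volume of the same resolution would hold
print(f"valid voxels: {n} of {dense} ({100.0 * n / dense:.1f}%)")
```

The saving grows with resolution, because the number of voxels near a surface scales roughly with $N^2$ while the dense volume scales with $N^3$.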

3.5. LOSS FUNCTION

We equip three previous methods (Wang et al., 2021; Yariv et al., 2021; Darmon et al., 2022) with our hierarchical volume encoding. To optimize the model, we use the losses of the original works without changes, i.e., a rendering loss $L_{color}$, which minimizes the difference between the rendered colors and the input colors, and an Eikonal loss $L_{eik}$ (Gropp et al., 2020), which encourages the gradients of the SDF to have unit norm. For NeuralWarp (Darmon et al., 2022), an additional warping color loss $L_{warp}$ is adopted, which warps views onto each other to enforce multi-view photo-consistency. In addition, we add two regularization terms, $L_{tv}$ and $L_{normal}$, to make the reconstructed surfaces smooth and clean. The total variation (TV) (Rudin & Osher, 1994) regularization is applied to each embedding volume so that adjacent voxels have similar features, which makes the geometry more continuous and compact. It is computed as

$$L_{tv} = \sum_{m} \sum_{i,j,k} \Big[ (V_{i+1,j,k} - V_{i,j,k})^2 + (V_{i,j+1,k} - V_{i,j,k})^2 + (V_{i,j,k+1} - V_{i,j,k})^2 \Big],$$

where $m$ is the number of hierarchical volumes, over which the outer sum runs. The other regularization term, $L_{normal}$, is a smoothness constraint on the normals. For each pixel of the image, we accumulate the normal gradients along the marching ray as

$$N_{grad}(\mathbf{o}, \mathbf{v}) = \int_0^{+\infty} w(t)\, n_{grad}(t)\, dt,$$

where $n_{grad}(t)$ is the gradient of the normal at $\mathbf{p}(t)$, computed as $n_{grad}(t) = \nabla^2 sdf(F(\mathbf{p}(t)))$. The normal regularization $L_{normal}$ is then defined as

$$L_{normal} = \frac{1}{N_{pix}} \sum_{pix} \|N_{grad}\|_2,$$

where $N_{pix}$ is the number of pixels used in one optimization step. The final loss $L$ is thus defined as

$$L = L_{ori} + \lambda_{tv} L_{tv} + \lambda_{normal} L_{normal},$$

where $L_{ori}$ is the original loss of each method, and $\lambda_{tv}$ and $\lambda_{normal}$ are two weighting factors.
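As a concrete reading of the TV term and the final loss, the sketch below evaluates them on small nested-list volumes. It assumes the squared-forward-difference form as it can be read from the text (no square root over the three terms, which may differ from the authors' code), skips neighbours that fall outside the volume, and uses placeholder weights; the values `lambda_tv = 0.01` and `lambda_normal = 0.1` are hypothetical and are not taken from the paper.

```python
def tv_loss(volume):
    """Sum of squared forward differences of one volume indexed volume[i][j][k][c]."""
    ri, rj, rk = len(volume), len(volume[0]), len(volume[0][0])
    total = 0.0
    for i in range(ri):
        for j in range(rj):
            for k in range(rk):
                here = volume[i][j][k]
                # Forward neighbours along each axis; skipped at the far border.
                for ni, nj, nk in ((i + 1, j, k), (i, j + 1, k), (i, j, k + 1)):
                    if ni < ri and nj < rj and nk < rk:
                        there = volume[ni][nj][nk]
                        total += sum((a - b) ** 2 for a, b in zip(there, here))
    return total


def total_loss(loss_ori, volumes, loss_normal, lambda_tv, lambda_normal):
    """L = L_ori + lambda_tv * L_tv + lambda_normal * L_normal."""
    loss_tv = sum(tv_loss(v) for v in volumes)   # outer sum over the hierarchy
    return loss_ori + lambda_tv * loss_tv + lambda_normal * loss_normal


def ramp_volume(res):
    """Single-channel volume whose feature equals the first index (a linear ramp)."""
    return [[[[float(i)] for _ in range(res)] for _ in range(res)] for i in range(res)]


constant = [[[[1.0] for _ in range(4)] for _ in range(4)] for _ in range(4)]
ramp = ramp_volume(4)

loss = total_loss(loss_ori=1.0, volumes=[constant, ramp], loss_normal=0.5,
                  lambda_tv=0.01, lambda_normal=0.1)
```

A constant volume contributes nothing to the TV term, while the ramp is penalized only along the axis on which it varies, which is the smoothing pressure the regularizer is meant to apply between adjacent voxels.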

4.1. IMPLEMENTATION DETAILS

This work is implemented in PyTorch and the experiments are run on Nvidia 2080Ti GPUs. The Adam optimizer with $(\beta_1, \beta_2) = (0.9, 0.999)$ is used to update the network weights. The learning rate for the MLPs is set to 5e-4 and decreased to 1/20 of its initial value, while the learning rates for the volumes with resolutions of 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 are set to 0.01, 0.01, 0.01, 0.01, 0.01, 0.001, 0.001, 1e-4, 1e-4, 1e-4, respectively, and decreased to 1/100 of their initial values. We adopt a three-stage training strategy by default, and the number of optimization iterations is the same as in the original framework. Taking "NeuS + Ours" as an example, there are 300K iterations in total. In the first stage, eight volumes of resolution from 

4.2. EVALUATION

For a fair comparison, we follow previous methods and evaluate our method on the DTU (Jensen et al., 2014), EPFL (Strecha et al., 2008), and BlendedMVS (Yao et al., 2020) benchmarks. The Chamfer L1 distance is used to evaluate the accuracy of the recovered surfaces. This metric is the average of the accuracy, which measures the distance from the reconstructed surface to the ground-truth surface, and the completeness, which measures the distance in the reverse direction.

We first evaluate our method on the DTU dataset and report the quantitative results in Table 1. For a fair comparison, the meshes of all methods are extracted by marching cubes at a resolution of 512. We follow previous work (Darmon et al., 2022; Oechsle et al., 2021) and clean the predicted meshes with visibility masks for a more reasonable evaluation. As shown in Table 1, the accuracy of previous methods is improved significantly by adding the hierarchical volume encoding. Specifically, the error of NeuS (Wang et al., 2021) is reduced by 25%, from 0.84 to 0.63, while the error of NeuralWarp (Darmon et al., 2022) is reduced by 10%, from 0.68 to 0.61. The reconstructed meshes are shown in Figure 6, where the surfaces produced by our method are smooth and clean while containing accurate geometric details. For instance, the details of the house in the first scene of Figure 6 are recovered, especially the shape of the windows and the bricks of the roof, which is even more apparent in the normal map in Figure 1.

We then evaluate our method on the EPFL dataset. For a fair comparison, we follow NeuralWarp and use both the "full" Chamfer distance and the "center" Chamfer distance to evaluate the reconstructed surfaces. The "center" metric evaluates only the center of the scene, cropped by a manually defined box, and thus focuses more on the precision of the reconstruction, while the "full" metric is also influenced by the ground plane and rarely seen points, and thus also reflects the completeness of the reconstruction. As shown in Table 2, the performance of each framework is improved significantly by adding our method. In particular, the "full" metric of NeuralWarp is decreased by 31% to 5.72, which is also a 21% improvement over COLMAP (Schönberger et al., 2016).

Finally, the evaluation is performed on the BlendedMVS dataset, which contains objects with more complex shapes. As shown in Figure 6, our method recovers more detailed shapes than previous methods. These results illustrate that our hierarchical volume encoding helps MLPs encode more complex shapes and improves neural implicit surface reconstruction.
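The Chamfer metric used throughout this section, the average of accuracy and completeness, can be illustrated on two small point sets. This is a brute-force sketch of the definition only; the benchmark evaluation samples points from the meshes and applies visibility masks, none of which is modeled here, and the function names are assumptions.

```python
import math


def mean_nearest_distance(source, target):
    """Average distance from each point in `source` to its nearest point in `target`."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)


def chamfer(reconstructed, ground_truth):
    """Return (accuracy, completeness, Chamfer distance)."""
    accuracy = mean_nearest_distance(reconstructed, ground_truth)      # rec -> GT
    completeness = mean_nearest_distance(ground_truth, reconstructed)  # GT -> rec
    return accuracy, completeness, 0.5 * (accuracy + completeness)


ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
reconstructed = [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0)]   # misses the third point
accuracy, completeness, cd = chamfer(reconstructed, ground_truth)
```

In this toy case the reconstruction is precise where it exists, so the accuracy term is small, but the missing third point raises the completeness term. This is the same distinction the "center" and "full" EPFL metrics emphasize.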

4.3. ABLATION STUDY

We choose "NeuS + Ours" to perform extensive ablation studies to validate the proposed method. All results are reported on the 15 scans of the DTU dataset unless otherwise specified.

Hierarchical volume encoding. To study the effectiveness of the hierarchical volume encoding, we add the basic set of eight volumes to NeuS (Wang et al., 2021) without any other change. As shown in Table 3, the error of the recovered surfaces is significantly reduced from 0.84 to 0.70.

Regularization terms. We also ablate the two smoothness terms to study their effectiveness. As shown in Table 3, they help to further reduce the error to 0.66.

Sparse high resolution volume. We analyze the effect of the resolution of the volumes. As shown in Table 3, the sparse high-resolution embedding volume (with a resolution of 1024 here) further improves the accuracy to 0.63. We also visualize the reconstructed surfaces without (left) and with (right) the sparse high-resolution embedding in Figure 7 (a). The high-resolution embedding clearly generates more shape details, e.g., clearer facial shapes.

Setup of embedding volumes. To further inspect the effect of different combinations of the volume encoding, we perform two experiments on the volume setup. (1) We study the number of volumes. We first compare our hierarchical setup to a single large volume of $256^3 \times 32$, whose number of learnable parameters is about 8 times larger than ours. However, its performance is much worse than ours, as presented on the left of Figure 8. We then keep the number of feature channels fixed at 32 and increase the number of volumes. The resolution of the first volume is set to 256, and the resolution of each additional volume is half that of the previous one. As shown in Figure 8 (left), the accuracy of the reconstructed surface gradually improves as the number of volumes increases. (2) Next, we study the impact of the low-resolution volume.
We keep the resolution of the largest volume fixed at 256 and the number of volumes fixed at 8, and vary the resolution of the smallest volume from 2 to 64. The resolutions of the intermediate volumes are enlarged sequentially by a constant factor of $(256/\text{min res})^{1/7}$, so that seven successive enlargements lead from the smallest resolution to 256 (for a smallest resolution of 2 this factor is 2, which gives the basic setting). From the results shown in Figure 8 (right), we can see that the performance decreases when the lowest-resolution volume starts from a larger resolution. This is intuitive because the low-resolution volumes enforce spatial smoothness, and starting from a lower resolution helps to enforce smoothness across larger areas.

Layer number of MLPs. Finally, we experiment with the number of MLP layers in the SDF network to inspect its effect. When only half of the layers are used, the Chamfer distance increases to 0.71. More details are included in the appendix.
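The resolution schedules compared in experiment (2) can be generated as in the sketch below. It assumes the enlargement factor is chosen so that the last of the `num_volumes` levels lands exactly on the largest resolution, i.e. there are `num_volumes - 1` enlargement steps, and that intermediate resolutions are rounded to the nearest integer. Both choices and the function name are assumptions, since the text does not state how fractional resolutions are handled.

```python
def resolution_schedule(min_res, max_res=256, num_volumes=8):
    """Geometric progression of volume resolutions from min_res up to max_res."""
    factor = (max_res / min_res) ** (1.0 / (num_volumes - 1))
    return [int(round(min_res * factor ** level)) for level in range(num_volumes)]


# Smallest resolutions spanning the range studied in the ablation (2 to 64).
for min_res in (2, 8, 64):
    print(min_res, resolution_schedule(min_res))
```

With `min_res = 2` the factor is exactly 2 and the schedule reproduces the basic setting of 2, 4, ..., 256; larger starting resolutions compress the hierarchy toward the fine end and leave fewer coarse levels to enforce smoothness over large regions.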

4.4. COMPARISON TO HASH ENCODING

A similar hierarchical feature encoding is adopted in Instant-NGP (Müller et al., 2022), but the features are stored in hash tables, so hash collisions are inevitable. Although this is efficient in both memory and convergence speed, hash collisions may result in defects in the implicit geometry. To compare with hash encoding, we perform two experiments: one uses the original version of Instant-NGP, and the other adds the hash encoding to the NeuS framework. Some of the results are presented in Figure 7 (c) and Table 1; more details and results are given in the appendix. The results show that although Instant-NGP and Hash+NeuS obtain accurate novel view synthesis, their surfaces contain defects. However, Instant-NGP converges faster than our method, because we still employ the large MLPs used in NeuS.

5. CONCLUSION

We propose to explicitly encode the 3D shape surface with hierarchical volumes to assist the MLPs in neural implicit surface reconstruction methods. Spatially varying features from the high-resolution volumes help to reason about more details for each query 3D point, while features from the low-resolution volumes enforce spatial consistency and keep shapes smooth. We also design a sparse structure to reduce the memory consumption of the high-resolution volumes, and two regularization terms to enhance surface smoothness. Our hierarchical volume encoding can be appended to any implicit surface reconstruction method as a plug-and-play module to significantly boost its performance, as demonstrated on three datasets.

A APPENDIX

A.1 DATASETS

DTU (Jensen et al., 2014) is a widely used dataset for multi-view reconstruction, consisting of 124 scans of various objects. The images are obtained from an RGB camera and a structured light scanner mounted on an industrial robot arm. Each scene is captured from 49 or 64 views with a resolution of 1600 × 1200. For a fair comparison, we follow previous methods and select 15 scenes from DTU for evaluation.

BlendedMVS (Yao et al., 2020) is another dataset for multi-view reconstruction, composed of 113 scenes including architecture, sculptures, and small objects. Each scene consists of dozens to hundreds of images with a resolution of 768 × 576.

EPFL (Strecha et al., 2008) is a small dataset composed of two outdoor scenes, Fountain and Herzjesu, which contain 11 and 9 high-resolution images, respectively, together with accurate ground-truth meshes.

A.2 DETAILS OF SPARSE VOLUME STORING

We look up the values of $V_h^{valid}$ from table $T_e$ through $T_i$. Specifically, we first define a mapping function $f$:

$$f(\mathbf{x}) = x + y \cdot N + z \cdot N^2,$$

where $N$ is the resolution of the volume, and $x, y, z$ are the coordinates of the position $\mathbf{x}$. We use $f$ to map each voxel $v_j$ in $V_h^{valid}$ to an index in table $T_i$, and $T_i$ then converts it to the index $j$ in $T_e$:

$$T_i(f(\mathbf{x})) = j, \quad j = 1, \dots, n,$$

where $n$ is the number of valid voxels in $V_h^{valid}$. The length of the embedding table $T_e$ is only $n + 1$ ($n \ll N^3$), so the sparse structure greatly reduces the memory consumption of the high-resolution volumes.
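The two-table lookup can be sketched as follows. This is an illustrative version under stated assumptions: rows of $T_e$ are numbered from 0 rather than 1, the shared row for culled voxels is kept at zero, and the names `flat_index`, `build_tables`, and `fetch` are hypothetical. It relies on Python's negative indexing so that an index of -1 selects the last row, mirroring the behavior described for the index table.

```python
def flat_index(x, y, z, n_res):
    """f(x) = x + y * N + z * N^2 for integer voxel coordinates."""
    return x + y * n_res + z * n_res ** 2


def build_tables(valid_voxels, n_res, channels):
    """Create T_i (all -1, then filled for valid voxels) and T_e (n + 1 rows)."""
    index_table = [-1] * (n_res ** 3)
    embedding_table = [[0.0] * channels for _ in range(len(valid_voxels) + 1)]
    for row, (x, y, z) in enumerate(sorted(valid_voxels)):
        index_table[flat_index(x, y, z, n_res)] = row
    return index_table, embedding_table


def fetch(index_table, embedding_table, voxel, n_res):
    """Embedding of an integer voxel; index -1 selects the shared last row."""
    return embedding_table[index_table[flat_index(*voxel, n_res)]]


# Toy example: an 8^3 grid with two valid voxels and one feature channel.
N = 8
valid = {(3, 3, 3), (4, 3, 3)}
T_i, T_e = build_tables(valid, N, channels=1)
T_e[0][0] = 1.0          # learned embedding of voxel (3, 3, 3)
T_e[1][0] = 3.0          # learned embedding of voxel (4, 3, 3)

inside = fetch(T_i, T_e, (4, 3, 3), N)     # a valid voxel
outside = fetch(T_i, T_e, (0, 0, 0), N)    # a culled voxel -> last row of T_e
```

Only the integer index table grows with $N^3$ in this layout; the embedding table, which holds the learnable features, has just $n + 1$ rows.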

A.3 LIMITATIONS

(1) Our framework is memory-bound and has difficulty reconstructing large-scale scenes. This is a common problem of volume-based methods. (2) Voxels pruned at low resolution cannot be recovered at high resolution. To address this, we dilate the low-resolution surface so that usually all the useful voxels are included, but there is still a chance that some voxels are pruned by mistake.

A.4 MORE ABLATION STUDY

Layer number of MLPs. We perform experiments on the number of MLP layers in the SDF network to inspect its effect. As shown in Table 4, the mean Chamfer distance decreases as the number of MLP layers increases. NeuS (Wang et al., 2021) adopts 9 layers of MLPs and gets an error of 0.84, while the error of our method already decreases to 0.78 with only 2 layers.



Visual comparison of different encoding on DTU scan-24.


Figure 3: Method overview. In the first stage, we compute an initial result using features from volumes with resolutions from 2 to 256. In a later stage, we finalize the result using features from sparsified high-resolution volumes with a resolution of 512 or 1024.

Figure 4: A 2D toy example of the hierarchical volume encoding. Left: a single high-resolution volume. Features with 3 channels are used to encode two locations p and q. Note that it only captures spatially varying features. Right: a hierarchical volume with lower dimensionality. Features have just 1 channel and the memory consumption is much lower. The high-resolution volume encodes spatially varying features, while the low-resolution volume enforces spatial smoothness.

Figure 5: Sparse high-resolution volume. The index of -1 fetches the last embedding (in dark gray) in Te.

Figure 6: Visual comparison of the reconstructed meshes.

Figure 7: Ablation study. (a) Sparse high-res volume. The results without and with sparse high-resolution volumes are displayed in the first and second rows, respectively. The top row shows the normal maps and the bottom row shows the reconstructed meshes. (b) Regularization terms. The color image and the results without and with regularization terms are displayed. (c) NeuS+Hash vs. NeuS+Ours.

Figure 7 (b) shows a visual comparison without (middle) and with (right) the regularization terms. The reconstructed mesh is smoother and more complete with the regularization terms.

Figure 8: Ablation study about the number of the volumes and the resolution combination of volumes.

Table 1: Quantitative results on the DTU dataset.

Table 2: Quantitative results on the EPFL dataset.

Table 3: Ablation study on the DTU dataset.

Table 4: Quantitative results for different numbers of MLP layers on the DTU dataset.

Ablation study on NeuralWarp+Ours. To better inspect the effectiveness of our modules on different frameworks, we present an ablation study on the NeuralWarp (Darmon et al., 2022) framework, as shown in Table 5 and Figure 9.

A.5 RESULTS ON NORMAL CONSISTENCY

We report the normal consistency metric in Table 6. This score is obtained by first calculating the mean absolute dot product between the normals of the reconstructed mesh and the normals at the corresponding nearest neighbors in the ground-truth mesh, and then calculating the same quantity in reverse.

A.6 COMPARISON TO PLENOXELS AND INSTANT-NGP

To better see the difference between our method and Plenoxels or Instant-NGP, we perform some detailed experiments on Scan-24 of the DTU dataset (Jensen et al., 2014).

The first one is the original version of Instant-NGP with hash tables of size $2^{24}$. The batch size (the number of sampled points in one iteration) is set to 65536 for a fair comparison of efficiency. Other parameters are kept the same as in the original code, while the images and camera poses are kept the same as in our experiments. The result of this experiment can be seen in Figure 11, where the accuracy of novel view synthesis is high but the surface is noisy.

The second one is NeuS+Hash, where we add hash tables with the same settings as in the Instant-NGP experiment to NeuS. We add the feature embedding from the hash tables to NeuS while keeping the other parameters of NeuS unchanged. The result of this experiment can be seen in Figure 11, where NeuS+Hash obtains a high-quality surface in most regions, but there are still some defects on the surface, especially in the less-seen regions, which are at a disadvantage when hash collisions occur.

The third one is NeuS+Plenoxels, where we add a single-volume encoding of resolution $256^3$ to NeuS. We still keep 2 layers of MLPs, because the result degenerates without any MLPs. We also tried the original version of Plenoxels, which performs badly on this dataset. As the results displayed in Figure 10 show, the output of NeuS+Plenoxels is noisy even though the image quality is high.

However, the convergence speed of Instant-NGP is faster than ours. This is because we still use the large MLPs of NeuS, which require more time to optimize. In addition, hash tables are memory efficient, so our method requires about three times more memory than Instant-NGP or NeuS+Hash in these experiments.

Under review as a conference paper at ICLR 2023

A.7 EVALUATION OF NOVEL VIEW SYNTHESIS

To evaluate the effectiveness of our method in novel view synthesis, we perform experiments and report the results in Table 7, Table 8, and Figure 12. We select one of every 7 images from the original DTU image set as test views, and use the remaining images as training views. The accuracy of novel view synthesis on the 15 scans of DTU is reported in Table 7, while the PSNR with respect to the iteration number is reported in Table 8. From the results in these two tables and the images displayed in Figure 12, we can see that both the accuracy and the convergence speed of novel view synthesis benefit from our hierarchical volume encoding.

