3D FORMER: MONOCULAR SCENE RECONSTRUCTION WITH 3D SDF TRANSFORMERS

Abstract

Monocular scene reconstruction from posed images is challenging due to the complexity of a large environment. Recent volumetric methods learn to directly predict the TSDF volume and have demonstrated promising results in this task. However, most methods focus on how to extract and fuse the 2D features into a 3D feature volume, but none of them improve the way the 3D volume is aggregated. In this work, we propose an SDF transformer network, which replaces the role of the 3D CNN for better 3D feature aggregation. To reduce the explosive computation complexity of 3D multi-head attention, we propose a sparse window attention module, where attention is calculated only between the non-empty voxels within a local window. Then a top-down-bottom-up 3D attention network is built for 3D feature aggregation, where a dilate-attention structure is proposed to prevent geometry degeneration, and two global modules are employed to provide global receptive fields. Experiments on multiple datasets show that this 3D transformer network generates a more accurate and complete reconstruction, outperforming previous methods by a large margin. Remarkably, the mesh accuracy is improved by 41.8% and the mesh completeness by 25.3% on the ScanNet dataset.

1. INTRODUCTION

Monocular 3D reconstruction is a classical task in computer vision and is essential for numerous applications like autonomous navigation, robotics, and augmented/virtual reality. This vision task aims to reconstruct an accurate and complete dense 3D shape of an unstructured scene from only a sequence of monocular RGB images. While the camera poses can be estimated accurately with state-of-the-art SLAM (Campos et al., 2021) or SfM systems (Schonberger & Frahm, 2016), dense 3D scene reconstruction from these posed images is still a challenging problem due to the complex geometry of a large-scale environment, such as the various objects, flexible lighting, reflective surfaces, and diverse cameras with different focus, distortion, and sensor noise. Many previous methods reconstruct the scene in a multi-view depth manner (Yao et al., 2018; Chen et al., 2019; Duzceker et al., 2021). They predict the dense depth map of each target frame, which can capture accurate local geometry but requires additional effort to fuse these depth maps (Murez et al., 2020; Sun et al., 2021), e.g., resolving the inconsistencies between different views. Recently, some methods have tried to directly regress the complete 3D surface of the entire scene (Murez et al., 2020; Sun et al., 2021) with a truncated signed distance function (TSDF) representation. They first extract 2D features with 2D convolutional neural networks (CNNs), and then back-project the features into 3D space. Afterward, the 3D feature volume is processed by a 3D CNN to output a TSDF volume prediction, from which a surface mesh is extracted by marching cubes (Lorensen & Cline, 1987). This way of reconstruction is end-to-end trainable, and has been demonstrated to output accurate, coherent, and complete meshes. In this paper, we follow this volume-based 3D reconstruction path and directly regress the TSDF volume.
Inspired by recent successes of vision transformers (Vaswani et al., 2017; Dosovitskiy et al., 2020), some approaches (Bozic et al., 2021; Stier et al., 2021) have adopted this structure in 3D reconstruction, but their usage is limited to fusing the 2D features from different views, while the aggregation of the 3D feature volume is still performed by a 3D CNN. In this paper, we argue that the aggregation of the 3D feature volume is also critical, and that evolving from a 3D CNN to 3D multi-head attention can further improve both the accuracy and completeness of the reconstruction. The limited usage of 3D multi-head attention in 3D feature volume aggregation is mainly due to its explosive computation: the attention between each voxel and every other voxel needs to be calculated, which is hard to realize on a general computing platform. This is also why there are only a few applications of 3D transformers to 3D tasks. In this work, to address these challenges and make the 3D transformer practical for 3D scene reconstruction, we propose a sparse window multi-head attention structure. Inspired by sparse CNNs (Yan et al., 2018), we first sparsify the 3D feature volume with a predicted occupancy, which reduces the number of voxels to only the occupied ones. Then, to compute the attention of a target voxel, we define a local window centered on this voxel, within which only the non-empty voxels are considered for the attention computation. In this way, the computation complexity of the 3D multi-head attention is reduced by orders of magnitude, and this module can be embedded into a network for 3D feature aggregation. With this module, we build the first 3D-transformer-based top-down-bottom-up network, where a dilate-attention module and its inverse are used to downsample and upsample the 3D feature volume.
In addition, to make up for the local receptive field of the sparse window attention, we add a global attention module and a global context module at the bottom of this network, since the size of the volume is very small at the bottom level. With this network, the 3D shape is estimated in a coarse-to-fine manner over three levels, as displayed in Figure 1. To the best of our knowledge, this is the first paper employing a 3D transformer for 3D scene reconstruction from a TSDF representation. In the experiments, our method is demonstrated to outperform previous methods by a significant margin on multiple datasets. Specifically, the accuracy metric of the mesh on the ScanNet dataset is reduced by 41.8%, from 0.055 to 0.032, and the completeness metric is reduced by 25.3%, from 0.083 to 0.062. In the qualitative results, the meshes reconstructed by our method are dense, accurate, and complete. The main contributions of this work are summarized as follows:
• We propose a sparse window multi-head attention module, with which the computation complexity of the 3D transformer is reduced significantly and becomes feasible.
• We propose a dilate-attention structure to avoid geometry degeneration in downsampling, with which we build the first top-down-bottom-up 3D transformer network for 3D feature aggregation. This network is further improved with bottom-level global attention and global context encoding.
• This 3D transformer is employed to aggregate the 3D features back-projected from the 2D features of an image sequence in a coarse-to-fine manner, and to predict TSDF values for accurate and complete 3D reconstruction. This framework shows a significant improvement on multiple datasets.

2. RELATED WORK

Depth-based 3D Reconstruction. In traditional methods, reconstructing a 3D model of a scene usually involves estimating the depth of a series of images, and then fusing these depths into a 3D data structure (Schönberger et al., 2016). With the rise of deep learning, many works have tried to estimate accurate and dense depth maps with deep neural networks (Yao et al., 2018; Wang & Shen, 2018; Chen et al., 2019; Im et al., 2019; Yuan et al., 2021; 2022; Long et al., 2021). They usually estimate the depth map of the reference image by constructing a 3D cost volume from several frames in a local window. Also, to leverage the information in the image sequence, some methods propagate messages from previously predicted depths using probabilistic filtering (Liu et al., 2019), Gaussian processes (Hou et al., 2019a), or recurrent neural networks (Duzceker et al., 2021). Although the predicted depth maps are increasingly accurate, there is still a gap between these single-view depths and the complete 3D shape. Post mesh generation methods like Poisson reconstruction (Kazhdan & Hoppe, 2013), Delaunay triangulation (Labatut et al., 2009), and TSDF fusion (Newcombe et al., 2011) have been proposed to bridge this gap, but the inconsistency between different views remains a challenge.

Volume-based 3D Reconstruction. To avoid depth estimation and fusion in 3D reconstruction, some methods directly regress a volumetric data structure end-to-end. SurfaceNet (Ji et al., 2017) encodes the camera parameters together with the images to predict a 3D surface occupancy volume with 3D convolutional networks. Afterward, Atlas (Murez et al., 2020) back-projects the 2D features of all images into a 3D feature volume with the estimated camera poses, and then feeds this 3D volume into a 3D U-Net to predict a TSDF volume.
NeuralRecon (Sun et al., 2021) then improves the efficiency by performing this within a local window and fusing the predictions together with a GRU module. Recently, to improve the accuracy of the reconstruction, some methods introduce transformers to fuse 2D features from different views (Bozic et al., 2021; Stier et al., 2021). However, their transformers are all limited to 2D space and used to process 2D features, which is indirect for the 3D reconstruction task. There are also methods for object-level 3D shape prediction, which can infer the 3D shape of an object from only a few views (Xie et al., 2020; Wang et al., 2021a), but these networks can only infer the shape of one category of small objects. Lately, some works represent the 3D shape with an implicit network and optimize this implicit representation by neural rendering (Yariv et al., 2020; Wang et al., 2021b; Yariv et al., 2021). These methods can obtain a fine surface of an object through iterative optimization, but at the cost of a long reconstruction time.

Transformers in 3D Vision. The transformer structure (Vaswani et al., 2017) has attracted a lot of attention and achieved many successes in vision tasks (Dosovitskiy et al., 2020; Liu et al., 2021). Most of these, nevertheless, are used for 2D feature extraction and aggregation. Even in 2D feature processing, the computation complexity is already quite high, so many works aim to reduce the resource consumption (Dosovitskiy et al., 2020; Liu et al., 2021). Directly extending the transformer from 2D to 3D would cause catastrophic computation.
Thus most works only carefully perform resource-saving feature extraction, e.g., one-off straightforward feature mapping without any downsampling or upsampling (Wang et al., 2021a), where the size of the feature volume remains unchanged, or top-down tasks with only downsampling (Mao et al., 2021), where the size of the feature volume is reduced gradually. In 3D reconstruction, however, a top-down-bottom-up structure is more reasonable for feature extraction and shape generation, as in most of the 3D-CNN-based structures (Murez et al., 2020; Sun et al., 2021; Stier et al., 2021). So in this work, we design the first 3D-transformer-based top-down-bottom-up structure for improving the quality of 3D reconstruction. In addition, a sparse window multi-head attention mechanism is proposed to save computation. Although a purely sparse structure can handle highly sparse data, like object detection on Lidar points (Mao et al., 2021), it is not suitable for processing relatively dense data, like the mesh of an indoor scene. Therefore, a sparse window structure is needed in 3D scene reconstruction, where a dense surface within a window can be sufficiently aggregated.

3. METHOD

3.1. OVERVIEW

The overall framework of our method is illustrated in Figure 1. Given a sequence of images {I_i}_{i=1}^N of a scene and the corresponding camera intrinsics {K_i}_{i=1}^N and extrinsics {P_i}_{i=1}^N, we first extract the image features {F_i}_{i=1}^N in 2D space at three levels, and then back-project these 2D features to 3D space, where they are fused into three feature volumes at the coarse, medium, and fine levels, respectively. Afterward, these three feature volumes are aggregated by our SDF 3D transformer in a coarse-to-fine manner. At the coarse and medium levels, the outputs of the 3D transformer are two occupancy volumes O_2 and O_1, while at the fine level the output is the predicted TSDF volume S_0. The coarse occupancy volume O_2 and the medium occupancy volume O_1 store the occupancy values o ∈ [0, 1] of the voxels, which are used to sparsify the next finer level. Therefore, the feature volumes can be processed sparsely to reduce the computation complexity. Finally, the predicted mesh is extracted from the TSDF volume S_0 using marching cubes (Lorensen & Cline, 1987).

3.2. FEATURE VOLUME CONSTRUCTION

The 2D features {F_i^l}_{i=1}^N at three levels l = 0, 1, 2 are extracted by a feature pyramid network (Lin et al., 2017) with MnasNet-B1 (Tan et al., 2019) as the backbone. The resolutions of the features at these three levels are 1/4, 1/8, and 1/16, respectively. Then, following Murez et al. (2020), we back-project the 2D features to 3D space with the camera parameters {K_i}_{i=1}^N and {P_i}_{i=1}^N, generating 3D feature volumes {V_i^l}_{i=1}^N of size N_X × N_Y × N_Z. In previous work, the fusion of these feature volumes from different views is usually computed by taking the average (Murez et al., 2020; Sun et al., 2021). However, the back-projected features from different views contribute differently to the 3D shape, e.g., a view with a bad viewing angle, or voxels far from the surface. Therefore, a weighted average is more reasonable than a plain average. To compute these weights, for each voxel we calculate the variance of the features across views as Var_i^l = (V_i^l - V̄^l)^2, where V̄^l is the average of the features of all views. Then we feed the features and the variance into a small MLP to calculate the weights W_i, which are used to compute a weighted average of the features from different views as V_w^l = (1/N) Σ_i V_i^l × SoftMax(W_i), where × denotes element-wise multiplication. Inspired by Yao et al. (2018), we also calculate the total variance over all feature volumes and concatenate it with the weighted average to form the final feature volume V^l = {V_w^l, (1/N) Σ_i Var_i^l}.
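The variance-guided fusion above can be sketched as follows. This is a minimal NumPy sketch, not the released implementation: `weight_mlp` stands in for the paper's small MLP, and we let SoftMax alone normalize over views (omitting the extra 1/N factor) for clarity.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_views(view_feats, weight_mlp):
    """Fuse per-view back-projected feature volumes into a single volume.

    view_feats: (N, C, X, Y, Z) array of features from N views.
    weight_mlp: callable (features, variance) -> (N, 1, X, Y, Z) weights;
                a stand-in for the learned weight head.
    Returns a (2C, X, Y, Z) volume: the weighted average concatenated with
    the mean per-voxel variance across views.
    """
    mean = view_feats.mean(axis=0, keepdims=True)        # average over views
    var = (view_feats - mean) ** 2                       # per-view variance
    w = softmax(weight_mlp(view_feats, var), axis=0)     # normalize over views
    fused = (view_feats * w).sum(axis=0)                 # weighted average
    total_var = var.mean(axis=0)                         # total variance term
    return np.concatenate([fused, total_var], axis=0)
```

With a constant weight head, SoftMax yields uniform weights and the fusion reduces to the plain average used in prior work.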

3.3. SPARSE WINDOW MULTI-HEAD ATTENTION

The multi-head attention structure has been shown to be effective in many vision tasks (Dosovitskiy et al., 2020; Liu et al., 2021). Most of these, however, are limited to 2D rather than 3D feature processing, because the computation complexity of multi-head attention is usually higher than that of convolutional networks, a problem that is further amplified for 3D features. For a 3D feature volume, the attention between each voxel and every other voxel needs to be computed, i.e., N_X × N_Y × N_Z attentions for one voxel and (N_X × N_Y × N_Z)^2 attentions for all voxels, which is extremely large and hard to realize on regular GPUs. To deal with this problem and make the multi-head attention of 3D volumes feasible, we propose a sparse window structure for the attention computation. As displayed in Figure 1, at the medium and fine levels, we sparsify the volumes using the occupancy predictions O_2, O_1, and only compute the attention of the non-empty voxels. In addition, considering that nearby voxels contribute more to the shape of the current voxel than distant ones, we only calculate the attention within a local window around each voxel, as shown in Figure 2. Therefore, we only calculate the multi-head attention of the occupied voxels within a small window, which reduces the computation complexity significantly. Specifically, for any non-empty voxel v_i in the feature volume V, we first search for all non-empty voxels within an n × n × n window centered on this voxel and obtain the neighbor voxels {v_j, j ∈ Ω(i)}. Then the query, key, and value embeddings are calculated as Q_i = L_q(V(v_i)), K_j = L_k(V(v_j)), V_j = L_v(V(v_j)), where L_q, L_k, L_v are linear projection layers. For the position embedding P, we want to remove the influence of the scale of the 3D world coordinates.
Hence we compute it from the relative voxel position in the volume rather than from the real-world coordinates (Mao et al., 2021), as P_j = L_p(v_j - v_i). The attention is then calculated as

Attention(v_i) = Σ_{j ∈ Ω(i)} SoftMax(Q_i(K_j + P_j)/√d)(V_j + P_j).

In this case, the computation complexity is reduced from O_{3D-Attn} = (N_X × N_Y × N_Z)^2 × O(ij) to O_{SW-3D-Attn} = N_occu × n_occu × O(ij), where O(ij) is the complexity of one attention computation between voxels v_i and v_j, N_occu is the number of occupied voxels in the volume, and n_occu is the number of occupied voxels within the local window. Assuming that the occupancy rate of the volume is 10% and the window size is 1/10 of the volume size, the computation complexity of the sparse window attention is only (n^3/10) / (10 N_X N_Y N_Z) = 1/100000 of that of dense 3D attention.
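The sparse window attention can be sketched as a single-head NumPy toy. All names here are illustrative: random matrices stand in for the learned projections L_q, L_k, L_v and the position layer L_p, and the occupied voxels are given as an explicit coordinate list rather than a sparse tensor.

```python
import numpy as np

def sparse_window_attention(feats, coords, window=3, d=8, rng=None):
    """Single-head sketch of the sparse window attention.

    feats:  (M, d) features of the M occupied voxels only.
    coords: (M, 3) integer voxel coordinates of those voxels.
    For each voxel, attention is computed over the occupied voxels whose
    coordinates fall inside a window^3 box centered on it.
    """
    rng = rng or np.random.default_rng(0)
    # Random stand-ins for the learned linear layers L_q, L_k, L_v, L_p.
    Lq, Lk, Lv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    Lp = rng.standard_normal((3, d)) * 0.1
    half = window // 2
    out = np.zeros_like(feats)
    for i, ci in enumerate(coords):
        # Neighbors Ω(i): occupied voxels inside the local window (incl. self).
        idx = np.nonzero(np.all(np.abs(coords - ci) <= half, axis=1))[0]
        q = feats[i] @ Lq                       # Q_i
        k = feats[idx] @ Lk                     # K_j
        v = feats[idx] @ Lv                     # V_j
        p = (coords[idx] - ci) @ Lp             # P_j from relative position
        logits = (k + p) @ q / np.sqrt(d)
        a = np.exp(logits - logits.max())
        a /= a.sum()                            # SoftMax over Ω(i)
        out[i] = a @ (v + p)
    return out
```

An isolated voxel attends only to itself, in which case the relative position embedding vanishes and the output is just its value projection.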

3.4. SDF 3D TRANSFORMER

Limited by the high resource consumption of multi-head attention, most previous works related to 3D transformers only carefully perform resource-saving feature processing, e.g., one-off straightforward feature mapping without any downsampling or upsampling (Wang et al., 2021a), where the size of the feature volume remains unchanged, or top-down tasks with only downsampling (Mao et al., 2021), where the size of the feature volume is reduced gradually. In 3D reconstruction, however, a top-down-bottom-up structure is more reasonable for feature extraction and prediction generation, as in most of the 3D-CNN-based structures (Murez et al., 2020; Sun et al., 2021; Stier et al., 2021). So in this work, we design the first 3D-transformer-based top-down-bottom-up structure, as shown in Figure 3. Taking the network for the fine volume (V_0 in Figure 1) as an example, there are four feature levels in total, i.e., 1/2, 1/4, 1/8, 1/16, as shown in Figure 3. In the encoder part, at each level, a combination of downsampling and dilate-attention is proposed to downsample the feature volume. Then two blocks of the sparse window multi-head attention are used to aggregate the feature volumes. At the bottom level, a global attention block is employed to make up for the small receptive field of the window attention, and a global context encoding block is utilized to extract global information. In the decoder part, we use the inverse sparse 3D CNN to upsample the feature volume, i.e., we store the mapping of the down flow and restore the spatial structure by inverting the sparse 3D CNN in the dilate-attention. Therefore, the final shape after the up flow is the same as that of the input. Similar to FPN (Lin et al., 2017), the features in the down flow are also added to the upsampled features at the corresponding level. To enable the deformation ability, a post-dilate-attention block is appended after the down-up flow.
Finally, a submanifold 3D CNN head with Tanh activation is appended to output the TSDF prediction. For the coarse volume V_2 and the medium volume V_1, two- and three-level versions of a similar structure with Sigmoid activation are adopted.

Dilate-attention. Direct downsampling of a sparse structure is prone to losing the geometry structure. To deal with this, between each level we first downsample the feature volume, and then dilate the volume with a sparse 3D CNN with a kernel size of 3, which computes an output if any voxel within its kernel is non-empty. The dilation operation alone may also harm the geometry, since it may add some wrong voxels to the sparse structure. Thus we calculate the sparse window attention of the dilated voxels, so that voxels far from the surface get low scores and do not contribute to the final shape. The dilated voxels are then joined to the downsampled volume by concatenating the voxels together. With this dilate-attention module, the 3D shape is prevented from collapsing; without it, the network performs badly and only generates a degraded shape.

Global attention and global context encoding. Since the attention blocks in the top-down flow are all local-window based, they lack a global receptive field. Considering that the resolution of the bottom level is not high, we equip the bottom level with a global attention block, i.e., we calculate the attention between each non-empty voxel and every other non-empty voxel in the volume. This builds the long-range dependency missing from the sparse window attention blocks. In addition, we use multi-scale global average pooling (Zhao et al., 2017) with scales 1, 2, 3 to extract a global context code of the scene. This encoding module aggregates global information and can account for the illumination, global texture, and global geometry style.
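The coordinate bookkeeping of the dilate-attention can be sketched as follows. This is a simplified sketch: only the sparse downsample and the 3×3×3 dilation of voxel coordinates are shown, while the attention-based scoring that suppresses spurious dilated voxels is left out.

```python
import numpy as np
from itertools import product

def downsample_and_dilate(coords, stride=2):
    """Coordinate part of the dilate-attention.

    coords: (M, 3) integer coordinates of the occupied voxels.
    Downsample the sparse voxel set by `stride`, then dilate it with a
    3x3x3 kernel so thin structures survive the downsampling.  The sparse
    window attention (not shown) then scores the dilated voxels so that
    ones far from the surface get low weight.
    """
    down = np.unique(coords // stride, axis=0)               # sparse downsample
    offsets = np.array(list(product((-1, 0, 1), repeat=3)))  # 3^3 kernel
    dilated = np.unique((down[:, None, :] + offsets).reshape(-1, 3), axis=0)
    return down, dilated
```

A single occupied voxel thus maps to one coarse voxel and its 27-voxel dilated neighborhood, which is what prevents an isolated surface fragment from vanishing after downsampling.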

3.5. LOSS FUNCTION

The final TSDF prediction S_0 is supervised by the log-L1 distance between the prediction and the ground truth: L_0 = |log S_0 - log S|, where S is the ground-truth TSDF. To supervise the occupancy predictions O_2, O_1 at the coarse and medium levels, we generate occupancy volumes from the TSDF values. Specifically, voxels with TSDF values within [-1, 1] are regarded as occupied and set to 1, otherwise set to 0. Then a binary cross-entropy loss is calculated between the prediction and the ground truth: L_l = -Ō_l log O_l, l = 1, 2, where Ō_l denotes the ground-truth occupancy. To supervise the averaging weights W_i^l, we use the occupancy in the back-projection following Stier et al. (2021). Intuitively, when a feature is back-projected from a 2D image into 3D space along the camera ray at multiple depth values, we want the voxels close to the mesh surface to have larger weights in the fusion. Therefore, a 3D position is regarded as occupied if the difference between the projected depth and the true depth from the depth map is smaller than the TSDF truncation distance. The cross-entropy loss is then applied to the weights and this occupancy: L_w^l = -O_i^l log σ(W_i^l), l = 1, 2, 3, where σ denotes the Sigmoid function and O_i^l is the ground-truth occupancy in the back-projection of image I_i.
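The TSDF and occupancy losses can be sketched as follows. Note two assumptions in this sketch: the paper writes only the -O log O term, so we use the standard two-term binary cross-entropy, and the exact handling of the TSDF sign inside the logarithm is not specified, so we compute the log-L1 term on magnitudes.

```python
import numpy as np

def tsdf_log_l1(pred, gt, eps=1e-4):
    """Log-L1 TSDF loss |log S_0 - log S|, computed on magnitudes;
    the sign handling is an assumption of this sketch."""
    return np.abs(np.log(np.abs(pred) + eps) - np.log(np.abs(gt) + eps)).mean()

def occupancy_bce(pred_occ, gt_tsdf):
    """Binary cross-entropy between the predicted occupancy and the
    occupancy derived from the ground-truth TSDF (occupied iff the
    TSDF value lies in [-1, 1])."""
    gt_occ = (np.abs(gt_tsdf) <= 1.0).astype(float)
    p = np.clip(pred_occ, 1e-7, 1 - 1e-7)
    return -(gt_occ * np.log(p) + (1 - gt_occ) * np.log(1 - p)).mean()
```

The clipping of the predicted occupancy avoids log(0) for saturated predictions, a standard numerical guard.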

4. EXPERIMENTS

4.1. EXPERIMENT SETUP

Our work is implemented in PyTorch and trained on Nvidia V100 GPUs. The network is optimized with the Adam optimizer (β_1 = 0.9, β_2 = 0.999) with a learning rate of 1 × 10^-4. For a fair comparison with previous methods, the voxel size of the fine level is set to 4 cm, and the TSDF truncation distance is set to three times the voxel size. The voxel sizes of the medium and coarse levels are thus 8 cm and 16 cm, respectively. To balance efficiency and receptive field, the window size of the sparse window attention is set to 10. For the view selection, we first follow Hou et al. (2019b) to remove redundant views, i.e., a new incoming frame is added to the system only if its relative translation is greater than 0.1 m and the relative rotation angle is greater than 15 degrees. Then, if the number of the remaining views exceeds the upper limit, a random selection is adopted for memory efficiency. The view limit is set to 20 in training, which means twenty images are input to the network for one iteration, while the limit for testing is set to 150. Our framework runs at an online speed of 75 FPS for the keyframes. Detailed efficiency experiments are reported in the supplemental materials.

ScanNet (Dai et al., 2017) is a large-scale indoor dataset composed of 1613 RGB-D videos of 806 indoor scenes. We follow the official train/test split, with 1513 scans used for training and 100 scans for testing. TUM-RGBD (Sturm et al., 2012) and ICL-NUIM (Handa et al., 2014) are two further datasets composed of RGB-D videos, but with a small number of scenes. Therefore, following previous methods (Stier et al., 2021), we only perform a generalization evaluation on these two datasets with the model trained on ScanNet, using 13 scenes of TUM-RGBD and 8 scenes of ICL-NUIM.
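The keyframe selection rule can be sketched as follows. This sketch follows the conjunctive "and" of the two thresholds exactly as stated, and the random subsampling down to the view limit is illustrative.

```python
import numpy as np

def select_keyframes(poses, t_thresh=0.1, r_thresh=15.0, limit=20, rng=None):
    """Keyframe selection sketch: keep a frame if it has moved far enough
    from the last kept frame, then randomly subsample down to `limit`.

    poses: list of (t, R) with t a 3-vector and R a 3x3 rotation matrix.
    """
    rng = rng or np.random.default_rng(0)
    kept = [0]
    for i in range(1, len(poses)):
        t0, R0 = poses[kept[-1]]
        t1, R1 = poses[i]
        dt = np.linalg.norm(t1 - t0)
        # Relative rotation angle from the trace of R0^T R1.
        cos = np.clip((np.trace(R0.T @ R1) - 1.0) / 2.0, -1.0, 1.0)
        dr = np.degrees(np.arccos(cos))
        if dt > t_thresh and dr > r_thresh:
            kept.append(i)
    if len(kept) > limit:
        kept = sorted(rng.choice(kept, size=limit, replace=False))
    return kept
```

A static camera yields a single keyframe, while a trajectory with 0.2 m steps and 20-degree turns keeps every frame.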

4.2. EVALUATION

To compare with previous methods, we evaluate the proposed method on the ScanNet test set. The quantitative results are presented in Table 1 and the qualitative comparison is displayed in Figure 5. We first directly evaluate the reconstructed meshes against the ground-truth meshes, and obtain a significant improvement over previous methods, from F-score = 0.641 to F-score = 0.705, as shown in Table 1. Then, following Bozic et al. (2021), we apply the same occlusion mask at evaluation to avoid penalizing a more complete reconstruction: the ground-truth meshes are incomplete due to unobserved and occluded regions, while our method reconstructs a more complete 3D shape, as shown in Figure 5. This results in a more reasonable evaluation, reported in the second part of Table 1. The improvement is further enlarged, from F-score = 0.655 to F-score = 0.754 compared to the previous best method. The accuracy error is decreased from 0.055 m to 0.032 m, which is almost half (41.8%) that of the previous best method, while the completeness error is decreased by 25.3%, from 0.083 m to 0.062 m. This owes to the feature aggregation ability of the proposed 3D SDF transformer, which can predict a more accurate 3D shape. This is also demonstrated in the generalization experiments on the ICL-NUIM and TUM-RGBD datasets, as shown in Table 3. After evaluating the reconstructed meshes, we also evaluate the depth accuracy of our method. Since our method does not predict depth maps explicitly, we render the predicted 3D shape to the image planes to obtain depth maps, following previous methods (Murez et al., 2020). The results are shown in Table 2, from which we can see that our method reduces the error considerably compared with previous methods. The relative error is reduced by 16.4%, from 0.061 to 0.051. The accuracy of the depth maps also demonstrates the accurate feature analysis ability of the proposed 3D SDF transformer.
From the qualitative visualization in Figure 5, we can see that our method predicts a complete and accurate 3D shape. Previous methods that can recover a complete mesh usually reconstruct a smooth 3D shape while losing some details (Murez et al., 2020). In contrast, our method predicts a more complete mesh than the ground truth, while the details of the 3D shapes are better recovered. Please note that for a fair comparison the voxel size is set to 4 cm, so it is hard to reconstruct geometry details smaller than 4 cm.

SDF transformer. To verify the effectiveness of the proposed SDF transformer, we first build a baseline model with the same structure as Figure 1, but with the 3D SDF transformer replaced by a U-Net structure of 3D CNNs. Adding the variance fusion improves the mesh in some cluttered areas and slightly increases the performance. Then we add a base version of the SDF transformer, which does not include the global module and the post-dilate-attention module. The performance is significantly improved with this module, as shown in Table 4 and Figure 4. The reconstructed meshes possess much more geometry detail than the baseline.

4.3. ABLATION STUDY

Global module. We next add the global module, including the bottom-level global attention and the global context code. The sparse window attention block can only obtain long-range dependency within a local window, so it may struggle when it cannot get enough information within this window, e.g., in texture-free regions. The global module can also reason about global information like the illumination and the texture style.

Dilate attention. The dilate-attention module is crucial in the SDF transformer, so we cannot remove all the dilate-attention blocks; that would destroy the whole framework and generate a degraded 3D shape. Therefore, we only ablate the post-dilate-attention block after the down-up flow. This block can deform the shape and make it more complete, e.g., filling in the crack shown in Figure 4. From the quantitative results in Table 4, we can also see the improvement in completeness.

Window size. As shown in Table 4, we study the impact of the window size of the attention. As expected, a larger window size generates a better result, since the range of the dependency is longer, but at the cost of more resource consumption. We choose 10 as the default size, considering that the performance improvement is minor beyond that.

A APPENDIX

A.1 MORE DETAILS

Volume sparsify. The ground-truth occupancy volumes are generated from the ground-truth TSDF volumes. The voxels with TSDF values in [-1, 1] are regarded as occupied and set to 1, otherwise set to 0. In training and inference, after the occupancy volume is predicted at the coarser level, the voxels with occupancy values less than 0.5 are regarded as empty and discarded, while the remaining voxels are regarded as occupied and transmitted to the next level. The sparse volume is stored in a hash table, where the key is the hash value of the voxel and the value stores the corresponding feature. At the coarsest level, all voxels are regarded as non-empty and stored in the hash table, which does not consume much memory because the volume is small at that level.

Training and inference. Training and inference are performed in a similar way. For a given sequence of images, a view selection is first performed to select images with translation greater than 0.1 m and rotation greater than 15 degrees. Then a random selection is adopted from the remaining images if their number exceeds the upper limit. These images are fed to the 2D backbone for feature extraction, after which the features are fused into a 3D volume and fed to the 3D part to produce the TSDF volume prediction. The final mesh is extracted by marching cubes from the TSDF volume. This process is the same as in previous methods like Atlas, TransformerFusion, or VoRTX. As for the upper limit on the number of images, any sequence length works for our framework, although more images lead to a better reconstruction of a scene. In our experiments, the limit is set to 20 for training and 150 for inference.
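The sparsification step can be sketched as follows. This is a toy version: a Python dict plays the role of the hash table, and the names are illustrative.

```python
import numpy as np

def sparsify(occ, feats, threshold=0.5):
    """Turn a dense occupancy prediction into a table {voxel -> feature}.

    occ:   (X, Y, Z) occupancy values in [0, 1] predicted at the coarser level.
    feats: (X, Y, Z, C) feature volume on the same grid.
    Voxels below `threshold` are discarded; each surviving coarse voxel is
    expanded to its 2^3 children in the twice-finer grid of the next level.
    """
    table = {}
    for x, y, z in zip(*np.nonzero(occ >= threshold)):
        table[(int(x), int(y), int(z))] = feats[x, y, z]
    fine = {(2 * x + dx, 2 * y + dy, 2 * z + dz)
            for (x, y, z) in table
            for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)}
    return table, fine
```

One occupied coarse voxel thus activates exactly eight candidate voxels at the next finer level, which is what keeps the finer volumes sparse.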

A.2 EFFICIENCY

The runtime analysis is presented in Table 5. For a fair comparison to previous methods, the time is tested on a chunk of size 1.5 × 1.5 × 1.5 m³ with an Nvidia RTX 3090 GPU. Our framework consists of two parts: the per-frame part, including the feature extraction of the 2D images, and the per-scene part, including the feature fusion, 3D feature processing, and mesh generation. The per-frame model runs for every keyframe, i.e., it keeps running whenever a new keyframe comes. In contrast, the per-scene model runs only once for generating a mesh reconstruction of a scene, i.e., it only works after all frames are fed, or whenever we need to output a mesh. Therefore, the online speed of a normal run is 75 FPS, with the mesh generation performed only once at the end.

A.3 METRICS

The definitions of the 2D metrics and 3D metrics used for evaluation are explained in Table 6.

A.4 LIMITATIONS

Due to the volume representation, our framework is limited by the trade-off between the resolution of the volume and the memory consumption. A smaller voxel size costs much more memory. Since the voxel size is set to 4 cm, geometry details smaller than 4 cm are hard to recover.

A.5 ROBUSTNESS TO THE POSE NOISE

Our method relies on given accurate camera poses, the same as previous state-of-the-art methods like Atlas (Murez et al., 2020), NeuralRecon (Sun et al., 2021), and TransformerFusion (Bozic et al., 2021), where the camera poses are obtained by standard SfM or SLAM systems. To inspect the robustness of our method to pose errors, we add Gaussian noise to the camera poses. A translation noise [N_x, N_y, N_z] with N ~ Gauss(0, σ_T) is added to the translation of the pose, while a rotation noise [N_roll, N_pitch, N_yaw] with N ~ Gauss(0, σ_R) is added to the three angles of the pose. The metrics following NeuralRecon (Sun et al., 2021) are reported in Table 7. From the results, we can see that our system can handle some translation error but cannot handle rotation errors well. However, if the poses of only some frames are miscalculated, e.g., 10% of all frames, the performance decrease stays under control.

As expected, a smaller voxel size leads to a more accurate reconstruction but consumes much more GPU memory. We have trained models with voxel sizes of 2 cm and 3 cm, but it is hard to evaluate them on the large scenes of the ScanNet test set, because the 2 cm model requires too much GPU memory. Thus we only compare them on a medium-sized scene, i.e., Scene-709, as reported in Table 8. The per-frame time remains unchanged while the per-scene time increases.
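The noise injection can be sketched as follows. This is a sketch: the exact noise parameterization is not specified in the text, so the roll/pitch/yaw composition order here is an assumption.

```python
import numpy as np

def perturb_pose(R, t, sigma_t=0.01, sigma_r_deg=1.0, rng=None):
    """Add Gaussian noise to a camera pose for the robustness study.

    Translation: t + N(0, sigma_t) per axis.  Rotation: compose R with
    small rotations of N(0, sigma_r_deg) degrees about the three axes.
    """
    rng = rng or np.random.default_rng(0)
    t_noisy = t + rng.normal(0.0, sigma_t, size=3)
    angles = np.radians(rng.normal(0.0, sigma_r_deg, size=3))
    cx, cy, cz = np.cos(angles)
    sx, sy, sz = np.sin(angles)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])   # roll
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])   # pitch
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])   # yaw
    return Rz @ Ry @ Rx @ R, t_noisy
```

The perturbed matrix remains a valid rotation, since it is a product of rotations.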



CONCLUSION

We propose the first top-down-bottom-up 3D transformer for 3D scene reconstruction. A sparse window attention module is proposed to reduce the computation, a dilate-attention module is proposed to avoid geometry degeneration, and a global module at the bottom level is employed to extract the global information. This structure can aggregate any 3D feature volume, so it could be applied to more 3D tasks in the future, such as 3D segmentation.



Figure 1: The overview of the 3D reconstruction framework. The input images are extracted to features by a 2D backbone network, then the 2D features are back-projected and fused into 3D feature volumes, which are aggregated by our 3D SDF transformer to generate the reconstruction in a coarse-to-fine manner.

Figure 2: (a) Illustration of the sparse window attention. For calculating the attention of the current voxel (in orange), we first sparsify the volume using the occupancy prediction from the coarser level, and then search for the occupied voxels (in dark blue) within a small window. The attention is hence computed based only on these neighboring occupied voxels. (b) Illustration of the dilate-attention in a 2D slice. We dilate the occupied voxels and calculate the attention of these dilated voxels (in yellow) to maintain the geometry structure.
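The per-voxel attention in (a) can be sketched, much simplified, as follows (dense NumPy arrays and a single head stand in for the actual sparse multi-head implementation; using the features directly as query, key, and value is our simplification):

```python
import numpy as np

def sparse_window_attention(coords, feats, window=3):
    """Single-head attention where each occupied voxel attends only to
    occupied voxels within a (window x window x window) neighborhood.

    coords: (N, 3) integer voxel coordinates of the occupied voxels.
    feats:  (N, C) per-voxel features, used as query/key/value here.
    """
    r = window // 2
    out = np.zeros_like(feats)
    for i, c in enumerate(coords):
        # neighbors: occupied voxels inside the local window (incl. self)
        mask = np.all(np.abs(coords - c) <= r, axis=1)
        k = feats[mask]  # (M, C) keys/values for this query voxel
        logits = feats[i] @ k.T / np.sqrt(feats.shape[1])
        w = np.exp(logits - logits.max())
        w /= w.sum()     # softmax over the M neighbors only
        out[i] = w @ k
    return out
```

Because empty voxels are skipped entirely, the cost scales with the number of occupied voxels times the window size, rather than with the full volume.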

Figure 3: The structure of the SDF transformer. "S-W-Attn" denotes sparse window attention.

Ablation study on the ScanNet dataset.

Figure 5: The qualitative results on the ScanNet dataset. Texture-less rendering is displayed in the appendix.

Evaluation of the 3D meshes on ScanNet. The upper part follows the evaluation in Sun et al. (2021), while the lower part follows Bozic et al. (2021). The metric definitions are explained in the appendix.

Method | Abs Rel ↓ | Abs Diff ↓ | Sq Rel ↓ | RMSE ↓ | δ-1.25 ↑ | δ-1.25² ↑ | δ-1.25³ ↑

Evaluation of the 2D depth maps on the ScanNet dataset. The upper part shows the results of depthbased methods, while the lower part shows volumetric methods, whose depths are rendered from the meshes.

Generalization experiments on the ICL-NUIM and TUM-RGBD datasets.

Method | Acc ↓ | Comp ↓ | Prec ↑ | Recall ↑ | F-score ↑

Ablation study on the ScanNet dataset. Components are added one by one in the upper part.

Efficiency experiments.

Published as a conference paper at ICLR 2023

Metric definitions. n denotes the number of pixels with both valid ground truth and prediction, d and d* denote the predicted and the ground-truth depths, P and P* denote the predicted and the ground-truth point clouds.

δ: max(d/d*, d*/d) < 1.25^i
Acc: mean_{p∈P} (min_{p*∈P*} ||p − p*||)
Comp: mean_{p*∈P*} (min_{p∈P} ||p − p*||)
Prec: mean_{p∈P} (min_{p*∈P*} ||p − p*|| < 0.05)
Recall: mean_{p*∈P*} (min_{p∈P} ||p − p*|| < 0.05)
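The 3D metrics above can be transcribed directly into NumPy as follows (the F-score is taken as the standard harmonic mean of precision and recall, which is not spelled out in the table; the brute-force pairwise distance is for clarity only):

```python
import numpy as np

def mesh_metrics(P, P_star, thresh=0.05):
    """Acc, Comp, Prec, Recall, F-score for point clouds.

    P: (N, 3) predicted points, P_star: (M, 3) ground-truth points,
    thresh: distance threshold in meters (0.05 m as in the table).
    """
    # pairwise distances between predicted and ground-truth points
    d = np.linalg.norm(P[:, None, :] - P_star[None, :, :], axis=-1)
    d_p2g = d.min(axis=1)  # each predicted point to its nearest GT point
    d_g2p = d.min(axis=0)  # each GT point to its nearest predicted point
    acc, comp = d_p2g.mean(), d_g2p.mean()
    prec = (d_p2g < thresh).mean()
    recall = (d_g2p < thresh).mean()
    fscore = 2 * prec * recall / (prec + recall)
    return acc, comp, prec, recall, fscore
```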

Experiments with pose noise, following the metrics of NeuralRecon (Sun et al., 2021).
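The Gaussian pose perturbation used in these experiments can be sketched as below (a simplified stand-in for the procedure in A.5: for compactness, the rotation noise is applied as a small extra rotation built from noisy roll/pitch/yaw, rather than by decomposing the pose itself; all names are illustrative):

```python
import numpy as np

def euler_to_matrix(roll, pitch, yaw):
    """Rotation matrix from Euler angles (radians): R = Rz(yaw) Ry(pitch) Rx(roll)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def perturb_pose(R, t, sigma_t, sigma_r, rng):
    """Add Gaussian noise to a camera pose: sigma_t (meters) on the
    translation, sigma_r (radians) on the three rotation angles."""
    t_noisy = t + rng.normal(0.0, sigma_t, size=3)     # [N_x, N_y, N_z]
    noise_R = euler_to_matrix(*rng.normal(0.0, sigma_r, size=3))
    return noise_R @ R, t_noisy
```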

A.7 MORE RESULTS

The texture-less rendering of the ground-truth meshes is shown in Figure 6 . More results are presented in Figure 7 and Figure 8 .

