3D FORMER: MONOCULAR SCENE RECONSTRUCTION WITH 3D SDF TRANSFORMERS

Abstract

Monocular scene reconstruction from posed images is challenging due to the complexity of large environments. Recent volumetric methods learn to directly predict the TSDF volume and have demonstrated promising results on this task. However, most methods focus on how to extract and fuse 2D features into a 3D feature volume, and none of them improves how the 3D volume itself is aggregated. In this work, we propose an SDF transformer network, which replaces the 3D CNN for better 3D feature aggregation. To reduce the explosive computational complexity of 3D multi-head attention, we propose a sparse window attention module, where attention is only computed between the non-empty voxels within a local window. A top-down-bottom-up 3D attention network is then built for 3D feature aggregation, where a dilate-attention structure is proposed to prevent geometry degeneration, and two global modules are employed to provide global receptive fields. Experiments on multiple datasets show that this 3D transformer network generates a more accurate and complete reconstruction, outperforming previous methods by a large margin. Remarkably, the mesh accuracy is improved by 41.8% and the mesh completeness by 25.3% on the ScanNet dataset.

1. INTRODUCTION

Monocular 3D reconstruction is a classical task in computer vision and is essential for numerous applications such as autonomous navigation, robotics, and augmented/virtual reality. The task aims to reconstruct an accurate and complete dense 3D shape of an unstructured scene from only a sequence of monocular RGB images. While camera poses can be estimated accurately with state-of-the-art SLAM (Campos et al., 2021) or SfM systems (Schonberger & Frahm, 2016), dense 3D scene reconstruction from these posed images remains challenging due to the complex geometry of a large-scale environment: varied objects, changing lighting, reflective surfaces, and diverse cameras with different focal lengths, distortion, and sensor noise. Many previous methods reconstruct the scene in a multi-view depth manner (Yao et al., 2018; Chen et al., 2019; Duzceker et al., 2021). They predict a dense depth map for each target frame, which can capture accurate local geometry but requires additional effort to fuse the depth maps (Murez et al., 2020; Sun et al., 2021), e.g., resolving inconsistencies between different views. Recently, some methods have instead directly regressed the complete 3D surface of the entire scene (Murez et al., 2020; Sun et al., 2021) in a truncated signed distance function (TSDF) representation. They first extract 2D features with a 2D convolutional neural network (CNN), then back-project the features into 3D space. The 3D feature volume is then processed by a 3D CNN to predict a TSDF volume, from which a surface mesh is extracted by marching cubes (Lorensen & Cline, 1987). This style of reconstruction is end-to-end trainable and has been shown to produce accurate, coherent, and complete meshes. In this paper, we follow this volume-based reconstruction path and directly regress the TSDF volume.
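The back-projection step above can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the function name, the nearest-pixel feature gather, and the single-view handling (no multi-view fusion or visibility weighting) are all simplifying assumptions.

```python
import numpy as np

def back_project_features(feat_2d, K, cam_to_world, voxel_origin, voxel_size, grid_dim):
    """Back-project a 2D feature map into a 3D feature volume.

    feat_2d      : (C, H, W) feature map from the 2D backbone
    K            : (3, 3) camera intrinsics
    cam_to_world : (4, 4) camera-to-world pose
    voxel_origin : (3,) world coordinate of voxel (0, 0, 0)
    voxel_size   : edge length of a voxel
    grid_dim     : (Dx, Dy, Dz) voxel grid dimensions
    """
    C, H, W = feat_2d.shape
    # World coordinates of all voxel centers
    ii, jj, kk = np.meshgrid(*[np.arange(d) for d in grid_dim], indexing="ij")
    centers = np.stack([ii, jj, kk], -1).reshape(-1, 3) * voxel_size + voxel_origin
    # Transform voxel centers into the camera frame
    world_to_cam = np.linalg.inv(cam_to_world)
    cam_pts = world_to_cam[:3, :3] @ centers.T + world_to_cam[:3, 3:4]  # (3, N)
    # Project onto the image plane; keep voxels in front of the camera
    z = cam_pts[2]
    valid = z > 1e-3
    uv = K @ cam_pts
    u = np.round(uv[0] / np.maximum(z, 1e-3)).astype(int)
    v = np.round(uv[1] / np.maximum(z, 1e-3)).astype(int)
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Gather features; voxels projecting outside the image stay zero
    vol = np.zeros((C,) + tuple(grid_dim), dtype=feat_2d.dtype)
    vol.reshape(C, -1)[:, valid] = feat_2d[:, v[valid], u[valid]]
    return vol
```

In practice, features from multiple views are averaged or fused with view weights; the sketch shows only a single view for clarity.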
Inspired by the recent successes of vision transformers (Vaswani et al., 2017; Dosovitskiy et al., 2020), some approaches (Bozic et al., 2021; Stier et al., 2021) have adopted this structure for 3D reconstruction, but their usage is limited to fusing the 2D features from different views, while the aggregation of the 3D feature volume is still performed by a 3D CNN. In this paper, we argue that the aggregation of the 3D feature volume is also critical, and that evolving from a 3D CNN to 3D multi-head attention can further improve both the accuracy and completeness of the reconstruction. The limited use of 3D multi-head attention for 3D feature aggregation is mainly due to its explosive computational cost: the attention between every pair of voxels must be computed, which is impractical on a general computing platform. This is also why there are only a few applications of 3D transformers to 3D tasks. In this work, to address these challenges and make the 3D transformer practical for 3D scene reconstruction, we propose a sparse window multi-head attention structure. Inspired by sparse CNNs (Yan et al., 2018), we first sparsify the 3D feature volume with predicted occupancy, reducing the number of voxels to only the occupied ones. Then, to compute the attention for a target voxel, we define a local window centered on it and consider only the non-empty voxels within that window. In this way, the computational complexity of 3D multi-head attention is reduced by orders of magnitude, and the module can be embedded into a network for 3D feature aggregation. With this module, we build the first 3D-transformer-based top-down-bottom-up network, where a dilate-attention module and its inverse are used to downsample and upsample the 3D feature volume.
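The sparse window attention described above can be sketched as a single-head toy version. This is a minimal NumPy illustration under simplifying assumptions: learned query/key/value projections, multiple heads, positional encodings, and efficient hash-based neighbor lookup are all omitted, and the O(N^2) window test below would be replaced by a sparse index in a real implementation.

```python
import numpy as np

def sparse_window_attention(coords, feats, radius=1):
    """Single-head attention where each non-empty voxel attends only to
    non-empty voxels inside the (2*radius+1)^3 window centered on it.

    coords : (N, 3) integer coordinates of the occupied voxels
    feats  : (N, C) voxel features; used directly as queries/keys/values here
    """
    N, C = feats.shape
    out = np.zeros_like(feats)
    for i in range(N):
        # Chebyshev distance selects the cubic local window around voxel i
        in_window = np.all(np.abs(coords - coords[i]) <= radius, axis=1)
        k = feats[in_window]                # (M, C) keys/values, M << N
        scores = k @ feats[i] / np.sqrt(C)  # scaled dot-product scores
        w = np.exp(scores - scores.max())   # numerically stable softmax
        w /= w.sum()
        out[i] = w @ k                      # weighted sum of window values
    return out
```

Because only occupied voxels within a fixed-size window contribute, the per-voxel cost is bounded by the window population rather than the full volume, which is the source of the orders-of-magnitude reduction.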
In addition, to compensate for the local receptive field of the sparse window attention, we add a global attention module and a global context module at the bottom of the network, where the volume size is small. With this network, the 3D shape is estimated in a coarse-to-fine manner over three levels, as displayed in Figure 1. To the best of our knowledge, this is the first paper to employ a 3D transformer for 3D scene reconstruction with a TSDF representation. In the experiments, our method outperforms previous methods by a significant margin on multiple datasets. Specifically, the accuracy metric of the mesh on the ScanNet dataset is reduced by 41.8%, from 0.055 to 0.032, and the completeness metric is reduced by 25.3%, from 0.083 to 0.062. Qualitatively, the meshes reconstructed by our method are dense, accurate, and complete. The main contributions of this work are summarized as follows:
• We propose a sparse window multi-head attention module, which reduces the computational complexity of the 3D transformer to a feasible level.
• We propose a dilate-attention structure to avoid geometry degeneration during downsampling, with which we build the first top-down-bottom-up 3D transformer network for 3D feature aggregation. The network is further improved with bottom-level global attention and global context encoding.
• This 3D transformer is employed to aggregate the 3D features back-projected from the 2D features of an image sequence in a coarse-to-fine manner and to predict TSDF values for accurate and complete 3D reconstruction. The framework shows a significant improvement on multiple datasets.
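One level-to-level step of the coarse-to-fine scheme can be sketched as below. This is a hedged illustration, not the paper's exact procedure: the logit threshold and the octree-style subdivision into eight children are assumptions standing in for the learned occupancy sparsification.

```python
import numpy as np

def sparsify_and_upsample(coords, occ_logits, thresh=0.0):
    """One coarse-to-fine step: keep voxels predicted occupied, then
    subdivide each survivor into its 8 children at the next resolution.

    coords     : (N, 3) integer voxel coordinates at the coarse level
    occ_logits : (N,) predicted occupancy logits for those voxels
    """
    keep = occ_logits > thresh      # sparsify by predicted occupancy
    parents = coords[keep] * 2      # the child grid is twice as fine
    # The 8 child offsets of each kept voxel
    offsets = np.stack(np.meshgrid([0, 1], [0, 1], [0, 1], indexing="ij"), -1).reshape(-1, 3)
    children = (parents[:, None, :] + offsets[None, :, :]).reshape(-1, 3)
    return children
```

Only the children of occupied coarse voxels are processed at the finer level, so the sparse attention operates on a shrinking fraction of the full grid as resolution grows.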

2. RELATED WORK

Depth-based 3D Reconstruction. In traditional methods, reconstructing a 3D model of a scene usually involves estimating depth maps for a series of images and then fusing these depths together into



Figure 1: Overview of the 3D reconstruction framework. Input images are processed by a 2D backbone network to extract features; the 2D features are back-projected and fused into 3D feature volumes, which are aggregated by our 3D SDF transformer to generate the reconstruction in a coarse-to-fine manner.

