DENSE RGB SLAM WITH NEURAL IMPLICIT MAPS

Abstract

There is an emerging trend of using neural implicit functions for map representation in Simultaneous Localization and Mapping (SLAM). Some pioneering works have achieved encouraging results on RGB-D SLAM. In this paper, we present a dense RGB SLAM method with a neural implicit map representation. To reach this challenging goal without depth input, we introduce a hierarchical feature volume to facilitate the implicit map decoder. This design effectively fuses shape cues across different scales to aid map reconstruction. Our method simultaneously solves for the camera motion and the neural implicit map by matching the rendered and input video frames. To facilitate optimization, we further propose a photometric warping loss in the spirit of multi-view stereo to better constrain the camera pose and scene geometry. We evaluate our method on commonly used benchmarks and compare it with modern RGB and RGB-D SLAM systems. Our method achieves more favorable results than previous methods and even surpasses some recent RGB-D SLAM methods. The code is available at poptree.github.io/DIM-SLAM/.

1. INTRODUCTION

Visual SLAM is a fundamental task in 3D computer vision with many applications in AR/VR and robotics. The goal of visual SLAM is to estimate the camera poses and build a 3D map of the environment simultaneously from visual inputs. Visual SLAM methods can be primarily divided into sparse or dense according to their reconstructed 3D maps. Sparse methods (Mur-Artal & Tardós, 2017; Engel et al., 2017) focus on recovering camera motion with a set of sparse or semi-dense 3D points. Dense methods (Newcombe et al., 2011b) seek to recover the depth of every observed pixel and are often more desirable for many downstream applications, such as occlusion handling in AR/VR or obstacle detection in robotics. Earlier methods (Newcombe et al., 2011a; Whelan et al., 2012) often resort to RGB-D cameras for dense map reconstruction. However, RGB-D cameras are better suited to indoor scenes and are more expensive because of their specialized sensors. Another important problem in visual SLAM is map representation. Sparse SLAM methods (Mur-Artal & Tardós, 2017; Engel et al., 2017) typically use point clouds for map representation, while dense methods (Newcombe et al., 2011b; a) usually adopt triangle meshes. As observed in many recent geometry processing works (Mescheder et al., 2019; Park et al., 2019; Chen & Zhang, 2019), neural implicit functions offer a promising representation for 3D data processing. The pioneering work iMAP (Sucar et al., 2021) introduces an implicit map representation for dense visual SLAM. This map representation is more compact, continuous, and allows prediction of unobserved areas, which could potentially benefit applications like path planning (Shrestha et al., 2019) and object manipulation (Sucar et al., 2020). However, as observed in NICE-SLAM (Zhu et al., 2022), iMAP (Sucar et al., 2021) is limited to room-scale scenes due to the restricted representation power of MLPs.
NICE-SLAM (Zhu et al., 2022) introduces a hierarchical feature volume to facilitate map reconstruction and generalize the implicit map to larger scenes. However, both iMAP (Sucar et al., 2021) and NICE-SLAM (Zhu et al., 2022) are limited to RGB-D cameras. This paper presents a novel dense visual SLAM method with regular RGB cameras based on the implicit map representation. We also adopt a hierarchical feature volume like NICE-SLAM to deal with larger scenes, but our formulation is more suitable for visual SLAM. Firstly, the decoders in NICE-SLAM (Zhu et al., 2022) are pretrained, which might cause problems when generalizing to different scenes (Yang et al., 2021), while our method learns the scene features and decoders together on the fly to avoid this generalization problem. Secondly, NICE-SLAM (Zhu et al., 2022) computes the occupancy at each point from features at different scales respectively and then sums these occupancies together, while we fuse features from all scales to compute the occupancy at once. In this way, our optimization becomes much faster and can thus afford more pixels and iterations, enabling our framework to work in the RGB setting. In experiments, we find the number of feature hierarchy levels is important for enhancing the system's accuracy and robustness. Intuitively, features from fine volumes capture geometry details, while features from coarse volumes enforce geometric regularity like smoothness or planarity. While NICE-SLAM (Zhu et al., 2022) only optimizes two feature volumes with voxel sizes of 32cm and 16cm, our method solves six feature volumes with voxel sizes from 8cm to 64cm. Our fusion of features across many different scales leads to more robust and accurate tracking and mapping, as demonstrated in experiments. Another challenge in our setting is that there are no input depth observations.
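The fuse-then-decode design described above can be sketched as follows. This is a minimal illustration, not the paper's exact architecture: the number of levels, feature dimension, grid resolutions, and decoder width are all hypothetical choices, and the scene is assumed to be normalized to a unit cube so features can be queried by trilinear interpolation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalFeatureVolume(nn.Module):
    """Sketch of a multi-scale feature volume with a single shared decoder.

    Features interpolated from all levels are concatenated and decoded to
    occupancy in one pass, rather than decoding each level separately and
    summing per-level occupancies. All sizes here are illustrative.
    """

    def __init__(self, num_levels=6, feature_dim=4, base_res=8):
        super().__init__()
        # One dense, optimizable feature grid per level; resolution doubles
        # at each finer level (e.g., 64cm down to 8cm voxels in the paper).
        self.volumes = nn.ParameterList([
            nn.Parameter(0.01 * torch.randn(
                1, feature_dim,
                base_res * 2**l, base_res * 2**l, base_res * 2**l))
            for l in range(num_levels)
        ])
        # A single small MLP maps the fused multi-scale feature to occupancy.
        self.decoder = nn.Sequential(
            nn.Linear(num_levels * feature_dim, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, points):
        """points: (N, 3) in [-1, 1]^3. Returns occupancy logits (N, 1)."""
        # grid_sample over a 5D volume expects query shape (1, N, 1, 1, 3).
        grid = points.view(1, -1, 1, 1, 3)
        feats = [
            F.grid_sample(vol, grid, align_corners=True)  # (1, C, N, 1, 1)
             .squeeze(-1).squeeze(-1).squeeze(0).t()      # -> (N, C)
            for vol in self.volumes
        ]
        # Fuse all scales, then decode occupancy once.
        return self.decoder(torch.cat(feats, dim=-1))
```

Because both the grids and the decoder are `nn.Parameter`s/modules, they can be optimized jointly on the fly with the camera poses, consistent with the no-pretraining design.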
Therefore, we design a sophisticated warping loss to further constrain the camera motion and scene map, in the same spirit as multi-view stereo (Zheng et al., 2014; Newcombe et al., 2011b; Wang et al., 2021b; Yu et al., 2021). Specifically, we warp one frame to other nearby frames according to the estimated scene map and camera poses and optimize the solution to minimize the warping loss. However, this warping loss is subject to view-dependent intensity changes such as specular reflections. To address this problem, we carefully sample image pixels visible in multiple video frames and evaluate the structural similarity of their surrounding patches to build a robust system. We perform extensive evaluations on three different datasets and achieve state-of-the-art performance on both mapping and camera tracking. Our method even surpasses recent RGB-D based methods like iMAP (Sucar et al., 2021) and NICE-SLAM (Zhu et al., 2022) on camera tracking. Our contributions can be summarized as follows:
• We design the first dense RGB SLAM system with a neural implicit map representation.
• We introduce a hierarchical feature volume for better occupancy evaluation and a multi-scale patch-based warping loss to boost system performance with only RGB inputs.
• We achieve strong results on benchmark datasets and even surpass some recent RGB-D methods.
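The geometric core of such a warping loss is standard multi-view reprojection: backproject pixels of one frame using the estimated depth (which in our setting would come from rendering the implicit map), apply the relative camera pose, and reproject into a nearby frame. The sketch below shows only this warping step under a pinhole model; the patch sampling and structural-similarity comparison around the warped pixels are omitted, and all variable names are illustrative.

```python
import torch

def warp_pixels(depth, uv, K, T_ij):
    """Warp pixels from frame i into frame j for a photometric warping loss.

    depth: (N,) estimated depth of each pixel in frame i
    uv:    (N, 2) pixel coordinates in frame i
    K:     (3, 3) camera intrinsics (pinhole model)
    T_ij:  (4, 4) relative pose mapping frame-i camera coordinates to frame j
    Returns (N, 2) warped pixel coordinates in frame j; the loss would then
    compare patches around uv in frame i with patches around these warped
    locations in frame j (e.g., via structural similarity).
    """
    N = uv.shape[0]
    ones = torch.ones(N, 1, dtype=uv.dtype)
    # Backproject to 3D points in the camera frame of view i.
    pix_h = torch.cat([uv, ones], dim=-1)                    # (N, 3)
    pts_i = depth.unsqueeze(-1) * (torch.linalg.inv(K) @ pix_h.t()).t()
    # Rigid transform into the camera frame of view j.
    pts_h = torch.cat([pts_i, ones], dim=-1)                 # (N, 4)
    pts_j = (T_ij @ pts_h.t()).t()[:, :3]
    # Perspective projection back to pixels (guard against division by ~0).
    proj = (K @ pts_j.t()).t()
    return proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
```

Since the warp is differentiable in both the depth and the pose, minimizing the photometric discrepancy propagates gradients to the implicit map and the camera parameters simultaneously.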

2. RELATED WORK

Visual SLAM Many visual SLAM algorithms and systems have been developed over the last two decades. We only quickly review some of the most relevant works; more comprehensive surveys can be found in (Cadena et al., 2016; Macario Barros et al., 2022). Sparse visual SLAM algorithms (Klein & Murray, 2007; Mur-Artal & Tardós, 2017) focus on solving accurate camera poses and only recover a sparse set of 3D landmarks used for camera tracking. Semi-dense methods like LSD-SLAM (Engel et al., 2014) and DSO (Engel et al., 2017) achieve more robust tracking in textureless scenes by reconstructing the semi-dense pixels with strong image gradients. In comparison, dense visual SLAM algorithms (Newcombe et al., 2011b) aim to solve the depth of every observed pixel, which is very challenging, especially in featureless regions. In the past, dense visual SLAM was often solved with RGB-D sensors. KinectFusion (Newcombe et al., 2011a) and the follow-up works (Whelan et al., 2012; 2021; Dai et al., 2017) register the input sequence of depth images through Truncated Signed Distance Functions (TSDFs) to track camera motion and recover a clean scene model at the same time. Most recently, iMAP and NICE-SLAM introduce neural implicit functions as the map representation and achieve better scene completeness, especially for unobserved regions. All these methods are limited to RGB-D cameras. More recently, deep neural networks have been employed to solve dense depth from regular RGB cameras. Earlier methods (Zhou et al., 2017; Ummenhofer et al., 2017; Zhou et al., 2018) directly regress camera ego-motion and scene depth from input images. Later, multi-view geometry constraints were enforced in (Bloesch et al., 2018; Tang & Tan, 2019; Teed & Deng, 2020; Wei et al., 2020; Teed & Deng, 2021) for better generalization and accuracy. However, they all only recover depth maps instead of complete scene models. Our method also employs deep neural networks to solve the challenging

