DENSE RGB SLAM WITH NEURAL IMPLICIT MAPS

Abstract

There is an emerging trend of using neural implicit functions for map representation in Simultaneous Localization and Mapping (SLAM). Some pioneering works have achieved encouraging results on RGB-D SLAM. In this paper, we present a dense RGB SLAM method with neural implicit map representation. To reach this challenging goal without depth input, we introduce a hierarchical feature volume to facilitate the implicit map decoder. This design effectively fuses shape cues across different scales to facilitate map reconstruction. Our method simultaneously solves the camera motion and the neural implicit map by matching the rendered and input video frames. To facilitate optimization, we further propose a photometric warping loss in the spirit of multi-view stereo to better constrain the camera pose and scene geometry. We evaluate our method on commonly used benchmarks and compare it with modern RGB and RGB-D SLAM systems. Our method achieves more favorable results than previous methods and even surpasses some recent RGB-D SLAM methods. The code is at poptree.github.io/DIM-SLAM/.

1. INTRODUCTION

Visual SLAM is a fundamental task in 3D computer vision with many applications in AR/VR and robotics. The goal of visual SLAM is to estimate the camera poses and build a 3D map of the environment simultaneously from visual inputs. Visual SLAM methods can be primarily divided into sparse or dense according to their reconstructed 3D maps. Sparse methods (Mur-Artal & Tardós, 2017; Engel et al., 2017) focus on recovering camera motion with a set of sparse or semi-dense 3D points. Dense methods (Newcombe et al., 2011b) seek to recover the depth of every observed pixel and are often more desirable for many downstream applications such as occlusion in AR/VR or obstacle detection in robotics. Earlier methods (Newcombe et al., 2011a; Whelan et al., 2012) often resort to RGB-D cameras for dense map reconstruction. However, RGB-D cameras are mostly suitable for indoor scenes and are more expensive because of the specialized sensors. iMAP (Sucar et al., 2021) introduces an implicit map representation for dense visual SLAM. This map representation is more compact and continuous, and allows for prediction of unobserved areas, which could potentially benefit applications like path planning (Shrestha et al., 2019) and object manipulation (Sucar et al., 2020). However, as observed in NICE-SLAM (Zhu et al., 2022), iMAP (Sucar et al., 2021) is limited to room-scale scenes due to the restricted representation power of MLPs. NICE-SLAM (Zhu et al., 2022) introduces a hierarchical feature volume to facilitate the map reconstruction and generalize the implicit map to larger scenes. However, both iMAP (Sucar et al., 2021) and NICE-SLAM (Zhu et al., 2022) are limited to RGB-D cameras. This paper presents a novel dense visual SLAM method with regular RGB cameras based on the implicit map representation. We also adopt a hierarchical feature volume like NICE-SLAM to deal with larger scenes.
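To make the multi-view-stereo-style photometric warping loss mentioned above concrete, the sketch below shows its basic mechanics: a pixel in one frame is back-projected with its estimated depth, transformed by the relative camera pose, reprojected into another frame, and the intensity difference is penalized. This is a minimal NumPy illustration under assumed conventions (a pinhole intrinsic matrix K, a 4x4 relative pose T_ij, nearest-neighbor sampling), not the paper's actual implementation, which would use differentiable bilinear sampling over rendered depths.

```python
import numpy as np

def warp_pixel(u, K, K_inv, T_ij, depth):
    """Project pixel u = (x, y) from frame i into frame j using its
    estimated depth and the relative pose T_ij (4x4, frame i -> frame j)."""
    p_i = depth * (K_inv @ np.array([u[0], u[1], 1.0]))  # back-project to 3D
    p_j = T_ij[:3, :3] @ p_i + T_ij[:3, 3]               # move into frame j
    uv = K @ (p_j / p_j[2])                              # perspective projection
    return uv[:2]

def photometric_warping_loss(img_i, img_j, pixels, depths, K, T_ij):
    """Mean absolute intensity difference between sampled pixels in
    frame i and their warped correspondences in frame j."""
    K_inv = np.linalg.inv(K)
    h, w = img_j.shape[:2]
    loss, n = 0.0, 0
    for (x, y), d in zip(pixels, depths):
        xj, yj = warp_pixel((x, y), K, K_inv, T_ij, d)
        xj, yj = int(round(xj)), int(round(yj))  # nearest neighbor for brevity
        if 0 <= xj < w and 0 <= yj < h:          # ignore out-of-view warps
            loss += abs(float(img_i[y, x]) - float(img_j[yj, xj]))
            n += 1
    return loss / max(n, 1)
```

When the depths and relative pose are consistent with the true scene geometry, the warped intensities match and the loss vanishes; gradients of this residual constrain both the camera pose and the implicit map's geometry.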



Another important problem in visual SLAM is map representation. Sparse SLAM methods (Mur-Artal & Tardós, 2017; Engel et al., 2017) typically use point clouds for map representation, while dense methods (Newcombe et al., 2011b;a) usually adopt triangle meshes. As observed in many recent geometry processing works (Mescheder et al., 2019; Park et al., 2019; Chen & Zhang, 2019), neural implicit functions offer a promising representation for 3D data processing. The pioneer work,
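A hierarchical-feature-volume map of the kind discussed above can be sketched as multiple feature grids at increasing resolutions: a 3D query point trilinearly interpolates a feature from each grid, and the concatenated features are decoded (by a small MLP in practice) into occupancy or color. The snippet below is an illustrative NumPy sketch with assumed grid layout (D, H, W, C) and normalized coordinates, not the actual NICE-SLAM or DIM-SLAM code.

```python
import numpy as np

def trilinear_sample(grid, p):
    """Trilinearly interpolate a feature grid of shape (D, H, W, C) at a
    continuous point p given in normalized volume coordinates [0, 1]^3."""
    D, H, W, C = grid.shape
    x = np.clip(np.asarray(p, dtype=float) * [D - 1, H - 1, W - 1], 0.0, None)
    i0 = np.minimum(x.astype(int), [D - 2, H - 2, W - 2])  # lower corner index
    f = x - i0                                             # fractional offsets
    out = np.zeros(C)
    for dz in (0, 1):            # blend the 8 surrounding grid vertices
        for dy in (0, 1):
            for dx in (0, 1):
                w = ((f[0] if dz else 1 - f[0]) *
                     (f[1] if dy else 1 - f[1]) *
                     (f[2] if dx else 1 - f[2]))
                out += w * grid[i0[0] + dz, i0[1] + dy, i0[2] + dx]
    return out

def query_map(point, volumes, decoder):
    """Concatenate features sampled from coarse-to-fine volumes and decode
    them into an occupancy/color value; `decoder` stands in for the MLP."""
    feats = np.concatenate([trilinear_sample(v, point) for v in volumes])
    return decoder(feats)
```

Coarse grids capture large-scale scene layout while fine grids add detail, which is what lets grid-based implicit maps scale beyond the room-size limit of a single MLP.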

