S-NERF: NEURAL RADIANCE FIELDS FOR STREET VIEWS

Abstract

Neural Radiance Fields (NeRFs) aim to synthesize novel views of objects and scenes, given object-centric camera views with large overlaps. However, we conjecture that this paradigm does not fit the nature of street views, which are collected by self-driving cars in large-scale unbounded scenes where the onboard cameras perceive little overlap between views. As a result, existing NeRFs often produce blurs, "floaters", and other artifacts in street-view synthesis. In this paper, we propose a new street-view NeRF (S-NeRF) that jointly considers novel view synthesis of both the large-scale background scenes and the foreground moving vehicles. Specifically, we improve the scene parameterization function and the camera poses to learn better neural representations from street views. We also use the noisy and sparse LiDAR points to boost training, and learn a robust geometry- and reprojection-based confidence to address depth outliers. Moreover, we extend S-NeRF to reconstruct moving vehicles, which is impracticable for conventional NeRFs. Thorough experiments on large-scale driving datasets (e.g., nuScenes and Waymo) demonstrate that our method beats state-of-the-art rivals, reducing the mean-squared error by 7∼40% in street-view synthesis and achieving a 45% PSNR gain for moving-vehicle rendering.

1. INTRODUCTION

Neural Radiance Fields (Mildenhall et al., 2020) have shown impressive performance on photorealistic novel view rendering. However, the original NeRF is designed for object-centric scenes and requires camera views to be heavily overlapped (as shown in Figure 1(a)). Recently, more and more street-view data have been collected by self-driving cars. Reconstruction and novel view rendering for street views can be very useful in driving simulation, data generation, AR, and VR. However, these data are often collected in unbounded outdoor scenes (e.g., the nuScenes (Caesar et al., 2019) and Waymo (Sun et al., 2020) datasets). The camera placements of such data acquisition systems usually follow a panoramic setting without object-centric camera views (Figure 1(b)). Moreover, the overlaps between adjacent camera views are too small to be effective for training NeRFs. Since the ego car moves fast, some objects or scene contents appear in only a limited number of views (e.g., most vehicles must be reconstructed from just 2∼6 views). All these problems make it difficult to optimize existing NeRFs for street-view synthesis.

MipNeRF-360 (Barron et al., 2022) is designed for training in unbounded scenes. Block-NeRF (Tancik et al., 2022) proposes a block-combination strategy with refined poses, appearances, and exposure on top of the MipNeRF (Barron et al., 2021) base model for processing large-scale outdoor scenes. However, these methods still require enough intersecting camera rays (Figure 1(a)) and large overlaps across different cameras (e.g., Block-NeRF uses a special system with twelve cameras for data acquisition to guarantee enough overlap between different camera views). They produce many blurs, "floaters", and other artifacts when trained on existing self-driving datasets (e.g., nuScenes (Caesar et al., 2019) and Waymo (Sun et al., 2020)), as shown in Figure 2(a). Urban-NeRF (Rematas et al., 2022) takes accurate dense LiDAR depth as supervision for reconstructing urban scenes; however, such dense LiDAR depth is difficult and expensive to collect.
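To make the idea of a scene parameterization for unbounded scenes concrete, the sketch below implements the space-contraction function published with mip-NeRF 360 (Barron et al., 2022): points inside the unit ball are left unchanged, while distant points are smoothly compressed into a ball of radius 2, so the network only ever sees bounded coordinates. This is a minimal NumPy sketch of that published formula for illustration, not the modified parameterization that S-NeRF proposes.

```python
import numpy as np

def contract(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Mip-NeRF 360 scene contraction (Barron et al., 2022).

    Maps unbounded 3D points into a ball of radius 2:
    points with ||x|| <= 1 are unchanged, and distant points are
    compressed so that ||contract(x)|| -> 2 as ||x|| -> infinity.
    """
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    safe_norm = np.maximum(norm, eps)  # avoid division by zero at the origin
    contracted = (2.0 - 1.0 / safe_norm) * (x / safe_norm)
    return np.where(norm <= 1.0, x, contracted)

# A nearby point passes through unchanged; a far-away point is pulled
# inward to just under radius 2.
print(contract(np.array([[0.3, 0.2, 0.1], [100.0, 0.0, 0.0]])))
```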

