S-NERF: NEURAL RADIANCE FIELDS FOR STREET VIEWS

Abstract

Neural Radiance Fields (NeRFs) aim to synthesize novel views of objects and scenes, given object-centric camera views with large overlaps. However, we conjecture that this paradigm does not fit the nature of street views, which are collected by many self-driving cars from large-scale unbounded scenes, where the onboard cameras perceive the surroundings with little overlap. Thus, existing NeRFs often produce blurs, "floaters" and other artifacts on street-view synthesis. In this paper, we propose a new street-view NeRF (S-NeRF) that jointly considers novel view synthesis of both the large-scale background scenes and the foreground moving vehicles. Specifically, we improve the scene parameterization function and the camera poses for learning better neural representations from street views. We also use the noisy and sparse LiDAR points to boost the training and learn a robust geometry- and reprojection-based confidence to address depth outliers. Moreover, we extend our S-NeRF to reconstruct moving vehicles, which is impracticable for conventional NeRFs. Thorough experiments on large-scale driving datasets (e.g., nuScenes and Waymo) demonstrate that our method beats the state-of-the-art rivals, reducing the mean-squared error in street-view synthesis by 7∼40% and achieving a 45% PSNR gain for moving-vehicle rendering.

1. INTRODUCTION

Neural Radiance Fields (Mildenhall et al., 2020) have shown impressive performance on photorealistic novel view rendering. However, the original NeRF is designed for object-centric scenes and requires camera views to be heavily overlapped (as shown in Figure 1(a)). Recently, more and more street-view data are collected by self-driving cars. Reconstruction and novel view rendering for street views can be very useful in driving simulation, data generation, AR and VR. However, these data are often collected in unbounded outdoor scenes (e.g., the nuScenes (Caesar et al., 2019) and Waymo (Sun et al., 2020) datasets). The camera placements of such data acquisition systems are usually in a panoramic setting without object-centric camera views (Figure 1(b)). Moreover, the overlaps between adjacent camera views are too small to be effective for training NeRFs. Since the ego car moves fast, some objects or contents appear in only a limited number of image views (e.g., most of the vehicles need to be reconstructed from just 2∼6 views). All these problems make it difficult to optimize existing NeRFs for street-view synthesis. MipNeRF-360 (Barron et al., 2022) is designed for training in unbounded scenes. Block-NeRF (Tancik et al., 2022) proposes a block-combination strategy with refined poses, appearances and exposure on the MipNeRF (Barron et al., 2021) base model for processing large-scale outdoor scenes. However, they still require sufficiently intersected camera rays (Figure 1(a)) and large overlaps across different cameras (e.g., Block-NeRF uses a special system with twelve cameras for data acquisition to guarantee enough overlaps between different camera views). They produce many blurs, "floaters" and other artifacts when trained on existing self-driving datasets (e.g., nuScenes (Caesar et al., 2019) and Waymo (Sun et al., 2020), as shown in Figure 2(a)).
Urban-NeRF (Rematas et al., 2022) takes accurate dense LiDAR depth as supervision for the reconstruction of urban scenes. However, such dense LiDAR depths are difficult and expensive to collect. In this paper, we contribute a new NeRF design (S-NeRF) for novel view synthesis of both the large-scale (background) scenes and the foreground moving vehicles. Different from other large-scale NeRFs (Tancik et al., 2022; Rematas et al., 2022), our method does not require the specially designed data acquisition platforms used by those methods. Our S-NeRF can be trained on standard self-driving datasets (e.g., nuScenes (Caesar et al., 2019) and Waymo (Sun et al., 2020)) that are collected by common self-driving cars with fewer cameras and noisy sparse LiDAR points, and synthesizes novel street views from them. We improve the scene parameterization function and the camera poses for learning better neural representations from street views. We also develop a novel depth rendering and supervision method that uses the noisy sparse LiDAR signals to effectively train our S-NeRF for street-view synthesis. To deal with depth outliers, we propose a new confidence metric learned from robust geometry and reprojection consistencies. Beyond the background scenes, we further extend our S-NeRF to high-quality reconstruction of moving vehicles (e.g., moving cars) using the proposed virtual camera transformation. In the experiments, we demonstrate the performance of our S-NeRF on the standard driving datasets (Caesar et al., 2019; Sun et al., 2020). For static scene reconstruction, our S-NeRF far outperforms the large-scale NeRFs (Barron et al., 2021; 2022; Rematas et al., 2022), reducing the mean-squared error by 7∼40% and producing impressive depth renderings (Figure 2(b)).
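The confidence-weighted depth supervision can be illustrated with a minimal sketch. This is not the paper's actual implementation; the function and array names (`confidence_weighted_depth_loss`, `rendered_depth`, `lidar_depth`, `confidence`) are illustrative assumptions, and the confidence values would in practice be learned from the geometry and reprojection consistencies rather than given.

```python
import numpy as np

def confidence_weighted_depth_loss(rendered_depth, lidar_depth, confidence):
    """Sketch: penalize the rendered-vs-LiDAR depth error only where the
    (learned) confidence says the sparse LiDAR point is reliable."""
    # Pixels without a LiDAR return carry no supervision (depth <= 0 here).
    valid = lidar_depth > 0
    err = np.abs(rendered_depth - lidar_depth)
    # Down-weight likely outliers (e.g. points on moving objects or at
    # occlusion boundaries) via a per-point confidence in [0, 1].
    return float(np.sum(confidence[valid] * err[valid]) / max(valid.sum(), 1))
```

With full confidence everywhere this reduces to an ordinary masked L1 depth loss; outlier points are simply faded out instead of being hard-thresholded.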
For the foreground objects, S-NeRF is shown to be capable of reconstructing moving vehicles in high quality, which is impracticable for conventional NeRFs (Mildenhall et al., 2020; Barron et al., 2021; Deng et al., 2022). It also beats the latest mesh-based reconstruction method (Chen et al., 2021b), improving the PSNR by 45% and the structural similarity by 18%.

2. RELATED WORK

2.1. 3D RECONSTRUCTION AND NOVEL VIEW RENDERING

Learning-based approaches have been widely used in 3D scene and object reconstruction (Sitzmann et al., 2019; Xu et al., 2019; Engelmann et al., 2021). They encode features through a deep neural network and learn various geometry representations, such as voxels (Kar et al., 2017; Sitzmann et al., 2019), patches (Groueix et al., 2018) and meshes (Wang et al., 2018; Chen et al., 2021b).

2.2. NEURAL RADIANCE FIELDS

Neural Radiance Fields (NeRF) is proposed in (Mildenhall et al., 2020) as an implicit neural representation for novel view synthesis. Various types of NeRFs have been proposed for acceleration (Yu et al., 2021a; Rebain et al., 2021) and better generalization abilities (Yu et al., 2021b; Trevithick &
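For background, the core of NeRF is differentiable volume rendering along camera rays: densities and colors predicted at ray samples are composited into a pixel color. The following NumPy sketch shows this compositing step under standard NeRF conventions; it is illustrative, not any specific implementation.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite per-sample densities and colors along one ray.
    sigmas: (N,) densities, colors: (N, 3) RGB, deltas: (N,) sample spacings."""
    alphas = 1.0 - np.exp(-sigmas * deltas)            # opacity per sample
    # Transmittance: probability the ray reaches each sample unblocked.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                           # contribution per sample
    rgb = (weights[:, None] * colors).sum(axis=0)      # expected ray color
    return rgb, weights
```

The same weights can be reused to render an expected depth (`weights @ sample_depths`), which is how NeRF-style depth maps are typically obtained.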



Figure 1: Problem illustration. (a) Conventional NeRFs (Mildenhall et al., 2020; Barron et al., 2021) require object-centric camera views with large overlaps. (b) In the challenging large-scale outdoor driving scenes (Caesar et al., 2019; Sun et al., 2020), the camera placements for data collection are usually in a panoramic view setting. Rays from different cameras barely intersect with each other in the unbounded scenes. The overlapped field of view between adjacent cameras is too small to be effective for training the existing NeRF models. As shown in Figure 3(a), data (Caesar et al., 2019; Sun et al., 2020) collected by self-driving cars cannot be used for training Urban-NeRF because they only acquire sparse LiDAR points with plenty of outliers when projected onto images (e.g., only 2∼5K points are captured for each nuScenes image).
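The sparsity mentioned in the caption arises when a LiDAR sweep is projected onto a camera image: only a few thousand points land inside the frame. A minimal pinhole-projection sketch of this step follows; the function name and conventions (points already transformed into camera coordinates, intrinsics `K`) are illustrative assumptions, not dataset-specific API.

```python
import numpy as np

def project_lidar_to_image(points_cam, K, h, w):
    """Project LiDAR points (in camera coordinates) onto the image plane.
    Returns pixel coordinates and depths of points that land in-frame."""
    z = points_cam[:, 2]
    in_front = z > 1e-6                  # discard points behind the camera
    p = points_cam[in_front]
    uv = (K @ p.T).T                     # homogeneous pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]          # perspective divide
    u, v = uv[:, 0], uv[:, 1]
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    return uv[inside], p[inside, 2]      # sparse pixel coords + depths
```

On real sweeps most returns fall outside a single camera's frustum or behind it, which is why each image ends up with only a sparse scattering of depth samples.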

Traditional approaches for novel view rendering (Agarwal et al., 2011) often rely on Structure-from-Motion (SfM), multi-view stereo and graphics rendering (Losasso & Hoppe, 2004).

