VIEW SYNTHESIS WITH SCULPTED NEURAL POINTS

Abstract

We address the task of view synthesis, generating novel views of a scene given a set of images as input. In many recent works such as NeRF (Mildenhall et al., 2020), the scene geometry is parameterized using neural implicit representations (i.e., MLPs). Implicit neural representations have achieved impressive visual quality but have drawbacks in computational efficiency. In this work, we propose a new approach that performs view synthesis using point clouds. It is the first point-based method that achieves better visual quality than NeRF while being 100× faster in rendering speed. Our approach builds on existing works on differentiable point-based rendering but introduces a novel technique we call "Sculpted Neural Points (SNP)", which significantly improves the robustness to errors and holes in the reconstructed point cloud. We further propose to use view-dependent point features based on spherical harmonics to capture non-Lambertian surfaces, along with new designs in the point-based rendering pipeline that further boost performance. Finally, we show that our system supports fine-grained scene editing.

1. INTRODUCTION

We address the task of view synthesis: generating novel views of a scene given a set of images as input. It has important applications including augmented and virtual reality. View synthesis can be posed as the task of recovering from existing images a rendering function that maps an arbitrary viewpoint to an image. In many recent works, this rendering function is parameterized using neural implicit representations of scene geometry (Mildenhall et al., 2020; Yu et al., 2021c; Park et al., 2021; Garbin et al., 2021; Niemeyer et al., 2021). In particular, NeRF (Mildenhall et al., 2020) represents 3D geometry as a neural network that maps a 3D coordinate to a scalar volume density. Implicit neural representations have achieved impressive visual quality but are typically computationally inefficient. To render a single pixel, NeRF needs to evaluate the neural network at hundreds of 3D points along the ray, which is wasteful because most of 3D space is unoccupied. NeRF's implicit representation also makes it inflexible for scene editing operations such as deformation, which are important for downstream applications including augmented reality and video games. Several works enable NeRF to perform scene editing (Lombardi et al., 2019; Liu et al., 2021; Yang et al., 2021a; Pumarola et al., 2021), but either the editing operations are highly constrained or images captured under all desired object poses are required. On the other hand, this limitation is easily overcome by explicit representations such as meshes or point clouds. To rasterize a mesh or a point cloud, no computation is wasted on unoccupied 3D space. Scene editing operations such as composition and deformation are also straightforward. Moreover, rasterizing meshes or point clouds is a mature technology already widely deployed in industry for movies and video games, capable of real-time performance and high realism.
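The per-pixel cost described above can be made concrete with a toy sketch of NeRF-style volume rendering along a single ray. This is an illustrative reconstruction, not the paper's code: `field` is a stand-in for the MLP, and the sample count and quadrature scheme are assumptions.

```python
import numpy as np

def render_ray_nerf_style(field, origin, direction, near, far, n_samples=192):
    """Toy volume rendering along one ray, NeRF-style.

    `field` stands in for the MLP: it maps a batch of 3D points to
    (density, rgb). Every pixel triggers `n_samples` network
    evaluations, even though most sample points land in empty space --
    the inefficiency the text describes.
    """
    t = np.linspace(near, far, n_samples)                  # sample depths
    pts = origin[None, :] + t[:, None] * direction[None, :]
    sigma, rgb = field(pts)                                # n_samples queries
    delta = np.diff(t, append=t[-1] + (t[1] - t[0]))       # interval lengths
    alpha = 1.0 - np.exp(-sigma * delta)                   # per-interval opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(axis=0)            # composited color
```

A point-based renderer, by contrast, touches only the points that actually project into the pixel, skipping the empty-space queries entirely.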
An intriguing question is whether we can achieve state-of-the-art visual quality using explicit representations such as point clouds. The basic framework of point-based neural rendering is to represent the scene as a featurized point cloud, reconstructed through a multiview stereo (MVS) system. The features are learned by maximizing photoconsistency on the input images via differentiable rendering. Although this framework has been studied in several recent works (Aliev et al., 2020; Wiles et al., 2020; Lassner & Zollhofer, 2021), the overall rendering quality still lags behind NeRF, mainly due to the ghosting effects and blurriness caused by errors in geometry. Our approach adopts this basic framework but introduces a novel technique we call "Sculpted Neural Points (SNP)", which significantly improves the robustness to errors and holes in the reconstructed point cloud. The idea is to "sculpt" the initial point cloud reconstructed by the MVS system: we remove existing points and add new points to improve the photo-consistency of the renders against the input images. These sculpting decisions are discrete in nature, but are tightly coupled with gradient-based optimization of the continuous per-point features. We further propose several novel designs in the point-based rendering pipeline that boost performance. We use spherical harmonics (SH) in a high-dimensional point feature space to capture non-Lambertian visual effects, which is faster and better than using MLPs. Inspired by Dropout (Srivastava et al., 2014), we propose a point dropout layer that significantly improves generalization to novel views. Last but not least, we find it essential not to use any normalization layers in the U-Net. Compared to previous works that use point cloud-based representations, ours is the first model that achieves better rendering quality than NeRF, while rendering 100× faster and reducing training time by 66%.
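The SH-based view-dependent features can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the actual SH degree and feature dimensionality are not specified here, so we assume degree-1 harmonics (4 basis functions) and an 8-dimensional feature for concreteness. The key point is that evaluating the basis and taking a dot product is far cheaper than querying an MLP per view direction.

```python
import numpy as np

def sh_basis_deg1(view_dir):
    """First 4 real spherical-harmonics basis values for a unit view direction.

    Degree-1 SH is an assumption for illustration; higher degrees add
    more coefficients per point but follow the same pattern.
    """
    x, y, z = view_dir
    return np.array([
        0.28209479177387814,         # l=0 (constant term)
        -0.48860251190291987 * y,    # l=1, m=-1
        0.48860251190291987 * z,     # l=1, m=0
        -0.48860251190291987 * x,    # l=1, m=1
    ])

def view_dependent_feature(sh_coeffs, view_dir):
    """Blend a point's learned SH coefficients into one feature vector.

    sh_coeffs: (n_basis, feat_dim) array learned per point.
    The view dependence costs a single matrix-vector product, versus a
    full MLP forward pass per query in NeRF-style pipelines.
    """
    basis = sh_basis_deg1(view_dir)   # (n_basis,)
    return basis @ sh_coeffs          # (feat_dim,)
```

A point whose coefficients are nonzero only in the l=0 slot yields the same feature from every direction, i.e., a Lambertian point; the l=1 coefficients let the feature vary smoothly with viewing angle.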
We evaluate our method on common benchmarks including DTU (Jensen et al., 2014), LLFF (Mildenhall et al., 2019), NeRF-Synthetic (Mildenhall et al., 2020), and Tanks&Temples (Knapitsch et al., 2017), and our method performs better than or comparably to all baselines. Finally, we show that our model allows fine-grained scene editing in a user-friendly way. Compared to previous works that can only do object-level composition (Lombardi et al., 2019; Yu et al., 2021b; Yang et al., 2021b) or require a special user interface (Liu et al., 2021), our point-based representation inherently supports editing at a finer resolution, and users can edit the point cloud with existing graphics toolboxes without any custom changes. The main contributions of this paper are three-fold: 1) We propose a novel point-based approach to view synthesis, "Sculpted Neural Points", a technique that is key to achieving high-quality and view-consistent output; 2) We demonstrate, for the first time, that a point-based method can achieve better visual quality than NeRF while being 100× faster in rendering; 3) We propose several improvements to the point-based rendering pipeline that significantly boost visual quality.

2. RELATED WORK

Methods for view synthesis can be categorized by how they represent the scene geometry. View Synthesis with Implicit Representations NeRF (Mildenhall et al., 2020) uses a neural network to map a 3D spatial location to volume density. To render a pixel, the neural network must be evaluated repeatedly along the ray, making rendering computationally expensive. Follow-up works on NeRF (Yu et al., 2021c; b; Park et al., 2021) focus on improving the speed or the general-



Figure 1: The overall pipeline of the Sculpted Neural Points. We first use an MVS network to extract a point cloud. We then sculpt the point cloud by pruning (blue points) and adding (red points). The featurized point cloud finally passes through a differentiable rendering module to produce the image.

