GECONERF: FEW-SHOT NEURAL RADIANCE FIELDS VIA GEOMETRIC CONSISTENCY

Abstract

We present a novel framework to regularize a Neural Radiance Field (NeRF) in a few-shot setting with geometry-aware consistency regularization. The proposed approach leverages a depth map rendered at an unobserved viewpoint to warp sparse input images to that unobserved viewpoint, and imposes them as pseudo ground truths to facilitate the learning of NeRF. By encouraging such geometry-aware consistency at the feature level instead of using a pixel-level reconstruction loss, we regularize NeRF at semantic and structural levels while allowing for modeling view-dependent radiance to account for color variations across viewpoints. We also propose an effective method to filter out erroneous warped solutions, along with training strategies to stabilize training during optimization. We show that our model achieves competitive results compared to state-of-the-art few-shot NeRF models.

1. INTRODUCTION

Recently, representing a 3D scene as a Neural Radiance Field (NeRF) Mildenhall et al. (2020) has proven to be a powerful approach for novel view synthesis and 3D reconstruction Barron et al. (2021); Jain et al. (2021); Chen et al. (2021). However, despite its impressive performance, NeRF requires a large number of densely and well-distributed calibrated images for optimization, which limits its applicability. When limited to sparse observations, NeRF easily overfits to the input view images and is unable to reconstruct correct geometry Zhang et al. (2020). The task that directly addresses this problem, called few-shot NeRF, aims to optimize a high-fidelity neural radiance field in such sparse scenarios Jain et al. (2021); Kim et al. (2022); Niemeyer et al. (2022), countering the under-constrained nature of the problem by introducing additional priors. Specifically, previous works attempted to solve this by utilizing a semantic feature Jain et al. (2021), entropy minimization Kim et al. (2022), SfM depth priors Deng et al. (2022) or normalizing flow Niemeyer et al. (2022), but their necessity for handcrafted methods or inability to extract local and fine structures limited their performance.

To alleviate these issues, we propose a novel regularization technique that enforces geometric consistency across different views with depth-guided warping and geometry-aware consistency modeling. Based on these, we propose a novel framework, called Neural Radiance Fields with Geometric Consistency (GeCoNeRF), for training neural radiance fields in a few-shot setting. Our key insight is that we can leverage the depth rendered by NeRF to warp sparse input images to novel viewpoints, and use them as pseudo ground truths to facilitate the learning of fine details and high-frequency features by NeRF. By encouraging images rendered at novel views to match the warped images with a consistency loss, we can successfully constrain both geometry and appearance to boost the fidelity of neural radiance fields even in a highly under-constrained few-shot setting. Taking into consideration the non-Lambertian nature of given datasets, we propose a feature-level regularization loss that captures contextual and structural information while allowing for view-dependent color differences.

We also present a method to generate a consistency mask to prevent inconsistently warped information from harming the network. Finally, we provide coarse-to-fine training strategies for sampling and pose generation to stabilize optimization of the model. We demonstrate the effectiveness of our method on synthetic and real datasets Mildenhall et al. (2020).
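As a rough illustration of the depth-guided warping described above, the following numpy sketch back-projects every pixel of the unseen view using the depth map rendered by NeRF, transforms the resulting 3D points into a seen view, and samples colors there to form the warped pseudo ground truth. All function and variable names are our own illustration (not the paper's code), nearest-neighbour sampling stands in for the bilinear sampling one would use in practice, and the returned validity mask only marks out-of-bounds projections rather than the paper's full consistency mask.

```python
import numpy as np

def warp_to_novel_view(src_img, novel_depth, K, T_src, T_novel):
    """Inverse-warp a seen image I_i into an unseen viewpoint j using the
    depth map D_j rendered by NeRF at viewpoint j (illustrative sketch).

    src_img:        (H, W, 3) observed input image I_i
    novel_depth:    (H, W)    depth rendered at the unseen viewpoint j
    K:              (3, 3)    shared camera intrinsics
    T_src, T_novel: (4, 4)    camera-to-world poses of views i and j
    Returns the warped image I_{i->j} and a mask of pixels whose
    projections land inside the source image bounds.
    """
    H, W = novel_depth.shape

    # Pixel grid of the unseen view, in homogeneous coordinates.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)

    # Back-project to 3D with the rendered depth (camera frame of view j).
    cam_j = (np.linalg.inv(K) @ pix) * novel_depth.reshape(1, -1)

    # Transform the points from view j's camera frame to view i's.
    cam_j_h = np.concatenate([cam_j, np.ones((1, cam_j.shape[1]))], axis=0)
    cam_i = (np.linalg.inv(T_src) @ T_novel @ cam_j_h)[:3]

    # Project into the seen view and sample colors there
    # (nearest neighbour for brevity; bilinear in practice).
    proj = K @ cam_i
    x = proj[0] / np.clip(proj[2], 1e-8, None)
    y = proj[1] / np.clip(proj[2], 1e-8, None)
    valid = (x >= 0) & (x <= W - 1) & (y >= 0) & (y <= H - 1) & (proj[2] > 0)
    xi = np.clip(np.round(x).astype(int), 0, W - 1)
    yi = np.clip(np.round(y).astype(int), 0, H - 1)

    warped = src_img[yi, xi].reshape(H, W, 3)
    mask = valid.reshape(H, W)
    return warped, mask
```

The rendered image at the unseen viewpoint would then be compared against `warped` under `mask`, with the comparison made on deep features rather than raw pixels so that view-dependent color changes are tolerated.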

Figure 1: Illustration of our consistency modeling pipeline for few-shot NeRF. Given an image I_i and an estimated depth map D_j of the j-th unobserved viewpoint, we warp the image I_i to that novel viewpoint as I_i→j by establishing geometric correspondence between the two viewpoints. Using the warped image as a pseudo ground truth, we encourage the rendered image of the unseen viewpoint, I_j, to be consistent in structure with the warped image, with occlusions taken into consideration.

2. RELATED WORK

Neural radiance fields. Among the most notable approaches to the task of novel view synthesis and 3D reconstruction is the Neural Radiance Field (NeRF) Mildenhall et al. (2020), where photo-realistic images are rendered by a simple MLP architecture. Sparked by its impressive performance, a variety of follow-up studies based on its continuous neural volumetric representation have been prompted, including dynamic and deformable scenes Park et al. (2021); Tretschk et al. (2021); Pumarola et al. (2021); Attal et al. (2021), real-time rendering Yu et al. (2021a); Hedman et al. (2021); Reiser et al. (2021); Müller et al. (2022), self-calibration Jeong et al. (2021) and generative modeling Schwarz et al. (2020); Niemeyer & Geiger (2021); Xu et al. (2021); Deng et al. (2021). Mip-NeRF Barron et al. (2021) eliminates aliasing artifacts by adopting cone tracing with a single multi-scale MLP. In general, most of these works have difficulty in optimizing a single scene with only a few images.

Few-shot NeRF. One key limitation of NeRF is its need for a large number of calibrated views when optimizing neural radiance fields. Some recent works have attempted to address this in the case where only a few observed views of the scene are available. PixelNeRF Yu et al. (2021b) conditions a NeRF on image inputs using local CNN features. This conditional model allows the network to learn scene priors across multiple scenes. Stereo radiance fields Chibane et al. (2021) use local CNN features from input views for scene geometry reasoning, and MVSNeRF Chen et al. (2021) combines a cost volume with a neural radiance field for improved performance. However, pre-training with multi-view images of numerous scenes is essential for these methods to learn reconstruction priors. Other works take the different approach of optimizing NeRF from scratch in few-shot settings: DSNeRF Deng et al. (2022) makes use of depth supervision to optimize a scene with few images. Roessle et al. (2021) also utilizes a sparse depth prior, extending it into a dense depth map with a depth completion module to guide network optimization. On the other hand, there are models that tackle depth prior-free few-shot optimization: DietNeRF Jain et al. (2021) enforces semantic consistency between rendered images from unseen views and seen images. RegNeRF Niemeyer et al. (2022) regularizes the geometry and appearance of patches rendered from unobserved viewpoints. InfoNeRF Kim et al. (2022) constrains the density's entropy in each ray and ensures consistency across rays in the neighborhood. While these methods constrain NeRF into learning more realistic geometry, their regularizations are limited in that they require extensive dataset-specific fine-tuning and that they only provide regularization at a global level in a generalized manner. Improving upon the above works, our method tackles prior-free few-shot optimization without using any depth priors, achieving more local and scene-specific regularization with warping-based consistency modeling.

Self-supervised photometric consistency. In the field of multi-view stereo depth estimation, consistency modeling between stereo images and their warped images has been widely used for self-

