GECONERF: FEW-SHOT NEURAL RADIANCE FIELDS VIA GEOMETRIC CONSISTENCY

Abstract

We present a novel framework to regularize Neural Radiance Fields (NeRF) in a few-shot setting with a geometry-aware consistency regularization. The proposed approach leverages a depth map rendered at an unobserved viewpoint to warp sparse input images to that viewpoint and imposes them as pseudo ground truths to facilitate learning of NeRF. By encouraging such geometry-aware consistency at the feature level instead of using a pixel-level reconstruction loss, we regularize the NeRF at semantic and structural levels while still allowing it to model view-dependent radiance that accounts for color variations across viewpoints. We also propose an effective method to filter out erroneous warped solutions, along with training strategies to stabilize optimization. We show that our model achieves competitive results compared to state-of-the-art few-shot NeRF models.

1. INTRODUCTION

Recently, representing a 3D scene as a Neural Radiance Field (NeRF) Mildenhall et al. (2020) has proven to be a powerful approach for novel view synthesis and 3D reconstruction Barron et al. (2021); Jain et al. (2021); Chen et al. (2021). However, despite its impressive performance, NeRF requires a large number of dense, well-distributed, calibrated images for optimization, which limits its applicability. When limited to sparse observations, NeRF easily overfits to the input view images and is unable to reconstruct correct geometry Zhang et al. (2020). The task that directly addresses this problem, called few-shot NeRF, aims to optimize a high-fidelity neural radiance field in such sparse scenarios Jain et al. (2021); Kim et al. (2022); Niemeyer et al. (2022), countering the under-constrained nature of the problem by introducing additional priors. Specifically, previous works attempted to solve this by utilizing semantic features Jain et al. (2021), entropy minimization Kim et al. (2022), SfM depth priors Deng et al. (2022) or normalizing flows Niemeyer et al. (2022), but their reliance on handcrafted methods or their inability to extract local and fine structures limited their performance.

To alleviate these issues, we propose a novel regularization technique that enforces geometric consistency across different views with depth-guided warping and geometry-aware consistency modeling. Based on these, we propose a novel framework, called Neural Radiance Fields with Geometric Consistency (GeCoNeRF), for training neural radiance fields in a few-shot setting. Our key insight is that we can leverage the depth rendered by NeRF to warp sparse input images to novel viewpoints, and use them as pseudo ground truths to facilitate learning of fine details and high-frequency features by NeRF.
By encouraging images rendered at novel views to model the warped images with a consistency loss, we can constrain both geometry and appearance to boost the fidelity of neural radiance fields even in a highly under-constrained few-shot setting. Taking into consideration the non-Lambertian nature of the given datasets, we propose a feature-level regularization loss that captures contextual and structural information while allowing for modeling of view-dependent color differences. We also present a method to generate a consistency mask to prevent inconsistently warped information from harming the network. Finally, we provide coarse-to-fine training strategies for sampling and pose generation to stabilize optimization of the model. We demonstrate the effectiveness of our method on synthetic and real datasets Mildenhall et al. (2020); Jensen et al. (2014). Experimental results demonstrate the effectiveness of the proposed model over the latest methods for few-shot novel view synthesis.

Figure 1: Illustration of our consistency modeling pipeline for few-shot NeRF. Given an image $I_i$ and the estimated depth map $D_j$ of the $j$-th unobserved viewpoint, we warp the image $I_i$ to that novel viewpoint as $I_{i\to j}$ by establishing geometric correspondence between the two viewpoints. Using the warped image as a pseudo ground truth, we encourage the rendered image of the unseen viewpoint, $I_j$, to be consistent in structure with the warped image, with occlusions taken into consideration.

2. RELATED WORK

Neural radiance fields. Among the most notable approaches to novel view synthesis and 3D reconstruction is Neural Radiance Field (NeRF) Mildenhall et al. (2020), where photo-realistic images are rendered by a simple MLP architecture. Sparked by its impressive performance, a variety of follow-up studies based on its continuous neural volumetric representation have emerged, including dynamic and deformable scenes Park et al. (2021); Tretschk et al. (2021); Pumarola et al. (2021); Attal et al. (2021), real-time rendering Yu et al. (2021a); Hedman et al. (2021); Reiser et al. (2021); Müller et al. (2022), self-calibration Jeong et al. (2021) and generative modeling Schwarz et al. (2020); Niemeyer & Geiger (2021); Xu et al. (2021); Deng et al. (2021). Mip-NeRF Barron et al. (2021) eliminates aliasing artifacts by adopting cone tracing with a single multi-scale MLP. In general, most of these works have difficulty optimizing a single scene from only a few images.

Few-shot NeRF. One key limitation of NeRF is that it needs a large number of calibrated views to optimize a neural radiance field. Some recent works attempt to address the case where only a few observed views of the scene are available. PixelNeRF Yu et al. (2021b) conditions a NeRF on image inputs using local CNN features. This conditional model allows the network to learn scene priors across multiple scenes. Stereo radiance fields Chibane et al. (2021) use local CNN features from input views for scene geometry reasoning, and MVSNeRF Chen et al. (2021) combines a cost volume with a neural radiance field for improved performance. However, these methods must be pre-trained on multi-view images of numerous scenes to learn reconstruction priors.

Other works take the different approach of optimizing NeRF from scratch in few-shot settings: DSNeRF Deng et al. (2022) applies depth supervision to the network to optimize a scene with few images. Roessle et al. (2021) also utilize a sparse depth prior, extending it into a dense depth map with a depth completion module to guide network optimization. On the other hand, there are models that tackle depth prior-free few-shot optimization: DietNeRF Jain et al. (2021) enforces semantic consistency between images rendered from unseen views and the seen images. RegNeRF Niemeyer et al. (2022) regularizes the geometry and appearance of patches rendered from unobserved viewpoints. InfoNeRF Kim et al. (2022) constrains the entropy of the density along each ray and ensures consistency across neighboring rays. While these methods constrain NeRF into learning more realistic geometry, their regularizations are limited in that they require extensive dataset-specific fine-tuning and only regularize at a global level in a generalized manner. Improving upon the above works, our method tackles few-shot optimization without using any depth priors, achieving more local and scene-specific regularization with warping-based consistency modeling.

Self-supervised photometric consistency. In the field of multi-view stereo depth estimation, consistency modeling between stereo images and their warped images has been widely used for self-supervised training Godard et al. (2017); Garg et al. (2016); Zhou et al. (2017). In weakly supervised or unsupervised settings Huang et al. (2021); Khot et al. (2019), where ground truth depth information is lacking, consistency modeling between images with geometry-based warping is used as a supervisory signal Zhou et al. (2017); Huang et al. (2021); Khot et al. (2019), formulating depth learning as a reconstruction task between viewpoints.

Recently, methods utilizing self-supervised photometric consistency have been introduced to NeRF: concurrent works such as NeuralWarp Darmon et al. (2022), StructNeRF Chen et al. (2022) and Geo-NeuS Fu et al. (2022) model photometric consistency between source images and their warped counterparts from other source viewpoints to improve reconstruction quality. However, these methods only discuss dense-view input scenarios where pose differences between source viewpoints are small, and do not address their behavior in few-shot settings, where a sharp performance drop is expected due to the scarcity of input viewpoints and the increased difficulty of warping under large viewpoint differences and heavy self-occlusions. RapNeRF Zhang et al. (2022) uses a geometry-based reprojection method to enhance view extrapolation performance, and Bortolon et al. (2022) use depth rendered by NeRF as correspondence information for a view-morphing module that synthesizes images between input viewpoints. However, these methods do not take occlusions into account, and their pixel-level photometric consistency modeling has the downside of suppressing view-dependent specular effects.

3. PRELIMINARIES

Neural Radiance Field (NeRF) Mildenhall et al. (2020) represents a scene as a continuous function $f_\theta$, a neural network with parameters $\theta$ that is evaluated at points sampled along rays $r$. Typically, the sampled coordinates $x \in \mathbb{R}^3$ and view direction $d \in \mathbb{R}^2$ are transformed by a positional encoding $\gamma$ into Fourier features Tancik et al. (2020) that facilitate learning of high-frequency details. The neural network $f_\theta$ takes as input the transformed coordinate $\gamma(x)$ and viewing direction $\gamma(d)$, and outputs a view-invariant density value $\sigma \in \mathbb{R}$ and a view-dependent color value $c \in \mathbb{R}^3$ such that

$$\{c, \sigma\} = f_\theta(\gamma(x), \gamma(d)). \tag{1}$$

With a ray parametrized as $r_p(t) = o + t d_p$ from the camera center $o$ through the pixel $p$ along direction $d_p$, the color is rendered as follows:

$$C(r_p) = \int_{t_n}^{t_f} T(t)\, \sigma(r_p(t))\, c(r_p(t), d_p)\, dt, \quad \text{where} \quad T(t) = \exp\left(-\int_{t_n}^{t} \sigma(r_p(s))\, ds\right), \tag{2}$$

where $C(r_p)$ is the predicted color value at the pixel $p$ along the ray $r_p(t)$ from $t_n$ to $t_f$, and $T(t)$ denotes the accumulated transmittance along the ray from $t_n$ to $t$. To optimize the network $f_\theta$, the observation loss $L_{obs}$ enforces the rendered color values to be consistent with the ground truth color values $C'(r_p)$:

$$L_{obs} = \sum_{r_p \in R} \left\| C'(r_p) - C(r_p) \right\|_2^2, \tag{3}$$

where $R$ represents a batch of training rays.
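In practice the rendering integral is evaluated numerically. The following is a minimal NumPy sketch of the standard piecewise-constant quadrature, not the authors' implementation; the function names `composite_ray` and `observation_loss`, the use of bin edges `t_edges`, and the use of bin midpoints as sample distances are assumptions made for illustration. The same per-sample weights also yield an expected distance along the ray, which is how Section 4.2 obtains a depth value.

```python
import numpy as np

def composite_ray(sigma, color, t_edges):
    """Approximate the rendering integral for a single ray.

    sigma:   (S,)   densities of the S samples (hypothetical inputs)
    color:   (S, 3) colors of the S samples
    t_edges: (S+1,) distances bounding each sample interval, from t_n to t_f
    """
    delta = np.diff(t_edges)                       # interval lengths
    alpha = 1.0 - np.exp(-sigma * delta)           # opacity of each interval
    # Transmittance T_i: probability that the ray reaches interval i.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha                        # discrete T(t) * sigma(t) dt
    rgb = (weights[:, None] * color).sum(axis=0)   # rendered color C(r_p)
    t_mid = 0.5 * (t_edges[:-1] + t_edges[1:])     # assumed sample distances
    depth = (weights * t_mid).sum()                # expected distance along ray
    return rgb, depth, weights

def observation_loss(pred_rgb, gt_rgb):
    """Sum of squared color errors over a batch of rays, shapes (B, 3)."""
    return float(((gt_rgb - pred_rgb) ** 2).sum())

# Usage: a ray that hits an opaque red surface in its third interval.
sigma = np.array([0.0, 0.0, 1e3, 0.0])
color = np.tile(np.array([1.0, 0.0, 0.0]), (4, 1))
rgb, depth, weights = composite_ray(sigma, color, np.linspace(2.0, 6.0, 5))
```

All of the weight falls on the occupied interval in this usage example, so the rendered color is that interval's color and the rendered depth is its midpoint.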

4.1. MOTIVATION AND OVERVIEW

Let us denote an image at the $i$-th viewpoint as $I_i$. In few-shot novel view synthesis, NeRF is given only a few images $\{I_i\}$ for $i \in \{1, ..., N\}$ with small $N$, e.g., $N = 3$ or $N = 5$. The objective of novel view synthesis is to train the mapping function $f_\theta$ so that it can be used to recover an image $I_j$ at the $j$-th unseen or novel viewpoint. As described above, in the few-shot setting, given $\{I_i\}$, directly optimizing $f_\theta$ solely with the pixel-wise reconstruction loss $L_{obs}$ is limited by its inability to model view-dependent effects, and thus an additional regularization that encourages the network $f_\theta$ to generate consistent appearance and geometry is required.

To achieve this, we propose a novel regularization technique that enforces geometric consistency across different views with depth-guided warping and consistency modeling. We focus on the fact that NeRF Mildenhall et al. (2020) inherently renders not only a color image but a depth image as well. Combined with the known viewpoint difference, the rendered depths can be used to define a geometric correspondence between two arbitrary views. Specifically, we consider a depth image rendered by the NeRF model, $D_j$, at unseen viewpoint $j$. By formulating a warping function $\psi(I_i; D_j, R_{i\to j})$ that warps an image $I_i$ according to the depth $D_j$ and viewpoint difference $R_{i\to j}$, we can encourage consistency between the warped image $I_{i\to j} = \psi(I_i; D_j, R_{i\to j})$ and the rendered image $I_j$ at the $j$-th unseen viewpoint, which in turn improves few-shot novel view synthesis performance. This framework can overcome the limitations of previous approaches in the few-shot setting Mildenhall et al. (2020); Chen et al. (2021); Barron et al. (2021), improving not only global geometry but also high-frequency details and appearance.

In the following, we first explain how input images can be warped to unseen viewpoints in our framework. Then, we describe how we impose consistency upon the pair of warped and rendered images for regularization, followed by an explanation of our occlusion handling method and several training strategies that proved crucial for stabilizing NeRF optimization in the few-shot scenario.

4.2. RENDERED DEPTH-GUIDED WARPING

To render an image at a novel viewpoint, we first sample a random camera viewpoint, from which the corresponding ray vectors are generated in a patch-wise manner. As NeRF outputs density and color values of the points sampled along the novel rays, we use the recovered density values to render a consistent depth map. Following Mildenhall et al. (2020), we formulate the per-ray depth value as a weighted composition of the distances traveled from the origin. Since the ray $r_p$ corresponding to pixel $p$ is parameterized as $r_p(t) = o + t d_p$, the depth rendering is defined similarly to the color rendering:

$$D(r_p) = \int_{t_n}^{t_f} T(t)\, \sigma(r_p(t))\, t\, dt, \tag{4}$$

where $D(r_p)$ is the predicted depth along the ray $r_p$. As described in Figure 1, we use the rendered depth map $D_j$ to warp the input ground truth image $I_i$ to the $j$-th unseen viewpoint and acquire a warped image $I_{i\to j}$, defined as $I_{i\to j} = \psi(I_i; D_j, R_{i\to j})$. More specifically, a pixel location $p_j$ in the target unseen viewpoint image is transformed to $p_{j\to i}$ in the source viewpoint image by the viewpoint difference $R_{j\to i}$ and the camera intrinsic parameter $K$ such that

$$p_{j\to i} \sim K R_{j\to i} D_j(p_j) K^{-1} p_j, \tag{5}$$

where $\sim$ indicates approximate equality and the projected coordinate $p_{j\to i}$ is a continuous value. With a differentiable sampler, we extract the color values of $I_i$ at $p_{j\to i}$. More formally, the warping process can be written as follows:

$$I_{i\to j}(p_j) = \mathrm{sampler}(I_i; p_{j\to i}), \tag{6}$$

where $\mathrm{sampler}(\cdot)$ is a bilinear sampling operator Jaderberg et al. (2015).

Acceleration. Rendering a full image with NeRF volumetric rendering is computationally heavy and extremely time-consuming, requiring tens of seconds for a single iteration. To overcome the computational bottleneck of full image rendering and warping, rays are sampled on a strided grid to form the patch, with stride $s$ set to 2. After the rays undergo volumetric rendering, we upsample the low-resolution depth map back to the original resolution with bilinear interpolation. This full-resolution depth map is used for the inverse warping. This way, detailed full-resolution warped patches can be generated at only a fraction of the computational cost that would be required to render the original-sized ray batch.
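The inverse warping described in this section can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the viewpoint difference is taken to be a 4x4 rigid transform `T_tgt_to_src` from target to source camera coordinates, the depth map is assumed to hold z-depth in a pinhole camera with the +z axis pointing forward, and the names `inverse_warp` and `bilinear_sample` are hypothetical.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinearly sample img (H, W, C) at continuous pixel coords x, y of shape (N,)."""
    H, W = img.shape[:2]
    x = np.clip(x, 0.0, W - 1.0)
    y = np.clip(y, 0.0, H - 1.0)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = img[y0, x0] * (1.0 - wx) + img[y0, x1] * wx
    bottom = img[y1, x0] * (1.0 - wx) + img[y1, x1] * wx
    return top * (1.0 - wy) + bottom * wy

def inverse_warp(src_img, depth_tgt, K, T_tgt_to_src):
    """Warp the source image into the target view using the target depth map.

    Returns the warped image (H, W, C) and a boolean map of target pixels
    whose reprojection lands inside the source image, in front of the camera.
    """
    H, W = depth_tgt.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project target pixels: D_j(p_j) * K^{-1} p_j
    pts_tgt = (pix @ np.linalg.inv(K).T) * depth_tgt.reshape(-1, 1)
    # Move the 3D points into the source camera frame (assumed rigid transform).
    pts_src = pts_tgt @ T_tgt_to_src[:3, :3].T + T_tgt_to_src[:3, 3]
    # Project into the source image to obtain continuous coordinates p_{j->i}.
    proj = pts_src @ K.T
    z = proj[:, 2]
    uv = proj[:, :2] / np.maximum(z, 1e-8)[:, None]
    eps = 1e-6  # tolerance for floating-point error at the image border
    valid = ((z > 0)
             & (uv[:, 0] >= -eps) & (uv[:, 0] <= W - 1 + eps)
             & (uv[:, 1] >= -eps) & (uv[:, 1] <= H - 1 + eps))
    warped = bilinear_sample(src_img, uv[:, 0], uv[:, 1]).reshape(H, W, -1)
    return warped, valid.reshape(H, W)
```

With an identity transform the warp returns the source image unchanged, which is a convenient sanity check. Because the sampling is driven by the target-view depth, errors in that depth translate directly into misplaced source pixels, which is what the later occlusion mask is meant to catch.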

4.3. CONSISTENCY MODELING

Given the rendered patch $I_j$ at the $j$-th viewpoint and the warped patch $I_{i\to j}$ obtained with depth $D_j$ and viewpoint difference $R_{i\to j}$, we define a consistency between the two to provide additional regularization for globally consistent rendering. One viable option is to naïvely apply the pixel-wise image reconstruction loss $L_{pix}$ such that

$$L_{pix} = \left\| I_{i\to j} - I_j \right\|. \tag{7}$$

However, we observe that this simple strategy is prone to failure on reflective non-Lambertian surfaces, where appearance changes greatly with viewpoint Zhan et al. (2018). In addition, geometry-related problems, such as self-occlusion and artifacts, prohibit naïve usage of the pixel-wise image reconstruction loss for regularization at unseen viewpoints.

Feature-level consistency modeling. To overcome these issues, we propose a masked feature-level regularization loss that encourages structural consistency while ignoring view-dependent radiance effects, as illustrated in Figure 2. Given an image $I$ as input, we use a convolutional network to extract multi-level feature maps $f^l_\phi(I) \in \mathbb{R}^{H_l \times W_l \times C_l}$, with channel depth $C_l$ for the $l$-th layer. To measure feature-level consistency between the warped image $I_{i\to j}$ and the rendered image $I_j$, we extract their feature maps from $L$ layers and compute the difference within each pair of feature maps extracted from the same layer. In accordance with the idea of using the warped image $I_{i\to j}$ as a pseudo ground truth, we allow gradients to backpropagate only through the rendered image and block them for the warped image. By applying the consistency loss at multiple levels of feature maps, we cause $I_j$ to model $I_{i\to j}$ on both semantic and structural levels. Formally, the consistency loss $L_{cons}$ is defined as

$$L_{cons} = \sum_{l=1}^{L} \frac{1}{C_l} \left\| f^l_\phi(I_{i\to j}) - f^l_\phi(I_j) \right\|_1. \tag{8}$$

For this loss function $L_{cons}$, we find the $\ell_1$ distance most suited to our task and use it to measure consistency across feature difference maps.
Empirically, we have found that the VGG-19 network Simonyan & Zisserman (2014) yields the best performance in modeling consistencies, likely due to the absence of normalization layers Johnson et al. (2016) that scale down the absolute values of feature differences. Therefore, we employ VGG-19 as our feature extractor network $f_\phi$ throughout all of our models. It should be noted that our loss function differs from that of DietNeRF Jain et al. (2021): while DietNeRF's consistency loss is limited to regularizing the radiance field at a globally semantic level, our loss combined with the warping module is also able to give the network rich information at a local, structural level. In other words, whereas DietNeRF provides only high-level feature consistency, our use of multiple levels of a convolutional network for feature difference calculation can be interpreted as enforcing a mixture of all levels, from high-level semantic consistency to low-level structural consistency.

Occlusion handling. In order to prevent imperfect and distorted warpings caused by erroneous geometry from influencing the model and degrading overall reconstruction quality, we construct a consistency mask $M_l$ that lets NeRF ignore regions with geometric inconsistencies, as demonstrated in Figure 3. Instead of applying the mask to the images before feeding them into the feature extractor network, we apply resized masks $M_l$ directly to the feature maps, after using nearest-neighbor down-sampling to make them match the dimensions of the $l$-th layer outputs. We generate $M$ by measuring consistency between the rendered depth values from the target viewpoint and the source viewpoint such that

$$M(p_j) = \left[\, \left\| D_j(p_j) - D_i(p_{j\to i}) \right\| < \tau \,\right], \tag{9}$$

where $[\cdot]$ is the Iverson bracket, and $p_{j\to i}$ refers to the corresponding pixel in source viewpoint $i$ for the reprojected target pixel $p_j$ of the $j$-th viewpoint.
Here we measure the Euclidean distance between the depth points rendered from the target and source viewpoints as the criterion for threshold masking. As illustrated in Figure 4, if the distance between the two points is greater than a given threshold value $\tau$, we conclude that the two rays render depths of separate surfaces and mask out the corresponding pixel in the rendered image $I_j$. The process takes place over every pixel of $I_j$ to generate a mask $M$ of the same size as the rendered patch. Through this technique, we filter out problematic solutions at the feature level and regularize NeRF with only high-confidence image features. Based on this, the consistency loss $L_{cons}$ is extended as

$$L^M_{cons} = \sum_{l=1}^{L} \frac{1}{C_l m_l} \left\| M_l \odot \left( f^l_\phi(I_{i\to j}) - f^l_\phi(I_j) \right) \right\|_1, \tag{10}$$

where $m_l$ is the sum of the non-zero values of $M_l$.

Edge-aware disparity regularization. Since our method depends on the quality of the depth rendered by NeRF, we directly impose additional regularization on the rendered depth to facilitate optimization. We further encourage local depth smoothness in rendered scenes by imposing an $\ell_1$ penalty on the disparity gradient within randomly sampled patches of the input views. In addition, inspired by Godard et al. (2017), we take into account the fact that discontinuities in depth maps are likely to be aligned with gradients of the color image, and introduce an edge-aware term with image gradients $\partial I$ to weight the disparity values. Specifically, following Godard et al. (2017), we regularize for edge-aware depth smoothness such that

$$L_{reg} = |\partial_x D^*_i|\, e^{-|\partial_x I_i|} + |\partial_y D^*_i|\, e^{-|\partial_y I_i|}, \tag{11}$$

where $D^*_i = D_i / \overline{D}_i$ is the mean-normalized inverse depth from Godard et al. (2017), used to discourage shrinking of the estimated depth.
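The three ingredients of this section (threshold mask, masked feature-level loss, and edge-aware smoothness) can be sketched in NumPy as below. This is a hedged illustration, not the authors' implementation: feature maps are passed in as plain arrays rather than computed by VGG-19, the two depth arrays given to `consistency_mask` are assumed to be already expressed in a comparable frame, the guard against an empty mask and the averaging over pixels in `edge_aware_smoothness` are assumptions, and all function names are hypothetical.

```python
import numpy as np

def consistency_mask(depth_tgt, depth_src_at_reproj, tau):
    """Iverson-bracket mask: 1 where the two depths agree within tau, else 0."""
    return (np.abs(depth_tgt - depth_src_at_reproj) < tau).astype(np.float64)

def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of an (H, W) mask to (out_h, out_w)."""
    H, W = mask.shape
    rows = np.arange(out_h) * H // out_h
    cols = np.arange(out_w) * W // out_w
    return mask[rows[:, None], cols[None, :]]

def masked_feature_loss(feats_warped, feats_rendered, mask):
    """Masked l1 feature loss summed over layers.

    feats_*: lists of (H_l, W_l, C_l) arrays taken from the same L layers.
    In an autodiff framework the warped branch would be wrapped in a
    stop-gradient (e.g. jax.lax.stop_gradient) so that it acts as a
    pseudo ground truth.
    """
    total = 0.0
    for fw, fr in zip(feats_warped, feats_rendered):
        H_l, W_l, C_l = fr.shape
        M_l = resize_nearest(mask, H_l, W_l)
        m_l = max(M_l.sum(), 1.0)  # assumption: avoid division by zero
        total += np.abs(M_l[..., None] * (fw - fr)).sum() / (C_l * m_l)
    return float(total)

def edge_aware_smoothness(disp, img):
    """Edge-aware l1 smoothness on mean-normalized disparity.

    disp: (H, W) inverse depth, img: (H, W, 3) color image.
    Averaging over pixels is an assumption of this sketch.
    """
    d = disp / (disp.mean() + 1e-8)
    dx_d = np.abs(d[:, 1:] - d[:, :-1])
    dy_d = np.abs(d[1:, :] - d[:-1, :])
    dx_i = np.abs(img[:, 1:] - img[:, :-1]).mean(axis=-1)
    dy_i = np.abs(img[1:, :] - img[:-1, :]).mean(axis=-1)
    return float((dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean())
```

Applying the mask to the feature maps rather than to the input images keeps masked-out pixels from producing artificial edges that the feature extractor would otherwise respond to.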

4.4. TRAINING STRATEGY

In this section, we present novel training strategies to learn the model with the proposed losses.

Total losses. We optimize our model with a combined final loss consisting of the original NeRF's pixel-wise reconstruction loss $L_{obs}$ and two types of regularization loss: $L^M_{cons}$ for unobserved view consistency modeling and $L_{reg}$ for disparity regularization.

Progressive camera pose generation. The difficulty of accurate warping increases the further the target view is from the source view, which means that sampling far camera poses from the very beginning of training may have negative effects on our model. Therefore, we first generate camera poses near the source views, then progressively further away as training proceeds. We sample a noise value uniformly from the interval $[-\beta, +\beta]$ and add it to the original Euler rotation angles of the input view poses, with the parameter $\beta$ growing linearly from 3 to 9 degrees over the course of optimization. This design choice can be intuitively understood as stabilizing locations near observed viewpoints at the start and propagating this regularization to locations further away, where warping becomes progressively more difficult.

Positional encoding frequency annealing. We find that most of the artifacts that occur are high-frequency occlusions that fill the space between the scene and the camera. This behaviour can be effectively suppressed by constraining the Fourier positional encoding Tancik et al. (2020) to low frequencies. For this reason, we adopt the coarse-to-fine frequency annealing strategy previously used by Park et al. (2021) to regularize our optimization.
This strategy forces our network to first optimize coarse, low-frequency details, where self-occlusions and fine features are minimized, easing the difficulty of the warping process in the early stages of training. Following Park et al. (2021), the annealing equation is $\alpha(t) = mt/K$, with $m$ as the number of encoding frequencies and $t$ as the iteration step, and we set the hyper-parameter $K$ to 15k.
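Both schedules are simple functions of the iteration step, as the following NumPy sketch shows. It is an illustration under assumptions: the function names are hypothetical, $\beta$ is taken to grow linearly over the full training run, and the cosine-eased per-frequency window applied on top of $\alpha(t)$ is the form recalled from Park et al. (2021), since the text above gives only $\alpha(t)$ itself.

```python
import numpy as np

def beta_schedule(step, total_steps, beta_start=3.0, beta_end=9.0):
    """Half-width (degrees) of the pose-noise interval, growing linearly."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return beta_start + (beta_end - beta_start) * frac

def sample_pose_noise(rng, step, total_steps):
    """Uniform noise in [-beta, +beta] for each of the three Euler angles."""
    beta = beta_schedule(step, total_steps)
    return rng.uniform(-beta, beta, size=3)

def frequency_window(step, num_freqs, K=15000):
    """Per-frequency weights in [0, 1] for annealed positional encoding.

    alpha(t) = m * t / K as in the text; the cosine easing is an assumption
    of this sketch. Band j is fully enabled once alpha >= j + 1.
    """
    alpha = num_freqs * step / K
    j = np.arange(num_freqs)
    return 0.5 * (1.0 - np.cos(np.pi * np.clip(alpha - j, 0.0, 1.0)))

# Usage: noise for a novel pose and encoding weights halfway through annealing.
rng = np.random.default_rng(0)
noise_deg = sample_pose_noise(rng, step=30000, total_steps=60000)
window = frequency_window(step=7500, num_freqs=10)
```

The window multiplies the sine and cosine features of each frequency band, so at step 0 only the raw coordinates carry signal and all bands are active from step $K$ onward.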

5.1. EXPERIMENTAL SETTINGS

Baselines. We use mip-NeRF Barron et al. (2021) as our backbone. We compare against this baseline and several state-of-the-art models for few-shot NeRF: InfoNeRF Kim et al. (2022), DietNeRF Jain et al. (2021), and RegNeRF Niemeyer et al. (2022).

Datasets and metrics. We evaluate our model on NeRF-Synthetic Mildenhall et al. (2020) and LLFF Mildenhall et al. (2019). NeRF-Synthetic is a realistically rendered 360° synthetic dataset comprising 8 scenes. We randomly sample 3 viewpoints out of the 100 training images in each scene, and use 200 testing images for evaluation. We also conduct experiments on the LLFF benchmark dataset, which consists of real-life forward-facing scenes. Following RegNeRF Niemeyer et al. (2022), we apply the standard setting of taking every 8th image as the test set and selecting 3 reference views from the remaining images. We quantify novel view synthesis quality using PSNR, the Structural Similarity Index Measure (SSIM) Wang et al. (2004), the LPIPS perceptual metric Zhang et al. (2018) and the "average" error metric introduced in Barron et al. (2021), and report the mean value of each metric over all scenes in each dataset.

Implementation details. Our main model is built on top of the JAX mip-NeRF codebase Barron et al. (2021). We use the Adam optimizer with an exponential learning rate decay. Our model is trained for 60k iterations, which takes 4 hours on two NVIDIA RTX 3090 Ti GPUs. We provide more implementation details in the supplementary materials.

Table 1: Quantitative comparison on NeRF-Synthetic (Mildenhall et al., 2020) and LLFF (Mildenhall et al., 2019) datasets.

Methods | NeRF-Synthetic: PSNR ↑ | SSIM ↑ | LPIPS ↓ | Avg. ↓ | LLFF: PSNR ↑ | SSIM ↑ | LPIPS ↓ | Avg. ↓
NeRF (Mildenhall et al., 2020) | 14.73 | 0.734 | 0.451 | 0.199 | 13.34 | 0.373 | 0.451 | 0.255
mip-NeRF (Barron et al., 2021) | 17… | … | … | … | … | … | … | …
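As a small worked example of the "average" error metric, the sketch below assumes the definition from Barron et al. (2021): the geometric mean of $\mathrm{MSE} = 10^{-\mathrm{PSNR}/10}$, $\sqrt{1 - \mathrm{SSIM}}$ and LPIPS. The function name `average_error` is hypothetical. Plugging in the NeRF row of Table 1 reproduces the reported Avg. values to three decimals, which supports reading the metric this way.

```python
import numpy as np

def average_error(psnr, ssim, lpips):
    """Geometric mean of MSE (from PSNR), sqrt(1 - SSIM) and LPIPS."""
    mse = 10.0 ** (-psnr / 10.0)
    dssim = np.sqrt(1.0 - ssim)
    return float(np.exp(np.mean(np.log([mse, dssim, lpips]))))

# NeRF row of Table 1.
avg_synthetic = average_error(14.73, 0.734, 0.451)  # rounds to 0.199
avg_llff = average_error(13.34, 0.373, 0.451)       # rounds to 0.255
```

Because it is a geometric mean, the metric rewards balanced improvement: a gain in PSNR alone moves it less than a comparable relative gain across all three components.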

5.2. COMPARISONS

Qualitative comparisons. Qualitative comparison results in Figures 5 and 6 demonstrate that our model shows superior performance to the baseline mip-NeRF Barron et al. (2021) and the previous state-of-the-art model, RegNeRF Niemeyer et al. (2022), in 3-view settings. We observe that our warping-based consistency enables GeCoNeRF to capture fine details that mip-NeRF and RegNeRF struggle to capture in the same sparse-view scenarios, as demonstrated by the mic scene. Our method also displays higher stability in rendering smooth surfaces and reducing background artifacts in comparison to previous models, as shown in the results for the materials scene. We argue that these results demonstrate how our method, through the generation of warped pseudo ground truth patches, is able to give the model local, scene-specific regularization that aids recovery of fine details, which previous few-shot NeRF models with their global, generalized priors were unable to accomplish.

Quantitative comparisons. The comparisons in Table 1 show our model's competitive results on the LLFF dataset, where its PSNR shows a large increase over the mip-NeRF baseline and is competitive with RegNeRF. We see that our warping-based consistency modeling successfully prevents overfitting and artifacts, which allows our model to perform better quantitatively.

Progressive training strategies. In Table 3, we justify our progressive training strategies with additional experiments on the NeRF-Synthetic dataset, whereas the main ablation only covers progressive annealing. For pose generation, we sample the pose angle from a large interval from the beginning, instead of slowly growing the interval. For positional encoding, we replace progressive annealing with the naïve positional encoding used in NeRF. We observe that their absence causes destabilization of the model and degradation in appearance, respectively.

Feature-level loss vs. pixel-level loss. In Table 4, we conduct a quantitative ablation comparing the feature-level consistency loss $L_{cons}$ with the pixel-level photometric consistency loss $L_{pix}$, both with occlusion masking. As shown in Figure 8, naïvely applying a pixel-level loss for consistency modeling leads to broken geometry.
This phenomenon can be attributed to $L_{pix}$ being agnostic to view-dependent specular effects, which the network tries to model by altering or altogether erasing non-Lambertian surfaces.

The result of consistency modeling between known views, shown in (a) of Figure 9, displays divergent behaviours such as heavy artifact generation, while our method (b) succeeds in recovering detailed geometry of the scene under the same setting. As discussed in Section 2, we argue that large view differences and the scarcity of reference images make it difficult for NeRF to refine geometry with consistency modeling between known views. Our work's novel contributions allow consistency modeling to be adopted in few-shot NeRF and facilitate stable training under such extreme conditions, distinguishing our work from the above methods.

6. CONCLUSION

We present GeCoNeRF, a novel few-shot NeRF regularization method. We regularize geometry by modeling feature-level consistency at unobserved viewpoints using warped input images, which guides NeRF toward learning robust geometry. The further techniques and training strategies we propose have a stabilizing effect and facilitate optimization of our network. Our experimental evaluation demonstrates that our method is competitive with other state-of-the-art models.




Figure 2: Illustration of the proposed framework. GeCoNeRF regularizes the networks with consistency modeling. The consistency loss function $L^M_{cons}$ is applied between the unobserved viewpoint image and the warped observed viewpoint image, while the disparity regularization loss $L_{reg}$ regularizes depth at seen viewpoints.

Figure 3: Visualization of consistency modeling process. (a) ground truth patch, (b) rendered patch at novel viewpoint, (c) warped patch, from input viewpoint to novel viewpoint, (d) occlusion mask with threshold masking, and (e) final warped patch with occlusion masking at novel viewpoint.

Figure 4: Occlusion-aware mask generation. The mask is generated by comparing geometry between novel view $j$ and source view $i$, with $I_{i\to j}$ being the warped patch generated for view $j$. For (a) and (b), warping does not occur correctly due to artifacts and self-occlusion, respectively. Such pixels are masked out by $M_l$, allowing only (c), with accurate warping, to serve as a training signal for the rendered image $I_j$.

Figure 5: Qualitative comparison on NeRF-Synthetic Mildenhall et al. (2020). In the 3-view setting, our method captures fine details more robustly (such as the wire in the mic scene) and produces fewer artifacts (background in the materials scene) compared to previous methods. We show GeCoNeRF's results (e) with its rendered depth (f).

Figure 6: Qualitative results on LLFF Mildenhall et al. (2019). Comparison with the baseline mip-NeRF shows that our model learns coherent depth and geometry in the extremely sparse 3-view setting.

Figure 7: Qualitative ablation. Our qualitative ablation results on the Horns scene show the contribution of each module to the performance of our model in the 3-view scenario.

Figure 8: $L_{pix}$ vs. $L_{cons}$ comparison.

Figure 9: Consistency between known views vs. our method.

Ablation study.

