NERF-SOS: ANY-VIEW SELF-SUPERVISED OBJECT SEGMENTATION ON COMPLEX SCENES

Abstract

Neural volumetric representations have shown that Multi-Layer Perceptrons (MLPs) can be optimized with multi-view calibrated images to represent scene geometry and appearance without explicit 3D supervision. Object segmentation can enrich many downstream applications based on the learned radiance field. However, introducing hand-crafted segmentation to define regions of interest in a complex real-world scene is non-trivial and expensive, as it requires per-view annotation. This paper explores self-supervised learning for object segmentation using NeRF on complex real-world scenes. Our framework, called NeRF with Self-supervised Object Segmentation (NeRF-SOS), couples object segmentation and the neural radiance field to segment objects in any view within a scene. By proposing a novel collaborative contrastive loss at both the appearance and geometry levels, NeRF-SOS encourages NeRF models to distill compact, geometry-aware segmentation clusters from their density fields and from self-supervised pre-trained 2D visual features. The self-supervised object segmentation framework can be applied to various NeRF models, yielding both photo-realistic renderings and convincing segmentation maps for indoor and outdoor scenarios. Extensive results on the LLFF, BlendedMVS, CO3Dv2, and Tanks and Temples datasets validate the effectiveness of NeRF-SOS: it consistently surpasses 2D-based self-supervised baselines and predicts finer object masks than existing supervised counterparts.

1. INTRODUCTION

Scene modeling and representation are essential to the computer vision community. For instance, portable Augmented Reality (AR) devices such as the Magic Leap One can reconstruct scene geometry and localize users (DeChicchis, 2020), but they often struggle to comprehend the surrounding objects. This limitation poses challenges when designing interactions between humans and the environment. Although human-annotated data from diverse environments could mitigate the hurdles of understanding and segmenting surrounding objects, collecting such data is costly and time-consuming. Therefore, there is growing interest in developing intelligent geometry modeling frameworks that can learn with unsupervised or self-supervised techniques. Recently, neural volumetric rendering techniques, such as the neural radiance field (NeRF) and its variants (Mildenhall et al., 2020a; Zhang et al., 2020; Barron et al., 2021), have demonstrated exceptional performance in scene reconstruction, utilizing multi-layer perceptrons (MLPs) and calibrated multi-view images to generate fine-grained unseen views. While several recent works have explored scene understanding with these techniques (Vora et al., 2021; Yang et al., 2021; Zhi et al., 2021), they often require either dense view annotations to train a heavy 3D backbone for capturing semantic representations (Vora et al., 2021; Yang et al., 2021), or human intervention to provide sparse semantic labels (Zhi et al., 2021). Although recent self-supervised object discovery approaches on neural radiance fields (Yu et al., 2021c; Stelzner et al., 2021) have been effective in decomposing objects on synthetic indoor data, a significant gap remains in applying these approaches to complex real-world scenarios.

Figure 1: Visual examples. From left to right: ground-truth color images, annotated object masks, object masks rendered by NeRF-SOS, 2D image co-segmentation using DINO (Amir et al., 2021), and object masks rendered by Semantic-NeRF (Zhi et al., 2021), respectively. NeRF-SOS outperforms the previous methods by generating object masks with more precise local details.

In contrast to previous works, we investigate a more generic setting, using general NeRF models to segment 3D objects in real-world scenes. We propose a new self-supervised object segmentation framework for NeRF that utilizes a collaborative contrastive loss. Our approach combines features from a self-supervised pre-trained 2D backbone ("appearance level") with knowledge distilled from the geometry cues of a scene, using the density field of NeRF representations ("geometry level"). To be more specific, we adopt a self-supervised approach to learn from a pre-trained 2D feature extractor, such as DINO-ViT (Caron et al., 2021), and incorporate inter-view visual correlations to generate distinct segmentation feature clusters within the NeRF framework. We introduce a geometry-level contrastive loss, formulating a geometric correlation volume between NeRF's density field and the segmentation clusters so that the learned feature clusters become aware of scene geometry. Our proposed self-supervised object segmentation framework tailored for NeRF, dubbed NeRF-SOS, serves as a general implicit framework and can be applied to existing NeRF models with end-to-end training. We implement and evaluate NeRF-SOS using vanilla NeRF (Mildenhall et al., 2020a) for a real-world forward-facing dataset (LLFF (Mildenhall et al., 2019)) and object-centric datasets (BlendedMVS (Yao et al., 2020) and CO3Dv2 (Reizenstein et al., 2021)), and using NeRF++ (Zhang et al., 2020) for an outdoor unbounded dataset (Tanks and Temples (Riegler & Koltun, 2020)).
Experiments show that NeRF-SOS significantly outperforms existing object discovery methods and produces view-consistent segmentation clusters; a few examples are shown in Figure 1. We summarize the main contributions as follows:

• We explore how to effectively apply self-supervised 2D visual features to 3D representations through an appearance contrastive loss, which forms compact feature clusters that allow any-view object segmentation in complex real-world scenes.

• We propose a new geometry contrastive loss for object segmentation. By leveraging the density field, our framework injects scene geometry into the segmentation field, making the learned segmentation clusters geometry-aware.

• The proposed collaborative contrastive framework can be implemented upon NeRF and NeRF++, covering object-centric, indoor, and unbounded real-world scenarios. Experiments show that our self-supervised object segmentation quality consistently surpasses 2D object discovery methods and even yields finer segmentation results than the supervised NeRF counterpart (Zhi et al., 2021).

2. RELATED WORK

Neural Radiance Fields. NeRF was first proposed by Mildenhall et al. (2020b); it models the underlying 3D scene as continuous volumetric fields of color and density via MLP layers. The input to a NeRF is a 5D vector, containing a 3D location (x, y, z) and a 2D viewing direction (θ, ϕ). Many follow-up works have emerged to address its limitations and improve performance, including unbounded scene training (Zhang et al., 2020; Barron et al., 2021), fast training (Sun et al., 2021; Deng et al., 2021), efficient inference (Rebain et al., 2020; Liu et al., 2020; Lindell et al., 2020; Garbin et al., 2021; Reiser et al., 2021; Yu et al., 2021a; Lombardi et al., 2021), better generalization (Schwarz et al., 2020a; Trevithick & Yang, 2020; Wang et al., 2021b; Chan et al., 2020; Yu et al., 2021b; Johari et al., 2021; Varma T et al., 2022), unconstrained scenes (Martin-Brualla et al., 2020; Chen et al., 2021; Xu et al., 2022), editing (Liu et al., 2021; Jiakai et al., 2021; Wang et al., 2021a; Jang & Agapito, 2021; Kundu et al., 2022; Fan et al., 2022), and multi-task learning (Zhi et al., 2021). In this paper, we treat NeRF as a powerful implicit scene representation and study how to segment objects in a complex real-world scene without any supervision.

Object Co-segmentation without Explicit Learning. Our work aims to discover and segment visually similar objects in the radiance field and render novel views with object masks. It is close to object co-segmentation (Rother et al., 2006), which aims to segment the common objects from a set of images (Li et al., 2018).
Object co-segmentation has been widely adopted in computer vision and computer graphics applications, including browsing photo collections (Rother et al., 2006), 3D reconstruction (Kowdle et al., 2010), semantic segmentation (Shen et al., 2017), interactive image segmentation (Rother et al., 2006), object-based image retrieval (Vicente et al., 2011), and video object tracking/segmentation (Rother et al., 2006). Rother et al. (2006) first showed that jointly segmenting two images outperforms segmenting them independently, an idea analogous to the contrastive learning used in later approaches. In particular, Hénaff et al. (2022) propose a self-supervised segmentation framework using object discovery networks; Siméoni et al. (2021) localize objects with a self-supervised transformer; and Hamilton et al. (2022) introduce feature correspondences that distinguish between different classes. Most recently, a co-segmentation framework based on DINO features (Amir et al., 2021) achieves better results on object and part co-segmentation. However, extending 2D object discovery to NeRF is non-trivial, as these methods cannot learn the geometric cues in multi-view images. uORF (Yu et al., 2021c) and ObSuRF (Stelzner et al., 2021) use slot-based CNN encoders and object-centric latent codes for unsupervised 3D scene decomposition. COLF (Smith et al., 2022) proposes a light field compositor module to accelerate NeRF-based object decomposition. Although these methods enable unsupervised 3D scene segmentation and novel view synthesis, their experiments are limited to synthetic datasets with pre-defined categories, leaving a gap for complex real-world applications. NVOS (Ren et al., 2022) leverages users' scribbles for weakly supervised object segmentation. The concurrent work RFP (Liu et al., 2022) enables label-free object segmentation in real-world scenes with a propagation strategy.
Panoptic Neural Fields (Kundu et al., 2022) represents each object instance with a separate MLP, supervised by other models. N3F (Tschernezki et al., 2022) minimizes the distance between NeRF's rendered features and 2D features for scene editing. Most recently, DFFs (Kobayashi et al., 2022) distill visual features from the supervised CLIP-LSeg or the self-supervised DINO into a 3D feature field via an element-wise feature distance loss, which can discover an object given a query text prompt or a patch. In contrast, we design a new collaborative contrastive loss at both the appearance and geometry levels to find objects with similar appearance and location without any annotations. The collaborative design is general and can be plugged into different NeRF models.

3. METHOD

Overview. This paper presents an extension to existing NeRF models that enables object segmentation. As shown in Figure 2, we augment NeRF models by appending a parallel segmentation branch that predicts point-wise implicit segmentation features. Specifically, NeRF-SOS renders depth (from the density σ), segmentation (s), and color (c). We then use a self-supervised pre-trained network (such as DINO-ViT (Caron et al., 2021)) to generate a feature tensor (f) from the rendered color patch (c), and construct an appearance-segmentation correlation volume between f and s. Similarly, we instantiate a geometry-segmentation correlation volume using σ and s. By generating positive/negative pairs from different views, we distill the correlation patterns of both the visual features and the scene geometry into the compact segmentation field s. During inference, we use a clustering operation (such as K-means) on the rendered feature field to generate object masks.

3.1. PRELIMINARIES

Neural Radiance Fields. NeRF (Mildenhall et al., 2020a) represents 3D scenes as radiance fields via several MLP layers, where each point has a color and density value. Such a radiance field can be formulated as $F: (\mathbf{x}, \theta) \rightarrow (\mathbf{c}, \sigma)$, where $\mathbf{x} \in \mathbb{R}^3$ is the spatial coordinate, $\theta \in [-\pi, \pi]^2$ denotes the viewing direction, and $\mathbf{c} \in \mathbb{R}^3$, $\sigma \in \mathbb{R}^+$ represent the RGB color and density, respectively. To form an image, NeRF traces a ray $\mathbf{r} = (\mathbf{o}, \mathbf{d}, \theta)$ for each pixel on the image plane, where $\mathbf{o} \in \mathbb{R}^3$ denotes the position of the camera, $\mathbf{d} \in \mathbb{R}^3$ is the direction of the ray, and $\theta \in [-\pi, \pi]^2$ is the angular viewing direction. NeRF then evenly samples $K$ points $\{t_k\}_{k=1}^{K}$ between the near and far bounds $[t_n, t_f]$ along the ray, adopts volumetric rendering, and numerically evaluates the ray integral (Max, 1995) by the quadrature rule:

$$\hat{C}(\mathbf{r}) = \sum_{k=1}^{K} T(k)\,(1 - \exp(-\sigma_k \delta_k))\,\mathbf{c}_k, \quad \text{where } T(k) = \exp\Big(-\sum_{l=1}^{k-1} \sigma_l \delta_l\Big), \tag{1}$$

where $\delta_k = t_{k+1} - t_k$ are the intervals between sampled points and $(\mathbf{c}_k, \sigma_k) = F(\mathbf{o} + t_k \mathbf{d}, \theta)$ are the outputs of the neural network. With this forward model, NeRF optimizes the photometric loss between rendered ray colors and ground-truth pixel colors:

$$\mathcal{L}_{\text{photometric}} = \sum_{(\mathbf{r}, C) \in \mathcal{R}} \|\hat{C}(\mathbf{r}) - C\|_2^2,$$

where $\mathcal{R}$ is a dataset collecting all pairs of rays and ground-truth colors from the captured images.
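As a concrete illustration, the quadrature rule of Equation 1 (and the depth accumulation used later in Section 3.3) can be sketched in a few lines of NumPy. This is a hedged sketch: the function name and array layout are ours, not the paper's.

```python
import numpy as np

def volume_render(sigmas, colors, t_vals):
    """Numerically evaluate the NeRF quadrature rule along one ray.

    sigmas: (K,) densities at the K sampled points
    colors: (K, 3) RGB colors at the K sampled points
    t_vals: (K+1,) sample depths, so delta_k = t_vals[k+1] - t_vals[k]
    Returns the composited ray color C(r) and accumulated depth D(r).
    """
    deltas = t_vals[1:] - t_vals[:-1]                  # delta_k
    alphas = 1.0 - np.exp(-sigmas * deltas)            # per-bin opacity
    # T(k) = exp(-sum_{l<k} sigma_l * delta_l): accumulated transmittance
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))
    weights = trans * alphas                           # T(k) * (1 - exp(-sigma_k delta_k))
    color = (weights[:, None] * colors).sum(axis=0)    # C(r)
    depth = (weights * t_vals[:-1]).sum()              # D(r), cf. Eq. 6
    return color, depth
```

For example, a ray whose first sample is fully opaque composites to that sample's color with depth equal to its `t` value, since all later samples receive zero transmittance.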

3.2. CROSS VIEW APPEARANCE CORRESPONDENCE

Semantic Correspondence across Views. Numerous works have explored and demonstrated the importance of object appearance when generating compact feature correspondences across views (Hénaff et al., 2022; Li et al., 2018). This property is exploited in self-supervised 2D semantic segmentation frameworks (Hénaff et al., 2022; Li et al., 2018; Chen et al., 2020), which generate semantic representations by selecting positive and negative pairs with either random or KNN-based rules (Hamilton et al., 2022). Drawing inspiration from these prior arts, we construct a visual feature correspondence for NeRF at the appearance level using a heuristic rule. To be more specific, we leverage a self-supervised model (e.g., DINO-ViT (Caron et al., 2021)) learned from 2D image sets and distill its rich representations into compact and distinct segmentation clusters. A four-layer MLP, parallel to the density and appearance branches, is appended to segment objects in the radiance field. During training, we first render multiple image patches from different viewpoints using Equation 1, then feed each patch into DINO-ViT to generate feature tensors of shape $H' \times W' \times C'$. These are used to build an appearance correspondence volume (Teed & Deng, 2020; Hamilton et al., 2022) across views, measuring the similarity between regions of two different views:

$$F_{hwh'w'} = \sum_{c} \frac{f_{chw}}{|f_{hw}|} \cdot \frac{f'_{ch'w'}}{|f'_{h'w'}|}, \tag{2}$$

where $f$ and $f'$ are the DINO features extracted from two random patches in different views, $(h, w)$ and $(h', w')$ denote spatial locations on the feature tensors of $f$ and $f'$, respectively, and $c$ traverses the feature channel dimension.

Distilling Semantic Correspondence into the Segmentation Field. The correspondence volume $F$ built from DINO features has been shown to be useful for unsupervised semantic segmentation (Hamilton et al., 2022). We next explore how to learn a segmentation field $s$ by leveraging $F$.
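The correspondence volume of Equation 2 is simply an all-pairs cosine similarity between the spatial locations of two feature tensors. A minimal NumPy sketch (function name and tensor layout are ours):

```python
import numpy as np

def correspondence_volume(f, f2):
    """Cross-view feature correspondence volume (cf. Eq. 2).

    f, f2: (C, H, W) feature tensors from two rendered patches
    (e.g., DINO-ViT features). Each spatial location is L2-normalized
    over the channel dimension; then all pairs of locations are
    compared by cosine similarity.
    Returns F of shape (H, W, H, W) with entries in [-1, 1].
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=0, keepdims=True) + 1e-8)
    fa, fb = normalize(f), normalize(f2)
    # sum over channels c of fa[c, h, w] * fb[c, h', w']
    return np.einsum('chw,cij->hwij', fa, fb)
```

Feeding the same tensor twice yields a volume whose "diagonal" entries F[h, w, h, w] are 1, since every location is perfectly similar to itself.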
Inspired by CRFs and STEGO (Hamilton et al., 2022), which refine initial predictions using color- or feature-correlated regions in the 2D image, we append an extra segmentation branch to predict the segmentation field, and formulate a segmentation correspondence volume from its predicted segmentation logits using the same rule as Equation 2. We then construct the appearance-segmentation correlation, which aims to pull the elements of $s$ and $s'$ closer whenever $f$ and $f'$ are tightly coupled, where the prime indicates a different view. The correlation is realized as an element-wise multiplication between $S$ and $F$, yielding the appearance contrastive loss $\mathcal{L}_{app}$:

$$\mathcal{C}_{app}(\mathbf{r}, b) = -\sum_{hwh'w'} (F_{hwh'w'} - b)\, S_{hwh'w'}, \tag{3}$$

$$\mathcal{L}_{app} = \lambda_{id}\, \mathcal{C}_{app}(\mathbf{r}_{id}, b_{id}) + \lambda_{neg}\, \mathcal{C}_{app}(\mathbf{r}_{neg}, b_{neg}), \tag{4}$$

where $S_{hwh'w'} = \sum_c \frac{s_{chw}}{|s_{hw}|} \cdot \frac{s'_{ch'w'}}{|s'_{h'w'}|}$ is the segmentation correspondence volume between two views, $\mathbf{r}$ is the cast ray fed into NeRF, and $b$ is a hyper-parameter controlling the positive and negative pressure. $\lambda_{id}$ and $\lambda_{neg}$ weight the loss for identity (positive) pairs and distinct (negative) pairs, respectively. The intuition is that minimizing $\mathcal{L}_{app}$ with respect to $S$ enforces entries of the segmentation field $s$ to be large where $F - b$ is positive, and pushes them to be small where $F - b$ is negative.
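The two equations above reduce to a signed, thresholded sum of products between the two correspondence volumes. A hedged sketch, assuming both volumes are precomputed as dense arrays; the threshold values shown here are illustrative, not the paper's:

```python
import numpy as np

def contrastive_correlation(F_vol, S_vol, b):
    """C(r, b) = -sum over all location pairs of (F - b) * S (cf. Eq. 3)."""
    return -np.sum((F_vol - b) * S_vol)

def appearance_loss(F_pos, S_pos, F_neg, S_neg,
                    b_id=0.3, b_neg=0.1, lam_id=1.0, lam_neg=1.0):
    """L_app (cf. Eq. 4): attract identity (positive) pairs, repel negatives.

    F_*: appearance correspondence volumes (cosine similarities),
    S_*: segmentation correspondence volumes, for positive/negative
    patch pairs. b_id and b_neg control the positive/negative pressure.
    """
    return (lam_id * contrastive_correlation(F_pos, S_pos, b_id)
            + lam_neg * contrastive_correlation(F_neg, S_neg, b_neg))
```

Gradient descent on this quantity increases S where F exceeds the threshold b and decreases it elsewhere, which is exactly the pull/push behavior described in the text.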

3.3. CROSS VIEW GEOMETRY CORRESPONDENCE

Constructing the appearance-segmentation correlation clusters features with similar appearance together. However, the appearance cue alone may cause spatial discontinuities, as DINO-ViT tends to capture semantic parts rather than spatially smooth clusters (see Figure 7). We therefore propose a geometric correlation volume that penalizes discontinuities between neighboring points, formulating an attractive/repulsive force from point-wise distances.

Geometry Correspondence across Views. Apart from distilling visual features from DINO into the segmentation field $s$, we leverage the density field that already exists in NeRF models to formulate a new geometry contrastive loss encouraging spatial coherence. Specifically, given a batch of $M$ cast rays as NeRF's input, we obtain a density field of size $M \times K$, where $K$ is the number of sampled points along each ray. By accumulating the discrete bins along each ray, we can roughly represent the density field along a ray as a single 3D point:

$$\mathbf{p} = \mathbf{r}_o + \mathbf{r}_d \cdot D(\mathbf{r}), \tag{5}$$

$$D(\mathbf{r}) = \sum_{k=1}^{K} T(k)\,(1 - \exp(-\sigma_k \delta_k))\, t_k, \tag{6}$$

where $\mathbf{p}$ is the accumulated 3D point along the ray and $D$ is the estimated depth value of the corresponding pixel. Inspired by Point Transformer (Zhao et al., 2021), which uses point-wise distance as a representation, we utilize the estimated point positions as a geometry cue to formulate a new geometry-level correspondence volume across views by measuring point-wise absolute distance:

$$G_{hwh'w'} = \sum_{c} \frac{1}{|g_{chw} - g'_{ch'w'}| + \epsilon}, \tag{7}$$

where $g$ and $g'$ are the estimated 3D point positions of two random patches from different views, $c$ indexes the three spatial coordinates, and $(h, w)$ and $(h', w')$ denote spatial locations on the tensors of $g$ and $g'$, respectively.

Injecting Geometry Coherence into the Segmentation Field. To inject the geometry cue from the density field into the segmentation field, we formulate the segmentation correspondence volume $S$ and the geometric correspondence volume $G$ following the same construction as the appearance case.
By pulling/pushing positive/negative pairs with the geometry-segmentation correlation, we arrive at a new geometry-aware contrastive loss $\mathcal{L}_{geo}$:

$$\mathcal{C}_{geo}(\mathbf{r}, b) = -\sum_{hwh'w'} (G_{hwh'w'} - b)\, S_{hwh'w'}, \tag{8}$$

$$\mathcal{L}_{geo} = \lambda_{id}\, \mathcal{C}_{geo}(\mathbf{r}_{id}, b_{id}) + \lambda_{neg}\, \mathcal{C}_{geo}(\mathbf{r}_{neg}, b_{neg}). \tag{9}$$

As with the appearance contrastive loss, we find positive and negative pairs via the pair-wise cosine similarity of the [CLS] tokens.
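The geometry side can be sketched the same way: collapse each ray to one 3D point (Eq. 5), then build an inverse-distance correspondence volume (Eq. 7). This is a hedged sketch assuming per-patch point tensors of shape (3, H, W); names are ours:

```python
import numpy as np

def accumulated_points(rays_o, rays_d, depths):
    """Collapse each ray's density field to a single 3D point (cf. Eq. 5).

    rays_o, rays_d: (M, 3) ray origins and directions
    depths: (M,) accumulated depths D(r) from Eq. 6
    """
    return rays_o + rays_d * depths[:, None]

def geometry_volume(g, g2, eps=1e-3):
    """Geometry correspondence volume (cf. Eq. 7).

    g, g2: (3, H, W) accumulated 3D points for two patches. Entries are
    large when two locations are spatially close:
    G[h, w, h', w'] = sum_c 1 / (|g[c,h,w] - g2[c,h',w']| + eps).
    """
    diff = np.abs(g[:, :, :, None, None] - g2[:, None, None, :, :])  # (3,H,W,H,W)
    return (1.0 / (diff + eps)).sum(axis=0)
```

G then plays the role that F plays in Equation 3: pairs of pixels whose accumulated points nearly coincide exert a strong attractive force on the corresponding segmentation entries.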

3.4. OPTIMIZING WITH STRIDE RAY SAMPLING

We adopt patch-wise ray casting during training, and additionally leverage a Stride Ray Sampling strategy, similar to prior works (Schwarz et al., 2020b; Meng et al., 2021), to handle the GPU memory bottleneck. Overall, we optimize the pipeline with a balanced loss function:

$$\mathcal{L} = \lambda_0 \mathcal{L}_{\text{photometric}} + \lambda_1 \mathcal{L}_{app} + \lambda_2 \mathcal{L}_{geo}, \tag{10}$$

where $\lambda_0$, $\lambda_1$, and $\lambda_2$ are balancing weights.

4. EXPERIMENTS

4.1. EXPERIMENT SETUP

Datasets. We evaluate all methods on four representative datasets: the Local Light Field Fusion (LLFF) dataset (Mildenhall et al., 2019), BlendedMVS (Yao et al., 2020), CO3Dv2 (Reizenstein et al., 2021), and Tanks and Temples (Riegler & Koltun, 2020).

Training Details. We first implement the collaborative contrastive loss upon the original NeRF (Mildenhall et al., 2020a). We train NeRF-SOS without the segmentation branch following the NeRF training recipe (Mildenhall et al., 2020b) for 150k iterations. Next, we load the weights and train the segmentation branch alone using stride ray sampling for another 50k iterations; all model weights except the segmentation branch are kept frozen in this second phase. The loss weights λ0, λ1, λ2, λid, and λneg are set to 0, 1, 0.01, 1, and 1, respectively, when training the segmentation branch. The segmentation branch is a four-layer MLP with ReLU activations; the hidden dimension and the number of output channels are set to 256 and 2, respectively. The segmentation results are obtained by K-means clustering on the segmentation logits. For fair comparison, we train Semantic-NeRF (Zhi et al., 2021) for 200k iterations in total. We randomly sample eight patches from different viewpoints (i.e., batch size N is 8) during training. The patch size of each sample is 64 × 64, with a patch stride of 6. We use the official DINO-ViT, pre-trained in a self-supervised manner on the ImageNet dataset, as our 2D feature extractor; the pre-trained DINO backbone is kept frozen for all layers during training. All hyperparameters are carefully tuned by grid search, and the best configuration is applied to all experiments. All models are trained on an NVIDIA RTX A6000 GPU with 48 GB of memory. We construct N positive and N negative pairs on the fly during training, given N rendered patches. More details can be found in the appendix.

Metrics. We adopt the Adjusted Rand Index on novel views as a metric to evaluate clustering quality, denoted NV-ARI.
We also adopt mean Intersection-over-Union (mIoU) to measure segmentation quality for both object and background, designating the cluster with larger DINO activation as foreground. To evaluate rendering quality, we follow NeRF (Mildenhall et al., 2020a) and adopt the peak signal-to-noise ratio (PSNR), the structural similarity index measure (SSIM) (Wang et al., 2004), and the learned perceptual image patch similarity (LPIPS) (Zhang et al., 2018).
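For reference, the Adjusted Rand Index underlying NV-ARI can be computed from the contingency table of the two labelings. A minimal sketch; in practice a library routine such as sklearn.metrics.adjusted_rand_score would typically be used instead:

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two clusterings (used as NV-ARI when
    computed on held-out novel views). Returns 1 for identical
    partitions and ~0 for random agreement."""
    labels_true = np.asarray(labels_true).ravel()
    labels_pred = np.asarray(labels_pred).ravel()
    _, class_idx = np.unique(labels_true, return_inverse=True)
    _, cluster_idx = np.unique(labels_pred, return_inverse=True)
    # contingency table: counts of points sharing (class, cluster)
    table = np.zeros((class_idx.max() + 1, cluster_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (class_idx, cluster_idx), 1)
    def comb2(x):
        return x * (x - 1) // 2
    sum_comb = comb2(table).sum()
    sum_a = comb2(table.sum(axis=1)).sum()   # pairs within true classes
    sum_b = comb2(table.sum(axis=0)).sum()   # pairs within predicted clusters
    n_pairs = comb2(labels_true.size)
    expected = sum_a * sum_b / n_pairs
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_comb - expected) / (max_index - expected)
```

Note that ARI is invariant to label permutation, which matters here because the K-means cluster IDs carry no fixed foreground/background meaning.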

4.2. COMPARISONS

Self-supervised Object Segmentation on LLFF. We build NeRF-SOS on vanilla NeRF (Mildenhall et al., 2020a) to validate its effectiveness on the LLFF dataset. Three groups of current object segmentation methods are adopted for comparison: i. NeRF-based methods, including our NeRF-SOS and the supervised Semantic-NeRF (Zhi et al., 2021) trained with annotated masks; ii. image-based object co-segmentation methods: DINO-CoSeg (Amir et al., 2021) and DOCS (Li et al., 2018); and iii. single-image unsupervised segmentation: IEM (Savarese et al., 2021), which follows CIS (Yang et al., 2019) in minimizing the mutual information between foreground and background. As image-based segmentation methods cannot generate novel views, we pre-render the new views using NeRF and construct image pairs between the first image in the test set and the others for DINO-CoSeg (Amir et al., 2021) and DOCS (Li et al., 2018). Evaluations of IEM also use pre-rendered color images. Quantitative comparisons against other segmentation methods are provided in Table 1, with qualitative visualizations shown in Figure 4. These results convey several observations: 1) NeRF-SOS consistently outperforms image-based co-segmentation in evaluation metrics and view consistency. 2) Compared with the state-of-the-art supervised NeRF segmentation method (Semantic-NeRF (Zhi et al., 2021)), our method effectively segments the object within the scene and performs on par in both evaluation metrics and visualization.

Self-supervised Object Segmentation on Object-centric Scenes. For the object-centric datasets BlendedMVS and CO3Dv2, we uniformly select 12.5% of all images for testing. CO3Dv2 provides coarse segmentation maps generated with PointRend (Kirillov et al., 2020), but parts of the annotations are missing. Therefore, we manually create faithful binary masks for training Semantic-NeRF and for evaluation. As shown in Table 2 and Figure 5, our self-supervised NeRF method consistently surpasses the other 2D methods.
We provide more detailed comparisons in the supplementary materials.

Self-supervised Object Segmentation on an Unbounded Scene. To test the generalization ability of the proposed collaborative contrastive loss, we implement it on NeRF++ (Zhang et al., 2020) for a more challenging unbounded scene. We mainly evaluate all previously mentioned methods on the scene Truck, as it is the only NeRF++-provided scene captured by circling an object. We re-implement Semantic-NeRF using NeRF++ as the backbone for the unbounded setting, termed Semantic-NeRF++. Compared with the supervised Semantic-NeRF++, NeRF-SOS achieves slightly worse quantitative results (see Table 3). Yet the visualizations show that NeRF-SOS yields quite decent segmentation quality. For example: 1) in the first row of Figure 6, NeRF-SOS recognizes the side-view mirror adjacent to the truck; 2) in the second row of Figure 6, NeRF-SOS distinguishes the apertures between the wooden slats, as those apertures have different depths from the neighboring slats, thanks to the geometry-aware contrastive loss. Further, we show 3-center clustering results on the distilled segmentation field in Figure 8.

Impact of the Collaborative Contrastive Loss

To study the effectiveness of the collaborative contrastive loss, we build two baseline models that use only the appearance contrastive loss or only the geometric contrastive loss on the NeRF++ backbone. As shown in Figure 7, the segmentation branch fails to cluster spatially continuous objects without the geometric constraint (mIoU: 0.5029). Similarly, without the visual cue, the model loses its perception of the central object (mIoU: 0.5516). Our full model produces precise clusters with spatial coherence (mIoU: 0.9689).

Joint Training with NeRF Optimization

To demonstrate the advantages of two-stage training, we conduct an ablation study that jointly optimizes the vanilla NeRF rendering loss and the proposed two-level collaborative contrastive loss. As shown in Table 4, both the novel view synthesis quality and the segmentation quality decrease significantly when the two losses are optimized together. We conjecture that the optimization of NeRF training is affected by the conflicting update directions of the reconstruction loss and the contrastive loss, which remains a notorious challenge in multi-task learning (Yu et al., 2020).

CNN-based Backbone for Feature Extraction. DINO first showed that the ViT architecture can extract stronger semantic information than ConvNets when trained in a self-supervised manner. To study its effect on discovering the semantic layout of scenes, we instead apply a self-supervised ResNet-50 (He et al., 2020) as the backbone. The results in the second row of Table 4 imply that the ViT architecture is better suited to our NeRF object segmentation, from both the expressiveness and pair-selection perspectives.

5. CONCLUSION, DISCUSSION OF LIMITATION

In this paper, we introduce NeRF-SOS, a self-supervised framework that learns object segmentation for any view in complex real-world scenes. NeRF-SOS employs a collaborative contrastive loss at both the appearance and geometry levels. Comprehensive experiments are conducted on four different types of datasets against state-of-the-art image-based object (co-)segmentation frameworks and the fully supervised Semantic-NeRF. The results show that NeRF-SOS consistently outperforms image-based methods and sometimes generates finer segmentation details than its supervised counterparts. However, like other scene-specific NeRF methods, one limitation of NeRF-SOS is that it cannot segment across scenes, which we plan to explore in future work.

Implementation of Stride Ray Sampling. A neural radiance field casts a number of rays (typically not adjacent) from the camera origin through the pixels to generate input 3D points in the viewing frustum. Our model requires patch-wise rendering of size (P, P) to formulate the collaborative contrastive loss. However, we can only render patches up to 64 × 64 per view due to the GPU memory bottleneck (Garbin et al., 2021), which hardly covers a receptive field sufficient to capture global context with the pre-trained DINO. To solve this problem, we adopt a Strided Ray Sampling strategy (Schwarz et al., 2020b; Meng et al., 2021) that enlarges the receptive field of the patches while keeping the computational cost fixed. Specifically, instead of sampling a patch of P × P adjacent locations, we sample rays with an interval k, resulting in a receptive field of (P × k) × (P × k).
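The strided sampling described above can be sketched as follows: pick a random window, then select every k-th pixel inside it, so P × P rays cover a receptive field roughly k times wider per side. A hedged sketch with names of our choosing:

```python
import numpy as np

def strided_patch_coords(H, W, P=64, stride=6, rng=None):
    """Strided Ray Sampling sketch: a P x P grid of pixel coordinates
    spaced `stride` pixels apart, covering a receptive field of roughly
    (P * stride) x (P * stride) at the cost of only P * P rays.
    """
    if rng is None:
        rng = np.random.default_rng()
    span = (P - 1) * stride + 1              # extent of the strided grid
    top = rng.integers(0, H - span + 1)      # random window position
    left = rng.integers(0, W - span + 1)
    ys = top + np.arange(P) * stride
    xs = left + np.arange(P) * stride
    yy, xx = np.meshgrid(ys, xs, indexing='ij')
    return np.stack([yy, xx], axis=-1)       # (P, P, 2) pixel coordinates
```

The returned coordinates index the pixels whose rays are cast; the rendered P × P patch is then fed to DINO-ViT as if it were a contiguous (downsampled) crop.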

Hyperparameters Selection

The hyperparameters of NeRF-SOS on different datasets are shown in Table 5. We adopt the values of b_knn and b_self from (Hamilton et al., 2022) as the hyperparameters of our appearance-level loss (b_neg and b_id, respectively), and share these appearance-level hyperparameters across all datasets. We set b_id and b_neg for the geometry-level loss with physical intuition. For example, since the radius of the foreground object in the LLFF scenes is roughly 0.5 meters, we set b_id and b_neg to 0.5 and 3, respectively. Analogously, for the unbounded scene (e.g., scene Truck), b_id and b_neg are set to 1 and 5, respectively.

Table 5: Hyperparameters of NeRF-SOS on different datasets. We share the appearance-level hyperparameters across all datasets, while b_id and b_neg for the geometry-level loss are set with physical intuition.

parameter name   | LLFF | BlendedMVS | CO3Dv2 | Tanks and Temples
b_id  (L_geo)    | 0.50 | 0.12      | 0.25   | 1.00
b_neg (L_geo)    | 3.00 | 0.60      | 1.00   | 5.00

Self-supervised Learned 2D Representations. We adopt DINO-ViT (Caron et al., 2021) as our feature extractor for distillation. The training process of DINO-ViT greatly simplifies self-supervised learning by applying a knowledge distillation paradigm (Hinton et al., 2015) with a momentum encoder (He et al., 2020), where the model is updated simply by a cross-entropy loss.

A.2 HUMAN ANNOTATION DETAILS

We use the publicly available annotation tool LabelMe for foreground and background annotation. Specifically, we annotate all training and testing views of the different scenes with multiple polygons, then extract the polygons and convert them to binary masks. The masks for the scene Flower are included in our supplementary material, with usage guidelines provided in the README. All annotated data will be made public.

A.3 ADDITIONAL EXPERIMENTS

Comparisons with Semantic-NeRF using Sparse Labels. As Semantic-NeRF is able to perform label propagation from sparse annotations, we simulate sparse user annotation by randomly keeping {1, 1%, 5%, 10%} of the annotated foreground object pixels while leaving the rest unlabeled. As shown in Figure 9, the foreground boundaries are gradually refined as more annotations are included, which



Figure 2: The overall pipeline of the proposed NeRF-SOS. Given rays cast from multiple views, we render the corresponding color patch (c), segmentation patch (s), and depth patch (σ). Then, appearance-segmentation correlations and geometry-segmentation correlations are used to formulate a collaborative contrastive loss, enabling NeRF-SOS to render object masks from any viewpoint using the distilled segmentation field.

Figure 3: Cosine similarity matrix calculated on scene Fortress.

Discovering Patch Relationships. To construct Equation 4, we build a cosine similarity matrix to effectively discover the positive/negative pairs among given patches. For each matrix, we take N randomly selected patches as inputs and adopt a pre-trained DINO-ViT to extract meaningful representations. We use the [CLS] token of the ViT architecture to represent the semantic features of each patch, obtaining N positive pairs from the diagonal entries and N negative pairs by querying the lowest score in each row. An example using three patches from different views is shown in Figure 3. Similar to Tumanyan et al. (2022), we observe that the [CLS] token of a self-supervised pre-trained ViT backbone captures high-level semantic appearance and can effectively discover similarities between patches during the proposed end-to-end optimization process.

Figure 4: Qualitative results on the scenes Flower and Fortress of the LLFF dataset. In the fourth column, DINO-CoSeg mistakenly matches several discrete patches, as DINO has higher activation on just a few tokens, which may lead to view-inconsistent and disconnected co-segmentation results. The * superscript denotes a supervised method. DOCS and DINO-CoSeg are not able to perform novel view synthesis, so we render with a vanilla NeRF before segmentation.

level correspondence volume across views by measuring the point-wise absolute distance:
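One plausible reading of the geometry-level correspondence volume is a matrix of point-wise absolute differences between rendered depth patches; this sketch is an assumption for illustration, and the paper's exact construction may differ in normalization or the quantity compared:

```python
import numpy as np

def geometry_correspondence(depth_a, depth_b):
    """Point-wise absolute-distance correspondence between two rendered
    depth patches (flattened to P pixels each): entry (i, j) is
    |d_a[i] - d_b[j]|; small values indicate geometrically similar points."""
    da = depth_a.reshape(-1)
    db = depth_b.reshape(-1)
    return np.abs(da[:, None] - db[None, :])   # P x P correspondence volume
```

Such a volume can then be correlated with the segmentation field so that points at similar depths are encouraged to share a cluster.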

Figure 5: Novel view object segmentation results on object-centric datasets: BlendedMVS (the 1st row) and CO3Dv2 (the 2nd row). NeRF-SOS (the 3rd column) produces masks with finer details.

Figure 7: Object segmentations using three loss variants are shown in columns 3, 4, and 5: the collaborative loss (App.+Geo.), the appearance-only loss (App.), and the geometry-only loss (Geo.).

Implementation of the Patch Selection We construct positive and negative pairs on the fly during training. Given N patches rendered from N different viewpoints, we feed them into the DINO-ViT and obtain the [CLS] tokens. Next, we compute an N × N similarity matrix using cosine similarity over the [CLS] tokens. Negative pairs are selected as the lowest-similarity entry in each row; positive pairs are the identity pairs. Overall, 2N pairs (N positives + N negatives) are formed per iteration to compute the collaborative contrastive loss.
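Given the 2N pairs, a simple margin-free contrastive objective can be written as follows. This InfoNCE-flavored sketch is illustrative only: the paper's collaborative loss additionally couples appearance- and geometry-level correlations, which are omitted here.

```python
import numpy as np

def contrastive_loss(sim, pos_idx, neg_idx):
    """Toy contrastive loss over N positive and N negative pairs drawn
    from a similarity matrix: positives are pulled toward similarity 1,
    negatives are penalized whenever their similarity exceeds 0."""
    rows = np.arange(sim.shape[0])
    pos = sim[rows, pos_idx]                 # should approach 1
    neg = sim[rows, neg_idx]                 # should stay <= 0
    return float(np.mean(1.0 - pos) + np.mean(np.maximum(0.0, neg)))
```

The loss vanishes exactly when every positive pair has similarity 1 and every negative pair has non-positive similarity, which is the regime the patch selection above aims for.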

Quantitative comparison of novel view synthesis and object segmentation on the scenes Flower and Fortress of the LLFF dataset.

Quantitative evaluation of novel view synthesis and object segmentation on the BlendedMVS and CO3Dv2 datasets, against several 2D object discovery frameworks and the supervised Semantic-NeRF. Results on each dataset are averaged over all scenes. Metrics: PSNR ↑, SSIM ↑, LPIPS ↓, NV-ARI ↑, IoU(BG) ↑, IoU(FG) ↑, mIoU ↑.

Quantitative object segmentation results on the outdoor unbounded scene Truck, against several 2D object discovery frameworks and the supervised Semantic-NeRF.

Experiments on multiple NeRF-SOS variants. We show joint training of the NeRF and the contrastive loss in the first row, NeRF-SOS with ResNet50 as the feature extractor in the second row, and our final model in the last row.

ACKNOWLEDGEMENT

We would like to express our gratitude to Xinhang Liu from HKUST and Zhongzheng Ren from UIUC for their invaluable contribution to the experimental work presented in this paper. Their expertise and time proved to be immensely helpful in conducting the necessary comparisons with their methods, especially since their codes were not publicly available. We greatly appreciate their generosity and support throughout the research process.

APPENDIX

A concurrent work, RFP (Liu et al., 2022), tackles label-free NeRF segmentation on real-world scenes, but there is still room to improve its segmentation accuracy. To achieve high-quality label-free NeRF segmentation on real-world scenes, NeRF-SOS leverages the proposed collaborative contrastive loss to perform self-supervised object segmentation. Experiments in Table 7 and Figure 10 demonstrate that our label-free method consistently outperforms RFP and the weakly-supervised NVOS.

Qualitative Visualization on More Views Qualitative comparisons on the LLFF dataset can be found in Figures 11 and 12, on the BlendedMVS dataset in Figures 13 and 14, and on the CO3Dv2 dataset in Figures 15 and 16. Qualitative comparisons on the Tank & Temples dataset can be found in Figure 17. Here, we visualize three different views to show the segmentation consistency across views. Video visualizations can be found in the supplementary material.

