NERF-SOS: ANY-VIEW SELF-SUPERVISED OBJECT SEGMENTATION ON COMPLEX SCENES

Abstract

Neural volumetric representations have shown that Multi-Layer Perceptrons (MLPs) can be optimized with multi-view calibrated images to represent scene geometry and appearance without explicit 3D supervision. Object segmentation can enrich many downstream applications based on the learned radiance field. However, introducing hand-crafted segmentation to define regions of interest in a complex real-world scene is non-trivial and expensive, since it requires per-view annotations. This paper explores self-supervised learning of object segmentation using NeRF for complex real-world scenes. Our framework, called NeRF with Self-supervised Object Segmentation (NeRF-SOS), couples object segmentation and the neural radiance field to segment objects from any view within a scene. By proposing a novel collaborative contrastive loss at both the appearance and geometry levels, NeRF-SOS encourages NeRF models to distill compact, geometry-aware segmentation clusters from their density fields and from self-supervised pre-trained 2D visual features. The self-supervised object segmentation framework can be applied to various NeRF models, yielding both photo-realistic rendering results and convincing segmentation maps, for indoor and outdoor scenarios alike. Extensive results on the LLFF, BlendedMVS, CO3Dv2, and Tanks and Temples datasets validate the effectiveness of NeRF-SOS: it consistently surpasses 2D-based self-supervised baselines and predicts finer object masks than existing supervised counterparts.

1. INTRODUCTION

Scene modeling and representation are essential to the computer vision community. For instance, portable Augmented Reality (AR) devices such as the Magic Leap One can reconstruct scene geometry and localize users (DeChicchis, 2020), but they often struggle to comprehend the surrounding objects. This limitation poses challenges when designing interactions between humans and the environment. Although human-annotated data from diverse environments could mitigate the hurdles of understanding and segmenting surrounding objects, collecting such data is costly and time-consuming. Therefore, there is growing interest in developing intelligent geometry modeling frameworks that can learn through unsupervised or self-supervised techniques. Recently, neural volumetric rendering techniques such as the neural radiance field (NeRF) and its variants (Mildenhall et al., 2020a; Zhang et al., 2020; Barron et al., 2021) have demonstrated exceptional performance in scene reconstruction, utilizing multi-layer perceptrons (MLPs) and calibrated multi-view images to render fine-grained unseen views. While several recent works have explored scene understanding with these techniques (Vora et al., 2021; Yang et al., 2021; Zhi et al., 2021), they require either dense view annotations to train a heavy 3D backbone for capturing semantic representations (Vora et al., 2021; Yang et al., 2021) or human intervention to provide sparse semantic labels (Zhi et al., 2021). Although recent self-supervised object discovery approaches on neural radiance fields (Yu et al., 2021c; Stelzner et al., 2021) have been effective at decomposing objects in synthetic indoor data, a significant gap remains in applying them to complex real-world scenarios.
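To make the volumetric rendering idea above concrete, the following minimal, self-contained sketch (not the paper's implementation) shows how a radiance field is queried and composited along a camera ray. A toy analytic field, a view-independent red sphere, stands in for the trained MLP:

```python
import numpy as np

def toy_radiance_field(pts):
    """Stand-in for the trained MLP: maps 3D points to (density, color).
    Here: a solid sphere of radius 0.5 at the origin, colored red.
    Unlike a real NeRF, the color ignores the viewing direction."""
    d = np.linalg.norm(pts, axis=-1)
    sigma = 20.0 * (d < 0.5)                           # density field
    rgb = np.tile(np.array([1.0, 0.1, 0.1]), (*pts.shape[:-1], 1))
    return sigma, rgb

def render_ray(origin, direction, near=0.0, far=2.0, n_samples=64):
    """Quadrature approximation of the volume-rendering integral:
    sample the field along the ray, then alpha-composite front to back."""
    t = np.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction
    sigma, rgb = toy_radiance_field(pts)
    delta = np.diff(t, append=t[-1] + (t[1] - t[0]))   # sample spacing
    alpha = 1.0 - np.exp(-sigma * delta)               # per-sample opacity
    # transmittance: probability the ray reaches each sample unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(axis=0)

# a ray shot from z = -1.5 straight through the sphere
color = render_ray(np.array([0.0, 0.0, -1.5]), np.array([0.0, 0.0, 1.0]))
# -> approximately [1.0, 0.1, 0.1]: the ray hits the sphere and saturates
```

A full NeRF replaces `toy_radiance_field` with an MLP conditioned on an encoded position and viewing direction, and optimizes it so that rendered colors match the calibrated input views.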

In contrast to previous works, we investigate a more generic setting: using general NeRF models to segment 3D objects in real-world scenes. We propose a new self-supervised object segmentation framework for NeRF that utilizes a collaborative contrastive loss. Our approach combines features from a self-supervised pre-trained 2D backbone ("appearance level") with knowledge distilled from the geometry cues of a scene through the density field of the NeRF representation ("geometry level"). More specifically, we learn from a self-supervised pre-trained 2D feature extractor such as DINO-ViT (Caron et al., 2021) and incorporate inter-view visual correlations to generate distinct segmentation feature clusters within the NeRF framework. We further introduce a geometry-level contrastive loss that formulates a geometric correlation volume between NeRF's density field and the segmentation clusters, making the learned feature clusters aware of scene geometry. Our self-supervised object segmentation framework tailored for NeRF, dubbed NeRF-SOS, serves as a general implicit framework and can be applied to existing NeRF models with end-to-end training. We implement and evaluate NeRF-SOS using vanilla NeRF (Mildenhall et al., 2020a) for a real-world forward-facing dataset (LLFF (Mildenhall et al., 2019)) and object-centric datasets (BlendedMVS (Yao et al., 2020) and CO3Dv2 (Reizenstein et al., 2021)), and using NeRF++ (Zhang et al., 2020) for an outdoor unbounded dataset (Tanks and Temples (Riegler & Koltun, 2020)). Experiments show that NeRF-SOS significantly outperforms existing object discovery methods and produces view-consistent segmentation clusters; a few examples are shown in Figure 1.

We summarize our main contributions as follows:

• We explore how to effectively transfer self-supervised 2D visual features to 3D representations through an appearance contrastive loss, which forms compact feature clusters that enable any-view object segmentation in complex real-world scenes.

• We propose a new geometry contrastive loss for object segmentation. By leveraging the density field, our framework injects scene geometry into the segmentation field, making the learned segmentation clusters geometry-aware.

• The proposed collaborative contrastive framework can be implemented on top of NeRF and NeRF++, covering object-centric, indoor, and unbounded real-world scenarios. Experiments show that our self-supervised segmentation quality consistently surpasses 2D object discovery methods and even yields finer segmentation results than the supervised NeRF counterpart (Zhi et al., 2021).

Figure 1: Visual examples. From left to right: ground-truth color images, annotated object masks, object masks rendered by NeRF-SOS, 2D image co-segmentation using DINO (Amir et al., 2021), and object masks rendered by Semantic-NeRF (Zhi et al., 2021). NeRF-SOS outperforms the previous methods by generating object masks with more precise local details.

2. RELATED WORK

Neural Radiance Fields NeRF was first proposed by Mildenhall et al. (2020b); it models the underlying 3D scene as continuous volumetric fields of color and density via MLP layers. The input to a NeRF is a 5D vector containing a 3D location (x, y, z) and a 2D viewing direction (θ, ϕ). Several follow-up works have since emerged to address its limitations and improve its performance.
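As a rough illustration of the collaborative contrastive idea described above, the hypothetical sketch below builds a pairwise correlation matrix from a reference signal, frozen 2D features for the appearance level, per-ray density profiles for the geometry level, and uses it to attract or repel pairs of segmentation features. The function name, thresholds, feature dimensions, and quadratic loss form are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def _unit(x):
    """L2-normalize feature vectors along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def correlation_contrastive_loss(seg_feats, ref_feats,
                                 pos_thresh=0.6, neg_thresh=0.3):
    """Contrast segmentation features against a reference correlation volume.
    seg_feats: (N, C) features from the segmentation field, one per patch/ray.
    ref_feats: (N, D) reference signal, e.g. frozen 2D backbone features
               (appearance level) or sampled density profiles (geometry level).
    Pairs with high reference correlation are pulled together in the
    segmentation feature space; low-correlation pairs are pushed apart."""
    corr_ref = _unit(ref_feats) @ _unit(ref_feats).T   # (N, N) reference correlation
    corr_seg = _unit(seg_feats) @ _unit(seg_feats).T   # (N, N) segmentation correlation
    pos = (corr_ref > pos_thresh).astype(float)        # attract these pairs
    neg = (corr_ref < neg_thresh).astype(float)        # repel these pairs
    attract = pos * (1.0 - corr_seg) ** 2
    repel = neg * np.clip(corr_seg, 0.0, None) ** 2
    return float((attract + repel).mean())

# collaborative loss: appearance term (2D backbone features, e.g. DINO-ViT
# patch tokens) plus geometry term (densities sampled along each patch's ray)
rng = np.random.default_rng(0)
seg_feats  = rng.normal(size=(16, 8))    # segmentation-field features
dino_feats = rng.normal(size=(16, 384))  # frozen 2D features (assumed dim)
ray_dens   = rng.random(size=(16, 64))   # per-ray density samples
loss = (correlation_contrastive_loss(seg_feats, dino_feats)
        + correlation_contrastive_loss(seg_feats, ray_dens))
```

Both levels share the same contrastive skeleton and differ only in the reference signal supplying the correlations, which is the sense in which the appearance and geometry terms collaborate.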

