NERF-SOS: ANY-VIEW SELF-SUPERVISED OBJECT SEGMENTATION ON COMPLEX SCENES

Abstract

Neural volumetric representations have shown that Multi-Layer Perceptrons (MLPs) can be optimized with multi-view calibrated images to represent scene geometry and appearance without explicit 3D supervision. Object segmentation can enrich many downstream applications built on the learned radiance field. However, introducing hand-crafted segmentation to define regions of interest in a complex real-world scene is non-trivial and expensive, as it requires per-view annotation. This paper explores self-supervised learning of object segmentation with NeRF for complex real-world scenes. Our framework, NeRF with Self-supervised Object Segmentation (NeRF-SOS), couples object segmentation and the neural radiance field to segment objects from any view within a scene. With a novel collaborative contrastive loss at both the appearance and geometry levels, NeRF-SOS encourages NeRF models to distill compact, geometry-aware segmentation clusters from their density fields and from self-supervised pre-trained 2D visual features. The self-supervised object segmentation framework can be applied to various NeRF models, yielding both photo-realistic rendering results and convincing segmentation maps for indoor and outdoor scenarios. Extensive results on the LLFF, BlendedMVS, CO3Dv2, and Tanks & Temples datasets validate the effectiveness of NeRF-SOS: it consistently surpasses 2D-based self-supervised baselines and predicts finer object masks than existing supervised counterparts.

1. INTRODUCTION

Scene modeling and representation are essential to the computer vision community. For instance, portable Augmented Reality (AR) devices such as the Magic Leap One can reconstruct scene geometry and localize users (DeChicchis, 2020), but they often struggle to comprehend the surrounding objects. This limitation poses challenges when designing interactions between humans and the environment. Although human-annotated data from diverse environments could mitigate the hurdles of understanding and segmenting surrounding objects, collecting such data is costly and time-consuming. Therefore, there is growing interest in developing intelligent geometry modeling frameworks that can learn through unsupervised or self-supervised techniques. Recently, neural volumetric rendering techniques, such as the neural radiance field (NeRF) and its variants (Mildenhall et al., 2020a; Zhang et al., 2020; Barron et al., 2021), have demonstrated exceptional performance in scene reconstruction, utilizing multi-layer perceptrons (MLPs) and calibrated multi-view images to render fine-grained, unseen views. While several recent works have explored scene understanding with these techniques (Vora et al., 2021; Yang et al., 2021; Zhi et al., 2021), they require either dense view annotations to train a heavy 3D backbone for capturing semantic representations (Vora et al., 2021; Yang et al., 2021) or human intervention to provide sparse semantic labels (Zhi et al., 2021). Although recent self-supervised object discovery approaches on neural radiance fields (Yu et al., 2021c; Stelzner et al., 2021) have been effective at decomposing objects in synthetic indoor data, a significant gap remains in applying them to complex real-world scenarios.

