SEGNERF: 3D PART SEGMENTATION WITH NEURAL RADIANCE FIELDS

Abstract

Recent advances in Neural Radiance Fields (NeRF) boast impressive performance for generative tasks such as novel view synthesis and 3D reconstruction. Methods based on neural radiance fields represent the 3D world implicitly by relying exclusively on posed images. Yet, they have seldom been explored for discriminative tasks such as 3D part segmentation. In this work, we attempt to bridge that gap by proposing SegNeRF: a neural field representation that integrates a semantic field along with the usual radiance field. SegNeRF inherits from previous works the ability to perform novel view synthesis and 3D reconstruction, and enables 3D part segmentation from a few images. Our extensive experiments on PartNet show that SegNeRF is capable of simultaneously predicting geometry, appearance, and semantic information from posed images, even for unseen objects. The predicted semantic fields allow SegNeRF to achieve an average mIoU of 30.30% for 2D novel view segmentation and 37.46% for 3D part segmentation, competitive with point-based methods while using only a few posed images. Additionally, SegNeRF can generate an explicit 3D model, with its corresponding part segmentation, from a single image of an object taken in the wild.



We will release our code publicly for reproducibility.




Figure 1: SegNeRF framework: Implicit Representation with Neural Radiance Fields for 2D novel view Semantic Segmentation, as well as 3D Segmentation and Reconstruction. Our model takes as input one or more source views of an object (top-left image). The source view is used to generate a feature grid, which is queried with a set of (i) ray points for volume rendering, (ii) an object point cloud for 3D semantic part segmentation, or (iii) a point grid for 3D reconstruction. Training is supervised only through images in the form of 2D reconstruction and segmentation losses. However, at test time, our model is also capable of generating 3D semantic segmentation and reconstruction.
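The volume rendering step in the pipeline above can be illustrated with a minimal sketch. The snippet below is not the paper's implementation; it assumes the standard NeRF alpha-compositing weights and simply extends them, as SegNeRF describes, so that the same ray weights accumulate both radiance and per-part semantic logits. The function name `composite_along_ray` and the array shapes are illustrative assumptions.

```python
import numpy as np

def composite_along_ray(sigmas, rgbs, logits, deltas):
    """Alpha-composite N samples along one ray.

    sigmas: (N,)   predicted densities at the sampled ray points
    rgbs:   (N,3)  predicted colors (radiance field)
    logits: (N,K)  predicted per-part semantic logits (semantic field)
    deltas: (N,)   distances between consecutive samples
    """
    # Opacity of each sample: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance T_i = prod_{j < i} (1 - alpha_j); T_0 = 1
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = alphas * trans                          # (N,)
    # The same weights render both the pixel color and its semantic logits
    color = (weights[:, None] * rgbs).sum(axis=0)     # (3,)
    seg = (weights[:, None] * logits).sum(axis=0)     # (K,)
    return color, seg, weights
```

Supervising `seg` against 2D segmentation masks (alongside the photometric loss on `color`) is what lets the semantic field be trained from images alone, while the same field can later be queried directly at 3D points for part segmentation.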

