SEGNERF: 3D PART SEGMENTATION WITH NEURAL RADIANCE FIELDS

Abstract

Recent advances in Neural Radiance Fields (NeRF) have achieved impressive performance on generative tasks such as novel view synthesis and 3D reconstruction. Methods based on neural radiance fields represent the 3D world implicitly by relying exclusively on posed images. Yet, they have seldom been explored for discriminative tasks such as 3D part segmentation. In this work, we attempt to bridge that gap by proposing SegNeRF: a neural field representation that integrates a semantic field along with the usual radiance field. SegNeRF inherits from previous works the ability to perform novel view synthesis and 3D reconstruction, and additionally enables 3D part segmentation from a few images. Our extensive experiments on PartNet show that SegNeRF is capable of simultaneously predicting geometry, appearance, and semantic information from posed images, even for unseen objects. The predicted semantic fields allow SegNeRF to achieve an average mIoU of 30.30% for 2D novel view segmentation and 37.46% for 3D part segmentation, performing competitively against point-based methods while using only a few posed images. Additionally, SegNeRF is able to generate an explicit 3D model, with its corresponding part segmentation, from a single image of an object taken in the wild.




1. INTRODUCTION

Humans are able to perceive and understand the objects that surround them. Impressively, we understand objects even though our visual system only perceives partial information through 2D projections, i.e. images. Despite possible occlusions, a single image allows us to infer an object's geometry, estimate its pose, and recognize its parts and their locations in 3D space. Inspired by this capacity to use 2D projections to understand 3D objects, we aim to understand object part semantics in 3D space by relying solely on image-level supervision. Most works learn object part segmentation by leveraging 3D supervision, i.e. point clouds or meshes (Wang & Lu, 2019; Qi et al., 2017b; Li et al., 2018). However, given the accessibility of camera sensors, it is easier at inference time to have access to a collection of images than to a 3D scan of the object. To address real-world scenarios, it would thus be advantageous to be able to infer 3D information from image-level supervision. For that purpose, Neural Radiance Fields (NeRF, Mildenhall et al., 2020) emerged as a cornerstone work in novel view synthesis. NeRF showed how to learn a 3D object representation based solely on image supervision. While NeRF's seminal achievement was learning individual scenes in remarkable detail, its impact grew beyond the single-scene setup, spurring follow-up works on speed (Müller et al., 2022; Yu et al., 2021a), scale (Martin-Brualla et al., 2021; Xiangli et al., 2021), and generalization (Jang & Agapito, 2021; Yu et al., 2021b). Despite the rapid advances in NeRF, only a handful of works (Zhi et al., 2021; Vora et al., 2021) have leveraged volume rendering for semantic segmentation or 3D reconstruction. Yet, these capabilities are a defining feature of 3D understanding, with potential applications in important downstream tasks.
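For intuition, the volume rendering referred to above amounts to alpha-compositing densities and colors sampled along a camera ray into a single pixel color. The following is a minimal NumPy sketch of this standard NeRF compositing step (variable names are illustrative and not taken from any released code):

```python
import numpy as np

def composite_along_ray(sigmas, colors, deltas):
    """Alpha-composite N samples along one ray.

    sigmas: (N,) volume densities, colors: (N, 3) RGB values,
    deltas: (N,) distances between consecutive samples.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)            # per-sample opacity
    # Transmittance T_i: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                           # contribution of each sample
    rgb = (weights[:, None] * colors).sum(axis=0)      # expected color along the ray
    return rgb, weights
```

A near-opaque first sample (large density) dominates the composite, as expected from the transmittance term.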
In this work, we present SegNeRF: an implicit representation framework for simultaneously learning geometric, visual, and semantic fields from posed images, as shown in Figure 1. At training time, SegNeRF only requires posed RGB images paired with semantic masks of objects, inheriting NeRF's independence from direct geometric supervision. At inference time, SegNeRF projects appearance and semantic labels onto multiple views, and can thus accommodate a variable number of input views without expensive on-the-fly optimization. SegNeRF inherits the ability to perform 3D reconstruction and novel view synthesis from previous works (PixelNeRF, Yu et al., 2021b). Beyond this, we demonstrate the ability to learn 3D semantic information through volume rendering by leveraging a semantic field. We validate SegNeRF's performance on PartNet (Mo et al., 2019), a large-scale dataset of 3D objects annotated with semantic labels, and show that our approach performs on par with point-based methods without the need for any 3D supervision.

Contributions. We summarize our contributions as follows. (i) We present SegNeRF, a versatile 3D implicit representation that jointly learns appearance, geometry, and semantics from posed RGB images. (ii) We provide extensive experiments validating the capacity of SegNeRF for 3D part segmentation despite relying exclusively on image supervision during training. (iii) To the best of our knowledge, SegNeRF is the first multi-purpose implicit representation capable of jointly reconstructing and segmenting novel objects without expensive test-time optimization.
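One plausible way a semantic field can be supervised from images alone, sketched below under the assumption that a semantic head outputs per-point class distributions that are composited along each ray with the same density-derived weights used for color (the model's exact formulation may differ; `render_semantic_pixel` and `seg_loss` are hypothetical names for illustration):

```python
import numpy as np

def render_semantic_pixel(weights, class_probs):
    """Composite per-point class distributions into one pixel-level distribution.

    weights: (N,) compositing weights from the radiance field's density.
    class_probs: (N, C) per-point class probabilities from the semantic head.
    """
    return (weights[:, None] * class_probs).sum(axis=0)  # (C,)

def seg_loss(rendered_probs, gt_label, eps=1e-8):
    """2D cross-entropy against the ground-truth part label of the pixel."""
    return -np.log(rendered_probs[gt_label] + eps)
```

Because the loss is computed on rendered 2D pixels, only posed images and masks are needed for supervision, while the underlying field remains queryable at arbitrary 3D points at test time.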

2.1. 3D GEOMETRICAL REPRESENTATIONS

The classical representations for data-driven 3D learning systems can be divided into three groups: voxel-, point-, and mesh-based representations. Voxels are a simple 3D extension of pixel representations; however, their memory footprint grows cubically with resolution (Brock et al., 2016; Gadelha et al., 2017). While point clouds are more memory-efficient, they need post-processing to recover missing connectivity information (Fan et al., 2017; Achlioptas et al., 2018). Most mesh-based representations do not require post-processing, but they are often based on deforming a fixed-size template mesh, hindering the processing of arbitrary shapes (Kanazawa et al., 2018; Ranjan et al., 2018). To alleviate these problems, there has been a strong focus on implicitly representing 3D data via neural networks (Mescheder et al., 2019; Park et al., 2019; Sitzmann et al., 2019; Mildenhall et al., 2020). The fundamental idea behind these methods is to use a neural network f(x) to model certain physical properties (e.g., occupancy, distance to the surface, color, density, or illumination) for all points x in 3D space.
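As a minimal illustration of such an implicit field f(x), consider a small MLP that maps a 3D coordinate to an occupancy probability. This sketch uses untrained random weights and is purely illustrative of the interface, not of any particular architecture from the cited works:

```python
import numpy as np

class ImplicitField:
    """Toy implicit representation: f(x) -> occupancy probability in (0, 1)."""

    def __init__(self, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(size=(3, hidden)) * 0.5
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(size=(hidden, 1)) * 0.5
        self.b2 = np.zeros(1)

    def __call__(self, x):
        # x: (M, 3) batch of 3D query points.
        h = np.maximum(x @ self.w1 + self.b1, 0.0)               # ReLU hidden layer
        return 1.0 / (1.0 + np.exp(-(h @ self.w2 + self.b2)))   # sigmoid occupancy
```

The key property is that the representation is continuous: the field can be queried at any point in space, at any resolution, without the cubic memory cost of voxel grids.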



We will release our code publicly for reproducibility.




Figure 1: SegNeRF framework: Implicit Representation with Neural Radiance Fields for 2D novel view Semantic Segmentation, as well as 3D Segmentation and Reconstruction. Our model takes as input one or more source views of an object (top-left image). The source view is used to generate a feature grid, which is queried with a set of (i) ray points for volume rendering, (ii) an object point cloud for 3D semantic part segmentation, or (iii) a point grid for 3D reconstruction. Training is supervised only through images in the form of 2D reconstruction and segmentation losses. However, at test time, our model is also capable of generating 3D semantic segmentation and reconstruction.

