DM-NERF: 3D SCENE GEOMETRY DECOMPOSITION AND MANIPULATION FROM 2D IMAGES

Abstract

In this paper, we study the problem of 3D scene geometry decomposition and manipulation from 2D views. By leveraging recent implicit neural representation techniques, particularly the appealing neural radiance fields, we introduce an object field component to learn unique codes for all individual objects in 3D space only from 2D supervision. The key to this component is a set of carefully designed loss functions that enable every 3D point, especially in non-occupied space, to be effectively optimized without 3D labels. In addition, we introduce an inverse query algorithm to freely manipulate any specified 3D object shape in the learned scene representation. Notably, our manipulation algorithm can explicitly tackle key issues such as object collisions and visual occlusions. Our method, called DM-NeRF, is among the first to simultaneously reconstruct, decompose, manipulate and render complex 3D scenes in a single pipeline. Extensive experiments on three datasets clearly show that our method can accurately decompose all 3D objects from 2D views, allowing any object of interest to be freely manipulated in 3D space, including translation, rotation, size adjustment, and deformation.

1. INTRODUCTION

In many cutting-edge applications such as mixed reality on mobile devices, users may desire to virtually manipulate objects in 3D scenes, such as moving a chair or making a flying broomstick in a 3D room. This would allow users to easily edit real scenes at their fingertips and view objects from new perspectives. However, this is particularly challenging as it involves 3D scene reconstruction, decomposition, manipulation, and photorealistic rendering in a single framework (Savva et al., 2019). A traditional pipeline first reconstructs explicit 3D structures such as point clouds or polygonal meshes using SfM/SLAM techniques (Ozyesil et al., 2017; Cadena et al., 2016), and then identifies 3D objects followed by manual editing. However, these explicit 3D representations inherently discretize continuous surfaces, and changing the shapes often requires additional repair procedures such as remeshing (Alliez et al., 2002). Such discretized and manipulated 3D structures can hardly retain geometry and appearance details, resulting in unappealing generated novel views. Given this, it is worthwhile to design a new pipeline which can recover continuous 3D scene geometry only from 2D views and enable object decomposition and manipulation.

Recently, implicit representations, especially NeRF (Mildenhall et al., 2020), have emerged as a promising tool to represent continuous 3D geometries from images. A series of succeeding methods (Boss et al., 2021; Chen et al., 2021; Zhang et al., 2021c) have been rapidly developed to decouple lighting factors from structures, allowing free edits of illumination and materials. However, they fail to decompose 3D scene geometries into individual objects. Therefore, it is hard to manipulate individual object shapes in complex scenes.
Recent works (Stelzner et al., 2021; Zhang et al., 2021b; Kania et al., 2022; Yuan et al., 2022; Tschernezki et al., 2022; Kobayashi et al., 2022; Kim et al., 2022; Benaim et al., 2022; Ren et al., 2022) have started to learn disentangled shape representations for potential geometry manipulation. However, they either focus on synthetic scenes or simple model segmentation, and can hardly extend to real-world 3D scenes with dozens of objects. 

Figure 1: The general workflow of DM-NeRF. NeRF (green block) is used as the backbone. We propose the object field and manipulation components, illustrated by the blue and orange blocks.

In this paper, we aim to simultaneously recover continuous 3D scene geometry, segment all individual objects in 3D space, and support flexible object shape manipulation such as translation, rotation, size adjustment and deformation. In addition, the edited 3D scenes can also be rendered from novel views. However, this task is extremely challenging as it requires: 1) an object decomposition approach amenable to continuous and implicit 3D fields, without relying on any 3D labels for supervision due to the infeasibility of collecting labels in continuous 3D space; 2) an object manipulation method compatible with the learned implicit and decomposed fields, with an ability to clearly address the visual occlusions inevitably caused by manipulation.

To tackle these challenges, we design a simple pipeline, DM-NeRF, which is built on the successful NeRF but able to decompose the entire 3D space into object fields and freely manipulate their geometries for realistic novel view rendering. As shown in Figure 1, it consists of three major components: 1) the existing radiance field, which learns volume density and appearance for every 3D point in space; 2) the object field, which learns a unique object code for every 3D point; 3) the object manipulator, which directly edits the shape of any specified object and automatically tackles visual occlusions. The object field is the core of DM-NeRF. This component aims to predict a one-hot vector, i.e., an object code, for every 3D point in the entire scene space.
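To make the 2D-only supervision concrete, such per-point object codes can be composited along each camera ray with the same volume-rendering weights NeRF uses for color, so a 2D instance mask can supervise all 3D samples on the ray. Below is a minimal PyTorch sketch under assumed tensor shapes; the function names and the cross-entropy mask loss are illustrative only, not the paper's exact formulation (the actual loss functions are detailed in Section 3.1).

```python
import torch
import torch.nn.functional as F

def render_object_logits(logits, density, deltas):
    """Composite per-sample object logits along one ray with NeRF-style
    volume-rendering weights, yielding a single object prediction per pixel.
    logits: (N, K) per-sample object scores for K object slots,
    density: (N,) volume densities, deltas: (N,) sample spacings."""
    alpha = 1.0 - torch.exp(-density * deltas)                      # (N,)
    # transmittance: probability the ray reaches each sample unoccluded
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                         # (N,)
    return (weights[:, None] * logits).sum(dim=0)                   # (K,)

def mask_loss(rendered_logits, gt_instance_id):
    """Hypothetical 2D supervision: cross-entropy between the rendered
    object prediction and the pixel's ground-truth 2D instance label."""
    return F.cross_entropy(rendered_logits[None, :], gt_instance_id[None])
```

Because the rendering weights are shared with the color branch, gradients from the 2D mask reach every sampled 3D point, including points in empty space whose low density simply downweights their contribution.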
However, learning such a code involves critical issues: 1) there are no ground-truth 3D object codes available for full supervision; 2) the number of total objects is variable and there is no fixed order for objects; 3) the non-occupied (empty) 3D space must be taken into account, but there are likewise no labels for it. In Section 3.1, we show that our object field, together with multiple carefully designed loss functions, can address these issues properly, under the supervision of color images with 2D object masks only.

Once the object field is well learned, our object manipulator aims to directly edit the geometry and render novel views given the target objects, viewing angles, and manipulation settings. A naïve method is to obtain explicit 3D structures followed by manual editing and rendering, so that any shape occlusion and collision can be explicitly addressed. However, it is extremely inefficient to evaluate dense 3D points from implicit fields. To this end, as detailed in Section 3.2, we introduce a lightweight inverse query algorithm to automatically edit the scene geometry. Overall, our pipeline can simultaneously recover 3D scene geometry and decompose and manipulate object instances only from 2D images. Extensive experiments on multiple datasets demonstrate that our method can precisely segment all 3D objects and effectively edit 3D scene geometry, without sacrificing the high fidelity of novel view rendering. Our key contributions are:

• We propose an object field to directly learn a unique code for each object in 3D space only from 2D images, showing remarkable robustness and accuracy over commonly used individual-image-based segmentation methods.
• We propose an inverse query algorithm to effectively edit specified object shapes, while generating realistic scene images from novel views.
• We demonstrate superior performance for 3D decomposition and manipulation, and also contribute the first synthetic dataset for quantitative evaluation of 3D scene editing. Our code and dataset are available at https://github.com/vLAR-group/DM-NeRF.

We note that the recent works ObjectNeRF (Yang et al., 2021), NSG (Ost et al., 2021) and ObjectSDF (Wu et al., 2022) address a similar task to ours. However, ObjectNeRF only decomposes a foreground object, NSG focuses on decomposing dynamic objects, and ObjectSDF only uses semantic labels as regularization. None of them directly learns to segment multiple 3D objects as ours does. A few works (Norman et al., 2022; Kundu et al., 2022; Fu et al., 2022) tackle panoptic segmentation in radiance fields. However, they fundamentally segment objects in 2D images and then learn a separate radiance field for each object. By comparison, our method learns to directly segment all objects in the 3D scene radiance space, and it demonstrates superior accuracy and robustness than
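As a rough illustration of the inverse query idea: instead of densely exporting the edited scene, each sample point on a ray through the *edited* scene is mapped back through the inverse of the manipulation transform, and the learned fields are queried at whichever location the point's predicted object code indicates. The PyTorch sketch below is a hypothetical simplification; the `field` interface, the probability threshold, and the rigid-transform-only handling are our assumptions, and the paper's actual algorithm (Section 3.2) additionally addresses collisions and occlusions.

```python
import torch

def inverse_query(points, field, target_code, T_inv, match_thresh=0.5):
    """Evaluate an edited scene by querying the *original* learned fields.
    points: (N, 3) samples along rays in the edited scene; field(p) is
    assumed to return (density (N,), rgb (N, 3), object_probs (N, K));
    T_inv: (4, 4) inverse of the rigid manipulation transform applied to
    the object with index `target_code`."""
    ones = torch.ones(points.shape[0], 1)
    p_h = torch.cat([points, ones], dim=1)            # homogeneous coords
    p_back = (p_h @ T_inv.T)[:, :3]                   # undo the edit
    d_b, c_b, o_b = field(p_back)                     # fields at back-mapped pts
    d_o, c_o, o_o = field(points)                     # fields at original pts
    # a point shows the moved object if its back-mapped code matches
    is_target = o_b[:, target_code] > match_thresh
    density = torch.where(is_target, d_b, d_o)
    rgb = torch.where(is_target[:, None], c_b, c_o)
    # the object's original location is vacated: zero out its density there
    left_behind = (o_o[:, target_code] > match_thresh) & ~is_target
    density = torch.where(left_behind, torch.zeros_like(density), density)
    return density, rgb
```

Only the sample points actually queried during rendering are transformed, which is why such an inverse query avoids the cost of evaluating dense 3D points from the implicit fields.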

