NEURAL GROUNDPLANS: PERSISTENT NEURAL SCENE REPRESENTATIONS FROM A SINGLE IMAGE

Abstract

We present a method to map 2D image observations of a scene to a persistent 3D scene representation, enabling novel view synthesis and disentangled representation of the movable and immovable components of the scene. Motivated by the bird's-eye-view (BEV) representation commonly used in vision and robotics, we propose conditional neural groundplans, ground-aligned 2D feature grids, as persistent and memory-efficient scene representations. Our method is trained self-supervised from unlabeled multi-view observations using differentiable rendering, and learns to complete the geometry and appearance of occluded regions. In addition, we show that we can leverage multi-view videos at training time to learn to separately reconstruct static and movable components of the scene from a single image at test time. The ability to separately reconstruct movable objects enables a variety of downstream tasks using simple heuristics, such as extraction of object-centric 3D representations, novel view synthesis, instance-level segmentation, 3D bounding box prediction, and scene editing. This highlights the value of neural groundplans as a backbone for efficient 3D scene understanding models.

1. INTRODUCTION

We study the problem of inferring a persistent 3D scene representation given a few image observations, while disentangling static scene components from movable objects (referred to as dynamic). Recent works in differentiable rendering have made significant progress on the long-standing problem of 3D reconstruction from small sets of image observations (Yu et al., 2020; Sitzmann et al., 2019b; Sajjadi et al., 2021). Approaches based on pixel-aligned features (Yu et al., 2020; Trevithick & Yang, 2021; Henzler et al., 2021) have achieved plausible novel view synthesis of scenes composed of independent objects from single images. However, these methods do not produce persistent 3D scene representations that can be directly processed in 3D, for instance via 3D convolutions; instead, all processing has to be performed in image space. In contrast, some methods infer 3D voxel grids, enabling processing such as geometry and appearance completion via shift-equivariant 3D convolutions (Lal et al., 2021; Guo et al., 2022), which is, however, expensive in terms of both computation and memory. Meanwhile, bird's-eye-view (BEV) representations, 2D grids aligned with the ground plane of a scene, have been fruitfully deployed as state representations for navigation, layout generation, and future frame prediction (Saha et al., 2022; Philion & Fidler, 2020; Roddick et al., 2019; Jeong et al., 2022; Mani et al., 2020). While they compress the height axis and are thus not a full 3D representation, 2D convolutions on top of BEVs retain shift-equivariance in the ground plane and are, in contrast to image-space convolutions, free of perspective camera distortions.
Inspired by BEV representations, we propose conditional neural groundplans, 2D grids of learned features aligned with the ground plane of a 3D scene, as a persistent 3D scene representation that can be reconstructed in a feed-forward manner. Neural groundplans are a hybrid discrete-continuous 3D neural scene representation (Chan et al., 2022; Peng et al., 2020; Philion & Fidler, 2020; Roddick et al., 2019; Mani et al., 2020) and enable 3D queries by projecting a 3D point onto the groundplan, retrieving the respective feature, and decoding it via an MLP into a full 3D scene. This enables self-supervised training via differentiable volume rendering. By compactifying 3D space with a nonlinear mapping, neural groundplans can encode unbounded 3D scenes in a bounded region. We further propose to reconstruct separate neural groundplans for 3D regions of a scene that are movable and 3D regions that are static, given a single input image. This requires that objects move in the training data, enabling us to learn a prior for predicting which parts of a scene are movable and which are static from a single image at test time. We achieve this additional factorization by training on multi-view videos, such as those available from cameras at traffic intersections or sports game footage. Our model is trained self-supervised via neural rendering, without pseudo-ground truth, bounding boxes, or any instance labels. We demonstrate that separate reconstruction of movable objects enables instance-level segmentation, recovery of 3D object-centric representations, and 3D bounding box prediction via a simple heuristic: connected regions of 3D space that move together belong to the same object. This further enables intuitive 3D editing of the scene. Since neural groundplans are 2D grids of features without perspective camera distortion, shift-equivariant processing using inexpensive 2D CNNs effectively completes occluded regions. Our model thus outperforms prior pixel-aligned approaches in synthesizing novel views of 3D regions that are occluded in the input view.
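To make the query mechanism concrete, the following is a minimal numpy sketch of the project-contract-sample step described above. The grid resolution, feature dimension, and the particular contraction function are illustrative assumptions, not our exact implementation; in the full model, the sampled feature and the query height would be decoded by an MLP into density and color.

```python
import numpy as np

G = 64          # groundplan resolution (G x G) -- illustrative choice
C = 8           # feature channels -- illustrative choice
rng = np.random.default_rng(0)
groundplan = rng.standard_normal((G, G, C))  # stands in for the inferred 2D feature grid

def contract(xz, scale=10.0):
    """Nonlinear compactification: maps unbounded ground-plane
    coordinates to the bounded square (-1, 1)^2 (assumed form)."""
    return xz / (scale + np.abs(xz))

def query_feature(point_xyz):
    """Project a 3D point onto the ground plane (drop the height axis),
    contract to the bounded domain, then bilinearly sample the grid."""
    xz = contract(point_xyz[[0, 2]])             # projection: keep x, z
    uv = (xz + 1.0) * 0.5 * (G - 1)              # to grid coordinates
    u0, v0 = np.floor(uv).astype(int)
    u1, v1 = min(u0 + 1, G - 1), min(v0 + 1, G - 1)
    du, dv = uv - [u0, v0]
    f = ((1 - du) * (1 - dv) * groundplan[v0, u0]
         + du * (1 - dv) * groundplan[v0, u1]
         + (1 - du) * dv * groundplan[v1, u0]
         + du * dv * groundplan[v1, u1])
    return f  # the full model decodes (f, height) via an MLP

feat = query_feature(np.array([2.0, 1.5, -3.0]))
```

Because `contract` is bounded, arbitrarily distant points still map inside the grid, which is what allows unbounded scenes to be encoded in a bounded region.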
We further show that by leveraging motion cues at training time, our method outperforms prior work on self-supervised discovery of 3D objects. In summary, our contributions are:
• We introduce self-supervised training of conditional neural groundplans, a hybrid discrete-continuous 3D neural scene representation that can be reconstructed from a single image, enabling efficient processing of scene appearance and geometry directly in 3D.
• We leverage object motion as a cue for disentangling static background and movable foreground objects given only a single input image.
• Using the 3D geometry encoded in the dynamic groundplan, we demonstrate single-image 3D instance segmentation and 3D bounding box prediction, as well as 3D scene editing.
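The object-discovery heuristic, that connected regions which move together form one object, can be sketched as a connected-component pass over an occupancy map derived from the dynamic groundplan. The sketch below is a toy illustration under assumed choices (a fixed density threshold, 4-connectivity, pure-numpy labeling); the grid and names are hypothetical:

```python
import numpy as np

def label_components(occ):
    """4-connected component labeling on a boolean 2D grid via flood fill."""
    labels = np.zeros(occ.shape, dtype=int)
    current = 0
    for i in range(occ.shape[0]):
        for j in range(occ.shape[1]):
            if occ[i, j] and labels[i, j] == 0:
                current += 1
                stack = [(i, j)]
                while stack:
                    y, x = stack.pop()
                    if (0 <= y < occ.shape[0] and 0 <= x < occ.shape[1]
                            and occ[y, x] and labels[y, x] == 0):
                        labels[y, x] = current
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return labels, current

def boxes_from_labels(labels, n):
    """Axis-aligned ground-plane box (ymin, xmin, ymax, xmax) per object;
    lifting to 3D would add the object's height extent from the geometry."""
    boxes = []
    for k in range(1, n + 1):
        ys, xs = np.nonzero(labels == k)
        boxes.append((ys.min(), xs.min(), ys.max(), xs.max()))
    return boxes

# Toy dynamic-density groundplan containing two separated objects.
density = np.zeros((8, 8))
density[1:3, 1:3] = 1.0
density[5:7, 4:8] = 1.0
labels, n = label_components(density > 0.5)
boxes = boxes_from_labels(labels, n)
```

Each connected component yields one instance mask and one bounding box, which is what makes instance segmentation and 3D box prediction available without any instance labels at training time.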

2. RELATED WORK

Neural Scene Representation and Rendering. Several works have explored learning neural scene representations for downstream tasks in 3D. Emerging neural scene representations enable reconstruction of geometry and appearance from images, as well as high-quality novel view synthesis via differentiable rendering. A large part of recent work focuses on reconstructing a single 3D scene given dense observations (Cheng et al., 2018; Tung et al., 2019; Sitzmann et al., 2019a; Lombardi et al., 2019; Mildenhall et al., 2020; Yariv et al., 2020; Tewari et al., 2021). Alternatively, differentiable rendering may be used to supervise encoders that reconstruct scenes from a single or few images in a feed-forward manner. Pixel-aligned conditioning enables reconstruction of compositional scenes (Yu et al., 2020; Trevithick & Yang, 2021), but does not infer a compact 3D representation. Methods with a single latent code per scene infer a compact representation, but do not generalize to compositional scenes (Sitzmann et al., 2019c; Jang & Agapito, 2021; Niemeyer et al., 2020; Sitzmann et al., 2021; Kosiorek



Figure 1: Given a single image, our model infers separate 3D representations for static and dynamic scene elements, enabling high-quality novel view synthesis with plausible completion, unsupervised instance-level segmentation, 3D bounding box prediction, 3D scene editing, and extraction of object-centric 3D representations. Our model is trained self-supervised using unlabeled multi-view videos.

