UNSUPERVISED 3D SCENE REPRESENTATION LEARNING VIA MOVABLE OBJECT INFERENCE

Abstract

Unsupervised, category-agnostic, object-centric 3D representation learning for complex scenes remains an open problem in computer vision. While a few recent methods can now discover 3D object radiance fields from a single image without supervision, they are limited to simplistic scenes with objects of a single category, often with uniform colors. This is because they discover objects purely based on appearance cues: objects are made of pixels that look alike. In this work, we propose Movable Object Radiance Fields (MORF), which aims to scale to complex scenes with diverse categories of objects. Inspired by cognitive science studies of object learning in babies, MORF learns 3D object representations via movable object inference. During training, MORF first obtains 2D masks of movable objects via a self-supervised movable object segmentation method; it then bridges the gap to 3D object representations via conditional neural rendering in multiple views. During testing, MORF can discover, reconstruct, and move unseen objects from novel categories, all from a single image. Experiments show that MORF extracts accurate object geometry and supports realistic object and scene reconstruction and editing, significantly outperforming the state of the art.

1. INTRODUCTION

Learning object-centric 3D representations of complex scenes is a critical precursor to a wide range of application domains in vision, robotics, and graphics. The ability to factorize a scene into objects provides the flexibility of querying the properties of individual objects, which greatly facilitates downstream tasks such as visual reasoning, visual dynamics prediction, manipulation, and scene editing. Furthermore, we hypothesize that building factorized representations provides a strong inductive bias for compositional generalization (Greff et al., 2020), which in turn enables the model to understand novel scenes with previously unseen objects and configurations.

While supervised learning methods have shown promise in learning 3D object representations, such as neural radiance fields (Mildenhall et al., 2020), from images (Ost et al., 2021; Kundu et al., 2022; Müller et al., 2022), they rely on annotations of specific object and scene categories. A recent line of work (Yu et al., 2022; Stelzner et al., 2021) has explored the problem of unsupervised discovery of object radiance fields. These models can be trained from multi-view RGB or RGB-D images to learn object-centric 3D scene decomposition without annotations of object segments and categories. However, they have only been demonstrated to work well on simplistic scenes, and it remains challenging to obtain accurate reconstructions on more complex datasets. A fundamental reason is that they rely heavily on visual appearance similarity to discover object entities, which limits their scalability beyond simple texture-less objects.

In this work, we aim to scale unsupervised 3D object-centric representation learning to complex visual scenes with textured objects from diverse categories. To this end, we propose the Movable Object Radiance Fields (MORF) model, which learns to infer 3D object radiance fields from a single image.
Rather than appearance similarity, the underlying principle MORF uses to discover object entities is material coherence under everyday physical actions, i.e., an object is movable as a whole in 3D space (Spelke, 1990). However, it is challenging to obtain learning signals for directly inferring movable objects in 3D. MORF addresses this problem by integrating a recent self-supervised 2D movable object segmentation method, EISEN (Chen et al., 2022), to extract movable object segments in 2D images, with differentiable neural rendering to bridge the gap between 2D learning signals and 3D inference. MORF learns to conditionally infer object radiance fields from the segmented images, which provide a strong inductive bias for object-centric factorization of 3D scenes.

Specifically, we pretrain EISEN on optical flow from unlabeled videos. EISEN learns object image segmentations by perceptually grouping parts of a scene that would move as cohesive wholes, serving as a module that estimates high-quality object segmentations on static images. After pretraining, MORF learns to extract object-centric latent representations from segmented images and generate object radiance fields from the factorized latents. To facilitate high-quality reconstruction of textured objects, our latent object representation consists of both an entity-level latent and a pixel-level latent that better encodes appearance details.

To evaluate our method, we propose a challenging dataset with a diverse set of realistic-looking objects, going beyond the simplistic scenes considered by most current unsupervised 3D object discovery methods (Yu et al., 2022; Stelzner et al., 2021; Sajjadi et al., 2022b). We demonstrate that MORF can learn high-quality 3D object-centric representations in complex visual scenes, allowing photometric and geometric reconstruction for these scenes from single views (Figure 1).
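The way per-object radiance fields can be composited into a single scene rendering is illustrated by the sketch below. It uses the density-weighted volume-rendering formulation common to compositional NeRF-style models; the function and variable names are illustrative stand-ins, not MORF's actual implementation:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite K object radiance fields along one camera ray.

    sigmas: (K, N) per-object volume densities at N ray samples.
    colors: (K, N, 3) per-object RGB values at the same samples.
    deltas: (N,) distances between consecutive samples.
    Returns the rendered (3,) pixel color.
    """
    sigma = sigmas.sum(axis=0)                         # scene density: sum over objects
    mix = sigmas / np.clip(sigma, 1e-8, None)          # per-object mixing weights
    color = (mix[..., None] * colors).sum(axis=0)      # (N, 3) blended sample colors
    alpha = 1.0 - np.exp(-sigma * deltas)              # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = trans * alpha                            # contribution of each sample
    return (weights[:, None] * color).sum(axis=0)
```

Because densities add, an object can be dropped from the rendering simply by removing its row of `sigmas`, which is part of what makes a factorized representation convenient for scene editing.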
Moreover, our learned representations enable 3D scene manipulation tasks such as moving, rotating, and replacing objects, as well as changing the background of complex scenes. Beyond systematic generalization to unseen spatial layouts and arrangements, we further show that MORF is able to generalize to unseen object categories and appearances while maintaining reasonable reconstruction and geometry estimation quality.

In summary, our contributions are three-fold. First, we propose scaling unsupervised, category-agnostic, object-centric 3D representation learning beyond simplistic scenes by discovering objects based on coherent motion in addition to visual appearance. Second, to instantiate our idea, we propose Movable Object Radiance Fields (MORF), which integrates 2D movable object segmentation with neural rendering to allow 3D movable object discovery. Third, we demonstrate that MORF allows photometric and geometric reconstruction and editing of complex 3D scenes with textured objects from diverse unseen categories.
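The kind of object manipulation a factorized 3D representation affords can be sketched in a few lines: moving an object amounts to evaluating its radiance field at rigidly transformed query coordinates, with no retraining. This is a generic illustration of editing with neural fields rather than MORF's specific interface; `sphere_field` and `translated_field` are hypothetical names:

```python
import numpy as np

def sphere_field(points):
    """Toy object radiance field: density 1.0 inside a sphere of radius 0.5
    centered at the origin, 0.0 outside."""
    return (np.linalg.norm(points, axis=-1) < 0.5).astype(float)

def translated_field(field_fn, offset):
    """Move an object by `offset`: querying the original field at x - offset
    places the object's density at the new location."""
    offset = np.asarray(offset, dtype=float)
    def moved(points):
        return field_fn(points - offset)
    return moved
```

For example, `translated_field(sphere_field, [1.0, 0.0, 0.0])` reports density near (1, 0, 0) instead of the origin; rotations work the same way, with the inverse rotation applied to the query points.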

2. RELATED WORK

Unsupervised 2D object discovery. Our method is closely related to recent work on unsupervised scene decomposition, which aims to decompose multi-object scenes into separate object-centric representations from images without human annotations. Most works formulate the problem as learning compositional generative models in the 2D image space. They decompose a visual scene into a set of localized object-centric patches (Eslami et al., 2016; Crawford & Pineau, 2019; Kosiorek et al., 2018; Lin et al., 2020; Jiang et al., 2019a) or a set of scene mixture components (Burgess et al., 2019; Greff et al., 2019; 2016; 2017; Engelcke et al., 2019; Locatello et al., 2020; Monnier et al., 2021; Jiang et al., 2019b). The scene mixture models typically generate single-object RGBA images and blend them to reconstruct the full-scene images, using iterative inference with recurrent networks (Burgess et al., 2019) or set-based convolutional encoders (Locatello et al., 2020). However, these methods have so far been unable to scale to complex real-world images. A recent branch of work on self-supervised object segmentation explores additional supervision signals such as motion and depth for learning object segmentations (Bear et al., 2020; Kipf et al., 2021; Chen et al., 2022;



Figure 1: Illustration of unsupervised, category-agnostic, object-centric 3D representation learning. Given a single image, our goal is to infer object radiance fields that allow photometric and geometric 3D reconstruction. This factorized representation enables 3D scene manipulation, including moving objects and replacing the background.

