UNSUPERVISED 3D SCENE REPRESENTATION LEARNING VIA MOVABLE OBJECT INFERENCE

Abstract

Unsupervised, category-agnostic, object-centric 3D representation learning for complex scenes remains an open problem in computer vision. While a few recent methods can now discover 3D object radiance fields from a single image without supervision, they are limited to simplistic scenes with objects of a single category, often with uniform colors. This is because they discover objects purely based on appearance cues: objects are made of pixels that look alike. In this work, we propose Movable Object Radiance Fields (MORF), aiming to scale to complex scenes with diverse categories of objects. Inspired by cognitive science studies of object learning in babies, MORF learns 3D object representations via movable object inference. During training, MORF first obtains 2D masks of movable objects via a self-supervised movable object segmentation method; it then bridges the gap to 3D object representations via conditional neural rendering in multiple views. During testing, MORF can discover, reconstruct, and move unseen objects from novel categories, all from a single image. Experiments show that MORF extracts accurate object geometry and supports realistic object and scene reconstruction and editing, significantly outperforming the state of the art.

1. INTRODUCTION

Learning object-centric 3D representations of complex scenes is a critical precursor to a wide range of application domains in vision, robotics, and graphics. The ability to factorize a scene into objects provides the flexibility of querying the properties of individual objects, which greatly facilitates downstream tasks such as visual reasoning, visual dynamics prediction, manipulation, and scene editing. Furthermore, we hypothesize that building factorized representations provides a strong inductive bias for compositional generalization (Greff et al., 2020), which in turn enables the model to understand novel scenes with previously unseen objects and configurations.

While supervised learning methods have shown promise in learning 3D object representations (such as neural radiance fields (Mildenhall et al., 2020)) from images (Ost et al., 2021; Kundu et al., 2022; Müller et al., 2022), they rely on annotations of specific object and scene categories. A recent line of work (Yu et al., 2022; Stelzner et al., 2021) has explored the problem of unsupervised discovery of object radiance fields. These models can be trained from multi-view RGB or RGB-D images to learn object-centric 3D scene decomposition without annotations of object segments and categories. However, they are only demonstrated to work well on simplistic scenes, and it remains challenging to obtain accurate reconstructions on more complex datasets. A fundamental reason is that they heavily rely on visual appearance similarity to discover object entities, which limits their scalability beyond simple texture-less objects.

In this work, we aim to scale unsupervised 3D object-centric representation learning to complex visual scenes with textured objects from diverse categories. To this end, we propose a Movable Object Radiance Fields (MORF) model, which learns to infer 3D object radiance fields from a single image.
Rather than appearance similarity, the underlying principle MORF uses to discover object entities is material coherence under everyday physical actions, i.e., an object is movable as a whole in 3D space (Spelke, 1990). However, it is challenging to obtain learning signals to directly infer movable objects in 3D. MORF addresses this problem by integrating a recent self-supervised 2D movable object segmentation method, EISEN (Chen et al., 2022), to extract movable object segments in 2D images, as well as differentiable neural rendering to bridge the gap between 2D learning signals and
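The neural-rendering bridge can be made concrete with a minimal numpy sketch of NeRF-style volume rendering along a single ray. The toy radiance field below (a hard-coded sphere) stands in for a learned, image-conditioned object field and is purely illustrative, not MORF's actual architecture; the point is that the rendered color can be compared against the RGB image and the accumulated alpha against a 2D object mask, so 2D supervision reaches the 3D representation.

```python
import numpy as np

def toy_field(points):
    """Toy radiance field: a solid red sphere of radius 0.5 at the origin.
    Stands in for a learned, image-conditioned object field (illustrative only).
    Returns per-point (density, rgb)."""
    dist = np.linalg.norm(points, axis=-1)
    density = 10.0 * (dist < 0.5).astype(np.float64)
    rgb = np.broadcast_to(np.array([1.0, 0.2, 0.2]), points.shape).copy()
    return density, rgb

def render_ray(origin, direction, near=0.0, far=2.0, n_samples=64):
    """Volume rendering along one ray via alpha compositing.
    Returns (color, accumulated alpha); the alpha can be supervised with a
    2D object mask, the color with the corresponding image pixel."""
    t = np.linspace(near, far, n_samples)
    delta = t[1] - t[0]
    points = origin[None, :] + t[:, None] * direction[None, :]
    density, rgb = toy_field(points)
    alpha = 1.0 - np.exp(-density * delta)                         # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = trans * alpha
    color = (weights[:, None] * rgb).sum(axis=0)
    acc_alpha = weights.sum()                                      # rendered "mask" value
    return color, acc_alpha

direction = np.array([0.0, 0.0, 1.0])
hit_color, hit_acc = render_ray(np.array([0.0, 0.0, -1.5]), direction)    # through the sphere
miss_color, miss_acc = render_ray(np.array([1.0, 0.0, -1.5]), direction)  # misses the sphere
```

A ray through the object accumulates near-opaque alpha and the field's color, while a ray that misses returns zero alpha; squared error on these two quantities against the observed pixel and the EISEN mask is the kind of 2D signal that can train a 3D field.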

