SCALABLE 3D OBJECT-CENTRIC LEARNING

Abstract

We tackle the task of unsupervised 3D object-centric representation learning on scenes of potentially unbounded scale. Existing approaches to object-centric representation learning cannot scale their inference because they depend on a fixed global coordinate system. In contrast, we propose to learn view-invariant 3D object representations in localized object coordinate systems. To this end, we estimate the object pose and appearance representation separately and explicitly project object representations across views. We adopt amortized variational inference to process sequential input and update object representations online. To scale our model to scenes with an arbitrary number of objects, we further introduce a Cognitive Map that allows the registration and querying of objects on a global map. We employ the object-centric neural radiance field (NeRF) as our 3D scene representation, which is jointly inferred by our unsupervised object-centric learning framework. Experimental results demonstrate that our method can infer and maintain object-centric representations of unbounded 3D scenes. When further combined with a per-object NeRF finetuning process, our model achieves scalable, high-quality object-aware scene reconstruction.

1. INTRODUCTION

The ability to understand 3D surroundings in an object-centric way is crucial for AI agents to perform a range of tasks including relational reasoning (Chang et al., 2017) and reinforcement learning (Diuk et al., 2008). In recent years, 2D and 3D unsupervised object-centric learning have attracted increasing attention in the field. Compared with 2D object-centric learning methods (Eslami et al., 2016; Lin et al., 2020; Burgess et al., 2019; Crawford & Pineau, 2019; Locatello et al., 2020) that focus on decomposing 2D images into objects, 3D learning methods aim to recover the complete 3D scene structure in an object-centric way using RGB or RGBD observations (Li et al., 2020; Stelzner et al., 2021; Chen et al., 2021; Henderson & Lampert, 2020). A key limitation of existing 3D methods is that they can only handle scenes whose scale fits into the field of view (FOV) of a fixed number of cameras (Li et al., 2020; Stelzner et al., 2021; Chen et al., 2021), where the camera coordinates are specified within a global coordinate system. The trained inference models are thus highly dependent on the chosen global coordinate system (Li et al., 2020; Chen et al., 2021; Kabra et al., 2021; Eslami et al., 2018; Henderson & Lampert, 2020) and cannot generalize beyond the scale of the training sets. All of this limits the applicability of existing 3D methods to real-world problems, or even to simulated reinforcement learning environments (Beattie et al., 2016), where scenes with large or even unbounded scales are routinely encountered.

In this paper, we propose the Scalable Online Object-Centric network in 3D (SOOC3D). SOOC3D addresses the problem of scalability by inferring object poses and view-invariant object representations in localized object coordinate systems from RGBD data. To handle sequential data for large-scale scenes, we exploit amortized variational inference for online inference.
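To make the localized-coordinate idea concrete, the sketch below shows the basic geometric operation it rests on: expressing a world-frame point (e.g. a camera position) in an object's local frame via the inverse of the object's pose. This is an illustrative sketch only, not the paper's implementation; the 4x4 homogeneous-matrix convention and the `world_to_object` helper are our own assumptions.

```python
import numpy as np

def world_to_object(T_obj_world: np.ndarray, p_world: np.ndarray) -> np.ndarray:
    """Express a world-frame point in an object's local coordinate frame.

    T_obj_world: 4x4 homogeneous object pose (object frame -> world frame).
    p_world: (3,) point in world coordinates.
    """
    T_world_obj = np.linalg.inv(T_obj_world)  # world frame -> object frame
    p_h = np.append(p_world, 1.0)             # homogeneous coordinates
    return (T_world_obj @ p_h)[:3]

# Hypothetical example: an object at world position (2, 0, 0), no rotation.
T = np.eye(4)
T[:3, 3] = [2.0, 0.0, 0.0]

# A camera at world position (5, 0, 0) sits at (3, 0, 0) in the object frame,
# independent of where the global origin was chosen.
cam_in_obj = world_to_object(T, np.array([5.0, 0.0, 0.0]))
```

Because the representation is conditioned on coordinates in the object's own frame, it is unaffected by shifts of the global origin, which is what allows inference to generalize beyond the training scene scale.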
Inferred object poses allow object representations to be explicitly projected across views with preserved identities. To keep track of all detected objects throughout the online update, we introduce a highly scalable external memory mechanism named the Cognitive Map,[1] which can be used to dynamically register and query detected object representations. This memory mechanism further removes a constraint in existing works (Henderson & Lampert, 2020; Burgess et al., 2019; Locatello et al., 2020; Engelcke et al., 2020a; Yu et al., 2022) whereby the maximum number of objects allowed in each scene is capped.
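One plausible way to realize such a register-and-query memory is spatial hashing: objects are bucketed into coarse grid cells so that a query touches only nearby cells, keeping lookup cost independent of the total number of registered objects. The sketch below is a minimal illustration under that assumption; the class name, cell size, and API are hypothetical and not taken from the paper.

```python
import numpy as np
from collections import defaultdict

class CognitiveMap:
    """Hypothetical sketch of a scalable object memory via spatial hashing.

    Each object is registered under the grid cell containing its position,
    so range queries only visit cells overlapping the query radius.
    """

    def __init__(self, cell_size: float = 4.0):
        self.cell_size = cell_size
        self.cells = defaultdict(dict)  # cell index -> {object_id: (pos, latent)}

    def _cell(self, pos):
        return tuple(np.floor(np.asarray(pos) / self.cell_size).astype(int))

    def register(self, obj_id, pos, latent):
        """Store an object's position and latent representation on the map."""
        self.cells[self._cell(pos)][obj_id] = (np.asarray(pos, dtype=float), latent)

    def query(self, center, radius):
        """Return ids of registered objects within `radius` of `center`."""
        center = np.asarray(center, dtype=float)
        r_cells = int(np.ceil(radius / self.cell_size))
        cx, cy, cz = self._cell(center)
        hits = []
        for dx in range(-r_cells, r_cells + 1):
            for dy in range(-r_cells, r_cells + 1):
                for dz in range(-r_cells, r_cells + 1):
                    cell = (cx + dx, cy + dy, cz + dz)
                    for oid, (pos, _) in self.cells.get(cell, {}).items():
                        if np.linalg.norm(pos - center) <= radius:
                            hits.append(oid)
        return hits
```

A usage example: after `m.register("chair", [1, 0, 0], latent_a)` and `m.register("lamp", [50, 0, 0], latent_b)`, the call `m.query([0, 0, 0], 5.0)` returns only `["chair"]`, without ever inspecting the distant object's cell.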



[1] The term cognitive map is borrowed from cognitive psychology studies on mental representations of the spatial surroundings in animal and human brains (Kitchin, 1994).

