SCALABLE 3D OBJECT-CENTRIC LEARNING

Abstract

We tackle the task of unsupervised 3D object-centric representation learning on scenes of potentially unbounded scale. Existing approaches to object-centric representation learning struggle to achieve scalable inference because they depend on a fixed global coordinate system. In contrast, we propose to learn view-invariant 3D object representations in localized object coordinate systems. To this end, we estimate each object's pose and appearance representation separately and explicitly project object representations across views. We adopt amortized variational inference to process sequential inputs and update object representations online. To scale our model to scenes with an arbitrary number of objects, we further introduce a Cognitive Map that allows objects to be registered on and queried from a global map. We employ an object-centric neural radiance field (NeRF) as our 3D scene representation, which is jointly inferred by our unsupervised object-centric learning framework. Experimental results demonstrate that our method can infer and maintain object-centric representations of unbounded 3D scenes. When further combined with a per-object NeRF finetuning process, our model achieves scalable, high-quality object-aware scene reconstruction.

1. INTRODUCTION

The ability to understand 3D surroundings in an object-centric way is crucial for AI agents to perform a range of tasks, including relational reasoning (Chang et al., 2017) and reinforcement learning (Diuk et al., 2008). In recent years, 2D and 3D unsupervised object-centric learning have attracted increasing attention. Compared with 2D object-centric learning methods (Eslami et al., 2016; Lin et al., 2020; Burgess et al., 2019; Crawford & Pineau, 2019; Locatello et al., 2020), which focus on decomposing 2D images into objects, 3D methods aim to recover complete 3D scene structures in an object-centric way from RGB or RGBD observations (Li et al., 2020; Stelzner et al., 2021; Chen et al., 2021; Henderson & Lampert, 2020). A key limitation of existing 3D methods is that they can only handle scenes whose scale fits into the field of view (FOV) of a fixed number of cameras (Li et al., 2020; Stelzner et al., 2021; Chen et al., 2021), where the camera coordinates are specified within a global coordinate system. The trained inference models are thus highly dependent on the chosen global coordinate system (Li et al., 2020; Chen et al., 2021; Kabra et al., 2021; Eslami et al., 2018; Henderson & Lampert, 2020) and cannot generalize beyond the scale of the training sets. These limitations restrict the applicability of existing 3D methods to real-world problems, or even to simulated reinforcement learning environments (Beattie et al., 2016), where scenes of large or even unbounded scale are routinely encountered. In this paper, we propose the Scalable Online Object-Centric network in 3D (SOOC3D). SOOC3D addresses the problem of scalability by inferring object poses and view-invariant object representations in localized object coordinate systems from RGBD data. To handle sequential data from large-scale scenes, we exploit amortized variational inference for online inference.
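The key property that makes inference independent of the global frame is that the camera pose can be re-expressed in each object's local coordinate system before conditioning the appearance representation on it. A minimal sketch of this change of frame, assuming poses are given as 4x4 homogeneous transforms (the function name and conventions here are illustrative, not from the paper's implementation):

```python
import numpy as np

def camera_in_object_frame(T_wc, T_wo):
    """Express a camera pose in an object's local coordinate frame.

    T_wc: (4, 4) camera-to-world transform
    T_wo: (4, 4) object-to-world transform (the inferred object pose)
    Returns the camera-to-object transform T_oc = T_wo^{-1} @ T_wc.

    Conditioning an appearance representation on T_oc instead of T_wc
    makes it invariant to the choice of global world frame: applying
    the same rigid transform G to both poses leaves T_oc unchanged,
    since (G T_wo)^{-1} (G T_wc) = T_wo^{-1} T_wc.
    """
    return np.linalg.inv(T_wo) @ T_wc
```

This invariance is exactly what lets a trained inference model generalize beyond the coordinate range seen during training.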
Inferred object poses allow object representations to be explicitly projected across views with preserved identities. To keep track of all detected objects throughout the online update, we introduce a highly scalable external memory mechanism named the Cognitive Map,¹ which can be used to dynamically register and query detected object representations. This memory mechanism further removes a constraint of existing works (Henderson & Lampert, 2020; Burgess et al., 2019; Locatello et al., 2020; Engelcke et al., 2020a; Yu et al., 2022) whereby the maximum number of objects allowed in each scene is capped. We adopt an object-aware Neural Radiance Field (NeRF) to decode these representations into 3D geometry for training. While per-scene NeRFs with direct SGD optimization can capture detailed 3D scenes (Mildenhall et al., 2021; Zhang et al., 2022; Tancik et al., 2022), the reconstruction quality of unsupervised object-centric NeRF learning methods commonly falls short of per-scene NeRF approaches, as the introduced information bottlenecks filter out high-frequency information (Engelcke et al., 2020b). To narrow this gap, we introduce a per-object NeRF finetuning process that improves reconstruction quality while preserving object identities. Our contributions are summarised as follows. i) We propose, to the best of our knowledge, the first unbounded, scalable, generative-model-based unsupervised 3D object-centric learning framework. ii) We learn explicit object poses and view-invariant object representations separately via an amortized variational inference framework to achieve scalable online updating. iii) To store the potentially unbounded number of objects detected during scalable inference, we introduce the Cognitive Map, which separates object representation management from the inference process. iv) We demonstrate that reconstruction quality can be further improved via our per-object NeRF finetuning process while preserving object identities.
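At a high level, the Cognitive Map behaves like a spatial key-value store: objects are registered under their estimated world positions and retrieved by region queries, so capacity is decoupled from the inference network. The following is a toy sketch of such a mechanism, assuming a coarse spatial hash over object positions; all names and the hashing scheme are illustrative, not the paper's implementation:

```python
import math
from collections import defaultdict

class CognitiveMap:
    """Toy spatial key-value store for object representations.

    Objects are bucketed by a coarse spatial hash of their estimated
    world position, so registration and region queries touch only a
    few buckets regardless of how large the scene grows.
    """

    def __init__(self, cell_size=4.0):
        self.cell_size = cell_size
        self.cells = defaultdict(dict)  # cell index -> {object_id: (pos, latent)}

    def _cell(self, pos):
        return tuple(math.floor(c / self.cell_size) for c in pos)

    def register(self, obj_id, pos, latent):
        """Insert or overwrite an object's pose and representation."""
        # Drop a stale entry if the object previously sat in another cell.
        for cell in list(self.cells):
            self.cells[cell].pop(obj_id, None)
        self.cells[self._cell(pos)][obj_id] = (pos, latent)

    def query(self, center, radius):
        """Return all registered objects within `radius` of `center`."""
        r_cells = int(math.ceil(radius / self.cell_size))
        cx, cy, cz = self._cell(center)
        hits = {}
        for dx in range(-r_cells, r_cells + 1):
            for dy in range(-r_cells, r_cells + 1):
                for dz in range(-r_cells, r_cells + 1):
                    for oid, (pos, lat) in self.cells[(cx + dx, cy + dy, cz + dz)].items():
                        if sum((p - c) ** 2 for p, c in zip(pos, center)) <= radius ** 2:
                            hits[oid] = (pos, lat)
        return hits
```

Because lookups depend only on the queried region, the memory can grow with the scene without slowing down per-view inference.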
Figure 1: Left: The interaction between the inference pipeline and the cognitive map for scene updating (top) and novel view synthesis (bottom). Previously detected objects are registered in the cognitive map. When a new view arrives, the representations of all objects present in the view are retrieved. {λ_i} are the object representations and λ_s is the scene layout representation. If the number of objects in the current view is less than a pre-defined value K, we add prior representations (greyed λ_i). We update {λ_i} to integrate new information with the amortized variational inference process and register them back into the cognitive map. For novel view synthesis, the retrieved representations are decoded into NeRF components. By composing these components, we synthesize the RGB image, depth, segmentation, and uncertainty maps. Right: An L-iteration amortized variational inference process. In each iteration, the set of input representations is decoded into NeRF components. We evaluate the likelihood of the observation under the composed NeRF. The refinement network takes the decoded NeRF, the observation, and the likelihood to update the object-centric representations.
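The scene-updating loop in the figure can be summarized schematically. This is a structural sketch only: `retrieve`, `refine`, `decode_nerf`, `log_likelihood`, and `register` stand in for the learned refinement network, the NeRF decoder, the rendering likelihood, and the cognitive-map operations, none of which are specified here.

```python
def update_scene(cognitive_map, view, K, L, prior_latent,
                 retrieve, refine, decode_nerf, log_likelihood, register):
    """One online update step, mirroring Figure 1 (left, top):
    retrieve visible objects, pad to K slots with prior representations,
    run L amortized variational refinement iterations, and write the
    updated representations back into the cognitive map.

    All callables are placeholders for learned modules / map operations.
    """
    latents = retrieve(cognitive_map, view)        # objects present in this view
    while len(latents) < K:                        # pad with prior slots (greyed lambda_i)
        latents.append(prior_latent())
    for _ in range(L):                             # L-iteration refinement (Figure 1, right)
        nerf = decode_nerf(latents)                # decode slots into NeRF components
        ll = log_likelihood(nerf, view)            # likelihood of observation under composed NeRF
        latents = refine(latents, nerf, view, ll)  # refinement network updates the slots
    register(cognitive_map, latents, view)         # register updated objects back into the map
    return latents
```

Novel view synthesis reuses only the retrieval and decoding steps: retrieved representations are decoded into NeRF components and composed to render RGB, depth, segmentation, and uncertainty.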

2. RELATED WORK

Unsupervised Generative Model-based 2D Object-centric Learning. 2D object-centric learning aims to group pixels covering the same object under the same label and, at the same time, produce a neural representation of each discovered object. At the core of these methods is the spatial mixture model formulation, which frames object-centric learning as a latent variable inference problem (Eslami et al., 2016; Greff et al., 2019; 2017). To handle observations with a high object density, a line of work (Eslami et al., 2016; Crawford & Pineau, 2019; Lin et al., 2020) infers latent variables for local regions of each 2D observation. Pipelines equipped with iterative refinement modules (Greff et al., 2017; Locatello et al., 2020) refine the latent variables iteratively conditioned on an input view. In particular, IODINE (Greff et al., 2019) employs amortized variational inference (Marino et al., 2018), which can process sequential data. However, the aforementioned methods do not infer 3D structure, and object latents are discarded once out of view.

Unsupervised Generative Model-based 3D Object-centric Learning. 3D-aware methods not only factorize observations in an object-centric manner but also infer the 3D spatial structure of scenes, which can be examined by means of novel view synthesis. Similar to its 2D counterpart,

¹The term cognitive map is borrowed from cognitive psychology studies of mental representations of spatial surroundings in animals and the human brain (Kitchin, 1994).

