OBPOSE: LEVERAGING POSE FOR OBJECT-CENTRIC SCENE INFERENCE AND GENERATION IN 3D

Abstract

We present OBPOSE, an unsupervised object-centric inference and generation model which learns 3D-structured latent representations from RGB-D scenes. Inspired by prior art in 2D representation learning, OBPOSE uses a factorised latent space, separately encoding object location (where) and appearance (what). OBPOSE further leverages an object's pose (i.e. location and orientation), defined via a minimum volume principle, as a novel inductive bias for learning the where component. To achieve this, we propose an efficient, voxelised approximation approach to recover the object shape directly from a neural radiance field (NeRF). As a consequence, OBPOSE models each scene as a composition of NeRFs, richly representing individual objects. To evaluate the quality of the learned representations, OBPOSE is evaluated quantitatively on the YCB, MultiShapeNet, and CLEVR datasets for unsupervised scene segmentation, outperforming the current state-of-the-art in 3D scene inference (ObSuRF) by a significant margin. Generative results demonstrate qualitatively that the same OBPOSE model can both generate novel scenes and flexibly edit the objects in them. These capabilities again reflect the quality of the learned latents and the benefits of disentangling the where and what components of a scene. Key design choices made in the OBPOSE encoder are validated with ablations.
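To make the voxelised shape-recovery idea concrete, the following is a minimal sketch, not the paper's implementation: a stand-in density function plays the role of a trained NeRF's density head, the shape is recovered by querying densities on a regular voxel grid and thresholding into an occupancy volume, and the minimum-volume pose principle is approximated here by a PCA-aligned bounding box (centroid plus principal axes). All function names and parameter values are illustrative assumptions.

```python
import numpy as np

def density(points):
    # Stand-in for a trained NeRF's density output (sigma); here, a solid
    # axis-aligned ellipsoid with semi-axes (0.4, 0.2, 0.1). Illustrative only.
    return (np.sum((points / np.array([0.4, 0.2, 0.1])) ** 2, axis=-1) < 1.0).astype(float)

def voxelised_shape(density_fn, res=32, extent=1.0, thresh=0.5):
    # Query the density field on a regular voxel grid and threshold it
    # into a set of occupied voxel centres (the approximate object shape).
    axis = np.linspace(-extent, extent, res)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
    occupied = density_fn(grid) > thresh
    return grid[occupied]  # (M, 3) occupied voxel centres

def approx_pose(points):
    # Approximate a minimum-volume box by a PCA-aligned box:
    # location = centroid of the occupied voxels,
    # orientation = principal axes, sorted by decreasing variance.
    centre = points.mean(axis=0)
    cov = np.cov((points - centre).T)
    _, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    R = eigvecs[:, ::-1]               # columns: longest axis first
    return centre, R

pts = voxelised_shape(density)
centre, R = approx_pose(pts)
# For this ellipsoid, centre is near the origin and the first column of R
# aligns (up to sign) with the x-axis, the longest semi-axis.
```

A true minimum-volume oriented bounding box would require a search over rotations (e.g. via the convex hull); PCA is only a cheap surrogate used here to illustrate how an asymmetric shape yields a canonical pose.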

1. INTRODUCTION

In recent years, object-centric representations have emerged as a paradigm shift in machine perception. Intuitively, inference or prediction tasks in downstream applications are significantly simplified by reducing the dimensionality of the hypothesis space from raw perceptual inputs, such as pixels or point clouds, to something more akin to a traditional state-space representation. While reasoning over objects rather than pixels has long been the aspiration of machine vision research, it is the ability to learn such a representation in an unsupervised, generative way that unlocks the use of large-scale, unlabelled data for this purpose. As a consequence, research into object-centric generative models (OCGMs) is rapidly gathering pace. Central to the success of an OCGM are the inductive biases used to encourage the decomposition of a scene into its constituent components. With the field still largely in its infancy, much of the work to date has confined itself to 2D scene observations to achieve both scene inference (e.g. Eslami et al., 2016; Burgess et al., 2019; Greff et al., 2016; 2017; Locatello et al., 2020; Engelcke et al., 2019; 2021) and, in some cases, generation (e.g. Engelcke et al., 2019; 2021). In contrast, unsupervised methods for object-centric scene decomposition operating directly on 3D inputs remain comparatively unexplored (Elich et al., 2022; Stelzner et al., 2021), despite the benefits afforded by the added information contained in the input. As a case in point, Stelzner et al. (2021) recently established that access to 3D information significantly speeds up learning. Another benefit is that, for the parts of an object visible to a 3D sensor, object shape is readily accessible and does not have to be inferred, either from a single view (e.g. Liu et al., 2019; Kato et al., 2018) or from multiple views (e.g. Yu et al., 2021; Xie et al., 2019).
We conjecture that object shape can serve as a highly informative inductive bias for object-centric learning. As we elaborate below, we reason that the asymmetry of a shape can be used to discover an object's pose, and pose can help to identify and locate an object in space. Here we present OBPOSE, an unsupervised OCGM that takes RGB-D images (or video) as input and learns to segment the underlying scene into its constituent 3D objects, as well as into an explicit background representation. As we will show, OBPOSE can also be used for scene generation and editing. Inspired by prior art in 2D settings (Eslami et al., 2016; Crawford & Pineau, 2019; Lin et al., 2020; Kosiorek et al., 2018; Jiang et al., 2019; Wu et al., 2021), OBPOSE factorises its

