OBPOSE: LEVERAGING POSE FOR OBJECT-CENTRIC SCENE INFERENCE AND GENERATION IN 3D

Abstract

We present OBPOSE, an unsupervised object-centric inference and generation model which learns 3D-structured latent representations from RGB-D scenes. Inspired by prior art in 2D representation learning, OBPOSE considers a factorised latent space, separately encoding object location (where) and appearance (what). OBPOSE further leverages an object's pose (i.e. location and orientation), defined via a minimum volume principle, as a novel inductive bias for learning the where component. To achieve this, we propose an efficient, voxelised approximation approach to recover the object shape directly from a neural radiance field (NeRF). As a consequence, OBPOSE models each scene as a composition of NeRFs, richly representing individual objects. To evaluate the quality of the learned representations, OBPOSE is evaluated quantitatively on the YCB, MultiShapeNet, and CLEVR datasets for unsupervised scene segmentation, outperforming the current state-of-the-art in 3D scene inference (ObSuRF) by a significant margin. Generative results qualitatively demonstrate that the same OBPOSE model can both generate novel scenes and flexibly edit the objects in them. These capacities again reflect the quality of the learned latents and the benefits of disentangling the where and what components of a scene. Key design choices made in the OBPOSE encoder are validated with ablations.

1. INTRODUCTION

In recent years, object-centric representations have emerged as a paradigm shift in machine perception. Intuitively, inference or prediction tasks in downstream applications are significantly simplified by reducing the dimensionality of the hypothesis space from raw perceptual inputs, such as pixels or point clouds, to something more akin to a traditional state-space representation. While reasoning over objects rather than pixels has long been the aspiration of machine vision research, it is the ability to learn such a representation in an unsupervised, generative way that unlocks the use of large-scale, unlabelled data for this purpose. As a consequence, research into object-centric generative models (OCGMs) is rapidly gathering pace. Central to the success of an OCGM are the inductive biases used to encourage the decomposition of a scene into its constituent components. With the field still largely in its infancy, much of the work to date has confined itself to 2D scene observations to achieve both scene inference (e.g. Eslami et al., 2016; Burgess et al., 2019; Greff et al., 2016; 2017; Locatello et al., 2020; Engelcke et al., 2019; 2021) and, in some cases, generation (e.g. Engelcke et al., 2019; 2021). In contrast, unsupervised methods for object-centric scene decomposition operating directly on 3D inputs remain comparatively unexplored (Elich et al., 2022; Stelzner et al., 2021), despite the benefits due to the added information contained in the input. As a case in point, Stelzner et al. (2021) recently established that access to 3D information significantly speeds up learning. Another benefit is that, for the parts of an object visible to a 3D sensor, object shape is readily accessible and does not have to be inferred, either from a single view (e.g. Liu et al., 2019; Kato et al., 2018) or from multiple views (e.g. Yu et al., 2021; Xie et al., 2019).
We conjecture that object shape can serve as a highly informative inductive bias for object-centric learning. As we elaborate below, we reason that the asymmetry of a shape can be used to discover an object's pose, and pose can help to identify and locate an object in space. Here we present OBPOSE, an unsupervised OCGM that takes RGB-D images (or video) as input and learns to segment the underlying scene into its constituent 3D objects, as well as into an explicit background representation. As we will show, OBPOSE can also be used for scene generation and editing. Inspired by prior art in 2D settings (Eslami et al., 2016; Crawford & Pineau, 2019; Lin et al., 2020; Kosiorek et al., 2018; Jiang et al., 2019; Wu et al., 2021), OBPOSE factorises its latent embedding into components capturing an object's location and appearance (where and what components, respectively). This factorisation provides a strong inductive bias, helping the model to disentangle its input into meaningful concepts for downstream use. A key contribution of OBPOSE is the introduction of pose (i.e. location and orientation) as a novel inductive bias. OBPOSE is not a pose-estimation model. Rather, OBPOSE infers pose information from an object's shape to reduce apparent variance and to simplify the learning of the model's what component in 3D (see fig. 11 in the appendix), ultimately for use in downstream tasks like segmentation and scene editing. In an OCGM, the intuition is that the variance we observe in an object's pose should be made invariant in the latent space. OBPOSE infers pose information without supervision, using a minimum volume principle based on the tightest bounding box that encloses the object. Effectively, the tightest bounding box will reveal asymmetries in an object's shape, if there are any, which can be used to constrain the object's orientation.
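To illustrate the intuition behind the minimum volume principle, the sketch below derives a canonical pose from a point cloud's principal axes, a common PCA-based approximation of the minimum-volume oriented bounding box. This is an illustrative stand-in, not the paper's actual inference procedure; the function names and the use of SVD are our assumptions.

```python
import numpy as np

def canonical_pose(points: np.ndarray):
    """Approximate a canonical pose from shape asymmetry.

    points: (N, 3) array of object surface points.
    Returns (R, t): a rotation whose rows are the principal axes of the
    cloud and the centroid, together defining a normalised object frame.
    For asymmetric shapes the principal axes pin down the orientation
    (up to axis sign flips), mirroring how the tightest bounding box
    constrains pose.
    """
    t = points.mean(axis=0)                      # object centre (location)
    centred = points - t
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    R = vt                                       # rows = principal axes
    if np.linalg.det(R) < 0:                     # keep a right-handed frame
        R[2] *= -1
    return R, t

def to_normalised_frame(points: np.ndarray, R: np.ndarray, t: np.ndarray):
    # Express points in the object's canonical, pose-invariant frame.
    return (points - t) @ R.T
```

However the object is rotated or translated in the world frame, its extents in the normalised frame stay fixed, which is exactly the invariance the what component is meant to exploit.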
We further propose a voxelised approximation approach that recovers an object's shape in a computationally tractable way from a neural radiance field (NeRF) (Mildenhall et al., 2020). Although the recovery of an object's shape from a NeRF can be prohibitively expensive, our approach allows this to be integrated efficiently into the OBPOSE training loop. In a series of experiments, OBPOSE outperforms the current state-of-the-art in 3D scene inference, ObSuRF (Stelzner et al., 2021), by significant margins. Evaluations are performed on the CLEVR dataset (Johnson et al., 2017), the MultiShapeNet dataset (Stelzner et al., 2021; Chang et al., 2015), and the YCB dataset for unsupervised scene segmentation (Calli et al., 2015), in the latter case using both RGB-D moving-object (video) and multi-view static scenes. An ablation study on the OBPOSE encoder serves to validate the design decisions that distinguish its use of attention from alternative attention mechanisms represented by Slot Attention (Locatello et al., 2020) and GENESIS-v2 (Engelcke et al., 2021). In summary, the key contributions of this paper are: (1) a new state-of-the-art unsupervised scene segmentation model for 3D, OBPOSE, together with insights into its design decisions; (2) a novel inductive bias for 3D OCGMs, pose, together with its motivation; and (3) a general method for fast shape evaluation from NeRFs.
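A minimal sketch of the voxelised shape recovery is given below: query the NeRF's density on a coarse grid and keep the voxels whose per-voxel occupancy, derived from the standard volume-rendering alpha 1 - exp(-sigma * dx), exceeds a threshold. The function names, grid resolution, and threshold are our assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

def voxelise_nerf_shape(density_fn, lo, hi, res=32, tau=0.5):
    """Recover a sparse voxelised shape from a NeRF-style density field.

    density_fn: maps an (M, 3) array of points to (M,) densities sigma
                (here a plain function standing in for a trained NeRF).
    lo, hi:     corners of the axis-aligned query volume, shape (3,).
    Returns the (K, 3) centres of voxels whose occupancy exceeds tau.
    """
    axes = [np.linspace(l, h, res) for l, h in zip(lo, hi)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    sigma = density_fn(grid)                 # one batched query per grid
    dx = (hi[0] - lo[0]) / res               # voxel edge (assumed cubic volume)
    occupancy = 1.0 - np.exp(-sigma * dx)    # alpha from volume rendering
    return grid[occupancy > tau]
```

Because the grid is coarse and queried in a single batch, the cost is a small constant per object rather than a full per-ray integration, which is what makes shape evaluation cheap enough to sit inside the training loop.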



Figure 1: Overview. Front-view observations can be fed into OBPOSE (e.g. video of objects on a table, top left). OBPOSE then reconstructs the sparse voxelised point cloud and normalised pose of each object (two objects shown, top right). Each point within a slot is coloured to represent a specific object. Note that while an object may move in the world frame (middle row), it appears stationary when viewed from the perspective of its normalised frame (bottom row).

