LEARNING TO PERCEIVE OBJECTS BY PREDICTION

Abstract

The representation of objects is the building block of higher-level concepts. Infants develop the notion of objects without supervision, and the prediction error of future sensory input is likely a major teaching signal. We assume that the goal of representing objects distinctly is to allow predicting the coherent motion of all parts of an object independently from the background, while keeping track of relatively few parameters of the object's motion. To realize this, we propose a framework that extracts object-centric representations from single 2D images by learning to predict future scenes containing moving objects. The model learns to explicitly infer objects' locations in a 3D environment, generate 2D segmentation masks of objects, and perceive depth. Importantly, the model requires no supervision or pre-training; it assumes rigid-body motion and needs only the observer's self-motion at training time. Further, by evaluating on a new synthetic dataset with more complex textures on objects and the background, we find that our model overcomes the reliance on clustering colors to segment objects, a limitation of previous models that do not use motion information. Our work demonstrates a new approach to learning symbolic representations grounded in sensation and action.

1. INTRODUCTION

Visual scenes are composed of various objects in front of a background. Discovering objects from 2D images and inferring their 3D locations is crucial for planning actions in robotics (Devin et al., 2018; Wang et al., 2019), and it can potentially provide a better abstraction of the environment for reinforcement learning (RL), e.g., Veerapaneni et al. (2020). The appearance and spatial arrangement of objects, together with the lighting and the viewing angle, determine the 2D images formed on the retina or a camera. Objects are therefore latent causes of 2D images, and discovering objects is a process of inferring latent causes (Kersten et al., 2004). The predominant approach in computer vision for identifying and localizing objects relies on supervised learning to infer bounding boxes (Ren et al., 2015; Redmon et al., 2016) or pixel-level segmentations of objects (Chen et al., 2017). However, the supervised approach requires expensive human labeling, and it is difficult to label every possible category of objects. Therefore, an increasing interest has developed recently in object-centric representation learning (OCRL): building unsupervised or self-supervised models to infer objects from images, such as MONet (Burgess et al., 2019), IODINE (Greff et al., 2019), slot-attention (Locatello et al., 2020), GENESIS (Engelcke et al., 2019; 2021), C-SWM (Kipf et al., 2019), mulMON (Nanbo et al., 2020), and SAVi++ (Elsayed et al., 2022). The majority of early OCRL works were demonstrated on relatively simple scenes with objects of pure colors and backgrounds lacking complex textures. As recently pointed out, the success of several recent models based on a variational auto-encoder (VAE) architecture (Kingma & Welling, 2013; Rezende et al., 2014) depends on a capacity bottleneck that must be intricately balanced against a reconstruction loss (Engelcke et al., 2020). Potentially due to the lack of sufficient inductive bias for real-world environments, such methods often fail in scenes with complex textures on objects and the background (Greff et al., 2019).

To overcome this limitation, recent works have used optical flow (either ground truth or estimated) as a prediction target (Kipf et al., 2021), because optical flow is often coherent within an object and distinct from the background. Similarly, depth often exhibits sharp changes across object boundaries; when used as an additional prediction target, it further improves the segmentation performance of the slot-attention model (Elsayed et al., 2022). Although these new prediction targets allow models to perform better at unsupervised object segmentation in realistic environments with complex textures, these models still pose a sharp contrast to the learning
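The rigid-body motion assumption underlying this line of work can be made concrete with a minimal sketch (the function names and numeric values below are illustrative, not taken from any of the cited models): every 3D point on a rigid object undergoes the same rotation R and translation t, so tracking only six motion parameters suffices to predict the coherent motion of all of the object's parts.

```python
import numpy as np

def rigid_motion(points, R, t):
    """Apply one shared rigid-body transform (R, t) to all 3D points of an object."""
    return points @ R.T + t

# Illustrative motion: a 5-degree rotation about the z-axis plus a small translation.
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.1, 0.0, 0.0])

# A few 3D points belonging to one object (hypothetical coordinates).
object_points = np.array([[1.0,  0.0, 2.0],
                          [1.2,  0.1, 2.0],
                          [0.9, -0.1, 2.1]])
moved = rigid_motion(object_points, R, t)

# Rigid motion preserves pairwise distances: all parts of the object move coherently,
# which is what makes a shared, low-dimensional motion description predictive.
d_before = np.linalg.norm(object_points[0] - object_points[1])
d_after = np.linalg.norm(moved[0] - moved[1])
```

Under this assumption, predicting the future image of an object reduces to estimating (R, t) per object (plus the observer's self-motion), rather than predicting each pixel's motion independently.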

