LEARNING TO PERCEIVE OBJECTS BY PREDICTION

Abstract

The representation of objects is the building block of higher-level concepts. Infants develop the notion of objects without supervision, for which the error in predicting future sensory input is likely a major teaching signal. We assume that the goal of representing objects distinctly is to allow the coherent motion of all parts of an object to be predicted independently from the background, while keeping track of relatively few parameters of the object's motion. To realize this, we propose a framework that extracts object-centric representations from single 2D images by learning to predict future scenes containing moving objects. The model learns to explicitly infer objects' locations in a 3D environment, generate 2D segmentation masks of objects, and perceive depth. Importantly, the model requires no supervision or pre-training; it assumes rigid-body motion and needs only the observer's self-motion at training time. Further, by evaluating on a new synthetic dataset with more complex textures of objects and background, we found that our model overcomes the reliance on clustering colors for segmenting objects, a limitation of previous models that do not use motion information. Our work demonstrates a new approach to learning symbolic representations grounded in sensation and action.

1. INTRODUCTION

Visual scenes are composed of various objects in front of a background. Discovering objects from 2D images and inferring their 3D locations is crucial for planning actions in robotics (Devin et al., 2018; Wang et al., 2019) and can potentially provide a better abstraction of the environment for reinforcement learning (RL), e.g., Veerapaneni et al. (2020). The appearance and spatial arrangement of objects, together with the lighting and the viewing angle, determine the 2D images formed on the retina or a camera. Therefore, objects are latent causes of 2D images, and discovering objects is a process of inferring latent causes (Kersten et al., 2004). The predominant approach in computer vision for identifying and localizing objects relies on supervised learning to infer bounding boxes (Ren et al., 2015; Redmon et al., 2016) or pixel-level segmentations of objects (Chen et al., 2017). However, the supervised approach requires expensive human labeling, and it is difficult to label every possible category of objects. Therefore, interest has grown recently in object-centric representation learning (OCRL): building unsupervised or self-supervised models to infer objects from images, such as MONet (Burgess et al., 2019), IODINE (Greff et al., 2019), slot attention (Locatello et al., 2020), GENESIS (Engelcke et al., 2019; 2021), C-SWM (Kipf et al., 2019), mulMON (Nanbo et al., 2020) and SAVi++ (Elsayed et al., 2022). The majority of early OCRL works are demonstrated on relatively simple scenes with objects of pure colors and backgrounds lacking complex textures. As recently pointed out, the success of several recent models based on a variational auto-encoder (VAE) architecture (Kingma & Welling, 2013; Rezende et al., 2014) depends on a capacity bottleneck that needs to be intricately balanced against a reconstruction loss (Engelcke et al., 2020).
Potentially because they lack inductive biases suited to real-world environments, such methods often fail in scenes with complex textures on objects and background (Greff et al., 2019). To overcome this limitation, recent works utilized optical flow (either ground truth or estimated) as a prediction target (Kipf et al., 2021), because optical flow is often coherent within objects and distinct from the background. Similarly, depth often exhibits sharp changes across object boundaries; when used as an additional prediction target, it further improves the segmentation performance of the slot-attention model (Elsayed et al., 2022). Although these new prediction targets allow models to perform better at unsupervised object segmentation in realistic environments with complex textures, these models still stand in sharp contrast to the learning ability of the brain: neither depth nor optical flow is available as external input to the brain, yet infants learn to understand the concept of an object by 8 months of age (Piaget & Cook, 1952; Flavell, 1963), with other evidence suggesting this may be achieved as early as 3.5-4.5 months (Baillargeon, 1987). The fact that this ability develops without supervision, before infants can name objects (around 12 months of age), underscores both the importance of learning object-centric representations for developing higher-level concepts and the gap between current models and the brain. To narrow this gap, this paper starts by considering the constraints faced by the brain and proposes a new architecture and learning objective using signals similar to those the brain has access to. As the brain lacks direct external supervision for object segmentation, the most likely learning signal is the error of predicting the future.
In the brain, a copy of the motor command (an efference copy) is sent from the motor cortex simultaneously to the sensory cortex, which is hypothesized to facilitate the prediction of changes in sensory input caused by self-generated motion (Feinberg, 1978). What remains to be predicted are the changes in visual input due to the motion of external objects. We therefore assume that the functional purpose of grouping pixels into objects is to allow the motion of all pixels constituting an object to be predicted coherently by tracking very few parameters (e.g., the location, pose, and speed of the object). Driven by this hypothesis, our contributions in this paper are: (1) we combine predictive learning and explicit 3D motion prediction to learn 3D-aware object-centric representations from RGB image input without any supervision or pre-training, which we call Object Perception by Predictive LEarning (OPPLE); (2) we provide a new dataset (to be released upon publication) with complex surface textures and motion of both the camera and objects to evaluate object-centric representation models; on this dataset, we confirm that several previous models overly rely on clustering colors to segment objects; (3) although our model leverages image prediction as a learning objective, the architecture generalizes its object segmentation and spatial localization abilities to single-frame images.
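The efference-copy idea above can be made concrete with a minimal geometric sketch: if the observer knows its own translation and rotation between two moments, the apparent displacement of any static 3D point is fully determined, so only the residual motion of objects remains to be explained. The function names and the yaw-only camera-frame convention below are illustrative assumptions, not part of the model itself.

```python
import numpy as np

def rotation_yaw(alpha):
    """Rotation by yaw angle alpha about the vertical (y) axis,
    in a camera frame with x: right, y: up, z: forward."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def egocentric_next(p_cam, d_o, d_alpha):
    """Predict the camera-frame coordinate of a STATIC 3D point after the
    observer translates by d_o (expressed in the current camera frame) and
    turns by yaw d_alpha. A static point appears to move opposite to the
    self-motion, which is why the inverse transform is applied."""
    return rotation_yaw(-d_alpha) @ (p_cam - d_o)

# A point 2 m straight ahead, after the observer steps 1 m forward,
# should appear 1 m straight ahead.
p_next = egocentric_next(np.array([0.0, 0.0, 2.0]),
                         np.array([0.0, 0.0, 1.0]), 0.0)
```

Any pixel whose observed displacement deviates from this prediction must belong to an independently moving object, which is exactly the residual that object-centric grouping is assumed to compress.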

2. METHOD

Here, we outline our problem statement and then explain the components of our model, the prediction approach, and the learning objective. Pseudocode for our algorithm and implementation details are provided in the appendix (A.1, A.2).

2.1. PROBLEM FORMULATION

We denote a scene as a set of distinct objects and a background, $S = \{O_1, O_2, \ldots, O_K, B\}$, where $K$ is the number of objects in the scene. At any moment $t$, we describe each object $k$ from the perspective of an observer (camera) by two state variables: its location $x_k^{(t)} \in \mathbb{R}^3$, the 3D coordinate of the $k$-th object, and its pose $\phi_k^{(t)}$, its yaw angle from a canonical pose, both expressed in the reference frame of the camera (for simplicity, we do not consider pitch and roll here and leave the extension to full 3D pose for future work). At time $t$, given the location of the camera $o^{(t)} \in \mathbb{R}^3$ and its facing direction $\alpha^{(t)}$, $S$ renders a 2D image on the camera as $I^{(t)} \in \mathbb{R}^{w \times h \times 3}$, where $w \times h$ is the size of the image. Our goal is to develop a neural network that infers properties of objects given a single image $I^{(t)}$ as the sole input, without external supervision and using only the intrinsics and ego-motion of the camera:

$$\{z_{1:K}^{(t)},\; \pi_{1:K+1}^{(t)},\; \hat{x}_{1:K}^{(t)},\; \hat{p}_{\phi_{1:K}}^{(t)}\} = f_{\mathrm{obj}}(I^{(t)})$$

Here, $z_{1:K}^{(t)}$ is a set of view-invariant vectors representing the identity of each object $k$. "View-invariant" is loosely defined as $|z_k^{(t)} - z_k^{(t+\Delta t)}| < |z_k^{(t)} - z_l^{(t)}|$ for $k \neq l$ and $\Delta t > 0$ in most cases, i.e., the vector codes are more similar for the same object across views than they are for different objects. $\pi_{1:K+1}^{(t)} \in \mathbb{R}^{(K+1) \times w \times h}$ are the probabilities that each pixel belongs to each of the objects or the background ($\sum_k \pi_{kij}^{(t)} = 1$ for any pixel at $(i, j)$), which achieves object segmentation. To localize objects, $\hat{x}_{1:K}^{(t)}$ are the estimated 3D locations of the objects relative to the observer, and $\hat{p}_{\phi_{1:K}}^{(t)}$ are the estimated probability distributions over the poses of the objects. Each $\hat{p}_{\phi_k}^{(t)} \in \mathbb{R}^b$ is a probability distribution over $b$ equally-spaced bins of yaw angles in $(0, 2\pi)$.
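The shapes and normalization constraints of these outputs can be sketched as follows. This is a minimal illustration of the formulation only (the function and variable names are ours, and the raw network producing the logits is left abstract); it shows how raw outputs would be normalized so that $\pi$ sums to one over the $K+1$ slots at every pixel and each $\hat{p}_{\phi_k}$ sums to one over the $b$ yaw bins.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def normalize_outputs(seg_logits, pose_logits, loc_raw):
    """Map raw network outputs to the quantities in the formulation.
    seg_logits:  (K+1, h, w) -> pi, per-pixel probabilities over K objects + background
    pose_logits: (K, b)      -> p_phi, per-object distribution over b yaw bins in (0, 2*pi)
    loc_raw:     (K, 3)      -> x_hat, estimated 3D object locations (passed through here)
    """
    pi = softmax(seg_logits, axis=0)       # sums to 1 over the K+1 slots at each pixel
    p_phi = softmax(pose_logits, axis=-1)  # each row sums to 1 over the b yaw bins
    return pi, p_phi, loc_raw

K, b, h, w = 3, 16, 8, 8
rng = np.random.default_rng(0)
pi, p_phi, x_hat = normalize_outputs(
    rng.standard_normal((K + 1, h, w)),
    rng.standard_normal((K, b)),
    rng.standard_normal((K, 3)))
```

Representing pose as a distribution over discrete yaw bins, rather than a single angle, lets the model express the multimodal uncertainty that arises for objects with rotational symmetries.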

