LEARNING WHAT AND WHERE: DISENTANGLING LOCATION AND IDENTITY TRACKING WITHOUT SUPERVISION

Abstract

Our brain can almost effortlessly decompose visual data streams into background and salient objects. Moreover, it can anticipate object motion and interactions, which are crucial abilities for conceptual planning and reasoning. Recent object reasoning datasets, such as CATER, have revealed fundamental shortcomings of current vision-based AI systems, particularly when targeting explicit object representations, object permanence, and object reasoning. Here we introduce a self-supervised LOCation and Identity tracking system (Loci), which excels on the CATER tracking challenge. Inspired by the dorsal and ventral pathways in the brain, Loci tackles the binding problem by processing separate, slot-wise encodings of 'what' and 'where'. Loci's predictive coding-like processing encourages active error minimization, such that individual slots tend to encode individual objects. Interactions between objects and object dynamics are processed in the disentangled latent space. Truncated backpropagation through time combined with forward eligibility accumulation significantly speeds up learning and improves memory efficiency. Besides exhibiting superior performance in current benchmarks, Loci effectively extracts objects from video streams and separates them into location and Gestalt components. We believe that this separation offers a representation that will facilitate effective planning and reasoning on conceptual levels.

1. INTRODUCTION

Human perception is characterized by segmenting scenes into individual entities and their interactions [4; 13; 25]. This ability poses a non-trivial challenge for computational models of cognition [37; 84]: the binding problem [79]. Visual features need to be selectively bound into single objects, segregated from the background, and encoded by means of compressed stable neural attractors [5; 44; 46; 73]. Recent years have seen revolutionary progress in the ability of connectionist models to operate on complex natural images and videos [31; 50; 65]. Yet, neural network models still do not fully solve the binding problem [37]. Indeed, recent work on synthetic video-based reasoning datasets, like CLEVR, CLEVRER, or CATER [32; 43; 93], suggests that state-of-the-art systems [16; 89; 92] still struggle to model fundamental physical object properties, such as hollowness, blockage, or object permanence, concepts that children learn to master in the first few months of their lives [2; 63].

In a comprehensive review on the binding problem in the context of neural networks, Greff et al. [37] define three main challenges for solving the problem: representation, segregation, and composition. Representation refers to the challenge of effectively representing the essential properties of an object, including its appearance and potential dynamics. We will refer to these properties as the 'Gestalt' of an object [48; 86; 87]. Moreover, the individual objects' locations and motion dynamics should be disentangled from their Gestalt to enable compositional recombinations. Meanwhile, the representations should share a common format to enable general-purpose reasoning. Segregation describes the challenge of extracting particular objects from a perceived scene. This extraction should be done context- and task-dependently to identify the currently relevant entities. As a result, a good segregation should enable effective dynamic predictions of the whole, rather than only the parts.
Finally, composition characterizes the challenge of developing object representations that enable meaningful recombinations of object properties, particularly those that facilitate the prediction of object interaction dynamics. As a result, compositional representations enable conceptual reasoning about object properties as well as relations and interactions between objects.

We introduce Loci, a novel LOCation and Identity tracking system. While observing videos, Loci disentangles object identities ('what') from their spatial properties ('where') in a fully unsupervised, autoregressive manner. It is motivated by our brain's ventral and dorsal processing pathways [67; 80]. Loci's key contribution lies in how object-specific information is disentangled and recombined: (i) Loci fosters slot-respective object persistence over time via a novel combination of slot-specific input channels, temporal slot-interactive predictions via self-attention [83] followed by GateL0RD-RNN [39], and object permanence-oriented loss regularization. (ii) Our slot-decoding strategy combines object-specific Gestalt codes with parameterized Gaussians in a manner that is, to the best of our knowledge, novel. This combination fosters the emergent explication of an object's size, its position, and current occlusions. As a main result, we observe superior performance on the CATER benchmark: Loci outperforms previous methods by a large margin with an order of magnitude fewer parameters. Additional evaluations on moving MNIST, aquarium video footage, and the CLEVRER benchmark underline Loci's contribution towards the self-organized, disentangled identification and localization of objects as well as the effective processing of object interaction dynamics from video data.
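To make the idea of combining a Gestalt code with a parameterized Gaussian more concrete, the following minimal sketch builds an isotropic 2D Gaussian map from a position and a standard deviation and uses it to scale a Gestalt vector that is broadcast over the pixel grid. The function names, the coordinate range of [-1, 1], and the choice of element-wise multiplication are assumptions made for illustration; this section does not specify how Loci's decoder actually merges the two codes.

```python
import math


def gaussian_map(x, y, sigma, height, width):
    """Isotropic 2D Gaussian on a pixel grid with coordinates in [-1, 1].

    (x, y) is the assumed object center and sigma its assumed spatial extent.
    """
    rows = []
    for i in range(height):
        gy = -1.0 + 2.0 * i / (height - 1)
        row = []
        for j in range(width):
            gx = -1.0 + 2.0 * j / (width - 1)
            d2 = (gx - x) ** 2 + (gy - y) ** 2
            row.append(math.exp(-d2 / (2.0 * sigma ** 2)))
        rows.append(row)
    return rows


def decoder_input(gestalt, x, y, sigma, height, width):
    """Broadcast the Gestalt vector over the grid, scaled by the Gaussian.

    Returns a nested list of shape [len(gestalt)][height][width].
    Multiplication is an illustrative assumption, not the paper's stated rule.
    """
    g = gaussian_map(x, y, sigma, height, width)
    return [
        [[c * g[i][j] for j in range(width)] for i in range(height)]
        for c in gestalt
    ]


# Hypothetical 3-dimensional Gestalt code for an object centered in a 5x5 grid.
features = decoder_input([0.2, -1.0, 0.5], x=0.0, y=0.0, sigma=0.5, height=5, width=5)
```

In this sketch, the 'what' information lives entirely in the Gestalt vector, while position and size only determine where on the grid that vector is visible, which is one simple way to keep the two codes separate.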

2. RELATED WORK

Previous work [59] has emphasized that, in general, unsupervised object representation learning is impossible because infinitely many variable models are consistent with the data. Inductive biases are thus necessary to ensure the effective learning of a system that segregates a visual stream of information into effective, compositional representations [37]. Accordingly, we review related work in the light of the binding problem and its relation to the proposed Loci system.

Representation. A powerful choice of encoding format is the formulation of 'slots', which share the encoding module but keep the codes of individual objects separate from one another. To ensure a common format between the slot-wise encodings, slot-respective encoder modules typically share their weights [8; 60; 83]. To assign individual objects to individual slots, though, the system needs to break slot symmetry. Recurrent neural networks have been used to disentangle encodings or assignments [11; 28; 29; 35; 60]. Other mechanisms explicitly separate spatial slot locations [20; 42; 54], which we also do in Loci. However, instead of treating every spatial location as a potential object, Loci equips each slot with a spotlight, which is designed to approximate the object's center. To further foster a compositional object representation, Loci enforces disentanglement of 'what' from 'where' by separating an object's Gestalt code, which mainly represents shape and surface pattern, from its location, size (visual extent), and priority (current visibility with respect to other objects). This stronger disentanglement and more complex 'where' representation is related to work that models selective visual attention, realizing partially size-invariant tracking of one particular entity, such as a pedestrian [22; 45; 72].
Advancing this work, Loci tracks multiple objects in parallel, imposes interactive, object-specific spotlights, and enables more compressed, object-specific appearance representations due to its novel way of combining 'what' and 'where' for decoding.

Segregation. Segregating object instances from images is traditionally solved via bounding box detection [58; 74], where more advanced techniques extract additional masks for instance segmentation [15; 21; 40]. Through slot-attention mechanisms, recent unsupervised approaches partition



Source Code: https://github.com/CognitiveModeling/Loci



(iii) We improve sample and memory efficiency by training Loci's recurrent modules by means of time-local backpropagation combined with forward propagation of eligibility traces.
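As a rough illustration of what forward propagation of eligibility traces can look like, the sketch below accumulates the gradient of a summed squared-error loss with respect to a single recurrent weight of a scalar tanh unit, without storing past states. The scalar network, the loss, and all names are assumptions chosen for brevity; this is a generic forward-trace (RTRL-style) computation and not necessarily the scheme used to train Loci's recurrent modules.

```python
import math


def forward_trace_gradient(w, u, xs, targets):
    """Gradient of sum_t 0.5 * (h_t - target_t)^2 with respect to w.

    Model (illustrative assumption): h_t = tanh(w * h_{t-1} + u * x_t), h_0 = 0.
    The eligibility trace e_t = dh_t/dw is updated forward in time, so no
    history of hidden states has to be kept for a backward pass.
    """
    h, e, grad = 0.0, 0.0, 0.0
    for x, target in zip(xs, targets):
        h_prev = h
        h = math.tanh(w * h_prev + u * x)
        # Chain rule: dh_t/dw = (1 - h_t^2) * (h_{t-1} + w * dh_{t-1}/dw)
        e = (1.0 - h * h) * (h_prev + w * e)
        # Time-local error signal multiplied by the accumulated trace
        grad += (h - target) * e
    return grad


def sequence_loss(w, u, xs, targets):
    """Summed squared-error loss of the same scalar recurrent unit."""
    h, total = 0.0, 0.0
    for x, target in zip(xs, targets):
        h = math.tanh(w * h + u * x)
        total += 0.5 * (h - target) ** 2
    return total


# Hypothetical inputs and targets for a short sequence.
xs = [0.5, -1.0, 0.3, 0.8]
targets = [0.1, -0.2, 0.4, 0.0]
grad_w = forward_trace_gradient(0.7, -0.4, xs, targets)
```

Because the trace is carried forward alongside the hidden state, memory use in this sketch stays constant in the sequence length, which is the general property that motivates such schemes.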

