LEARNING WHAT AND WHERE: DISENTANGLING LOCATION AND IDENTITY TRACKING WITHOUT SUPERVISION

Abstract

Our brain can almost effortlessly decompose visual data streams into background and salient objects. Moreover, it can anticipate object motion and interactions, which are crucial abilities for conceptual planning and reasoning. Recent object reasoning datasets, such as CATER, have revealed fundamental shortcomings of current vision-based AI systems, particularly when targeting explicit object representations, object permanence, and object reasoning. Here we introduce a self-supervised LOCation and Identity tracking system (Loci), which excels on the CATER tracking challenge. Inspired by the dorsal and ventral pathways in the brain, Loci tackles the binding problem by processing separate, slot-wise encodings of 'what' and 'where'. Loci's predictive-coding-like processing encourages active error minimization, such that individual slots tend to encode individual objects. Interactions between objects and object dynamics are processed in the disentangled latent space. Truncated backpropagation through time combined with forward eligibility accumulation significantly speeds up learning and improves memory efficiency. Besides exhibiting superior performance on current benchmarks, Loci effectively extracts objects from video streams and separates them into location and Gestalt components. We believe that this separation offers a representation that will facilitate effective planning and reasoning on conceptual levels.

1. INTRODUCTION

Human perception is characterized by segmenting scenes into individual entities and their interactions [4; 13; 25]. This ability poses a non-trivial challenge for computational models of cognition [37; 84]: the binding problem [79]. Visual features need to be selectively bound into single objects, segregated from the background, and encoded by means of compressed, stable neural attractors [5; 44; 46; 73].

Recent years have seen revolutionary progress in the ability of connectionist models to operate on complex natural images and videos [31; 50; 65]. Yet, neural network models still do not fully solve the binding problem [37]. Indeed, recent work on synthetic video-based reasoning datasets, like CLEVR, CLEVRER, or CATER [32; 43; 93], suggests that state-of-the-art systems [16; 89; 92] still struggle to model fundamental physical object properties, such as hollowness, blockage, or object permanence: concepts that children learn to master in the first few months of their lives [2; 63].

In a comprehensive review of the binding problem in the context of neural networks, Greff et al. [37] define three main challenges for solving the problem: representation, segregation, and composition. Representation refers to the challenge of effectively representing the essential properties of an object, including its appearance and potential dynamics. We will refer to these properties as the 'Gestalt' of an object [48; 86; 87]. Moreover, the individual objects' locations and motion dynamics should be



Source Code: https://github.com/CognitiveModeling/Loci

