LEARNING WHAT AND WHERE: DISENTANGLING LOCATION AND IDENTITY TRACKING WITHOUT SUPERVISION

Abstract

Our brain can almost effortlessly decompose visual data streams into background and salient objects. Moreover, it can anticipate object motion and interactions, which are crucial abilities for conceptual planning and reasoning. Recent object reasoning datasets, such as CATER, have revealed fundamental shortcomings of current vision-based AI systems, particularly when targeting explicit object representations, object permanence, and object reasoning. Here we introduce a self-supervised LOCation and Identity tracking system (Loci), which excels on the CATER tracking challenge. Inspired by the dorsal and ventral pathways in the brain, Loci tackles the binding problem by processing separate, slot-wise encodings of 'what' and 'where'. Loci's predictive coding-like processing encourages active error minimization, such that individual slots tend to encode individual objects. Interactions between objects and object dynamics are processed in the disentangled latent space. Truncated backpropagation through time combined with forward eligibility accumulation significantly speeds up learning and improves memory efficiency. Besides exhibiting superior performance in current benchmarks, Loci effectively extracts objects from video streams and separates them into location and Gestalt components. We believe that this separation offers a representation that will facilitate effective planning and reasoning on conceptual levels.

1. INTRODUCTION

Human perception is characterized by segmenting scenes into individual entities and their interactions [4; 13; 25]. This ability poses a non-trivial challenge for computational models of cognition [37; 84]: the binding problem [79]. Visual features need to be selectively bound into single objects, segregated from the background, and encoded by means of compressed stable neural attractors [5; 44; 46; 73]. Recent years have seen revolutionary progress in the ability of connectionist models to operate on complex natural images and videos [31; 50; 65]. Yet, neural network models still do not fully solve the binding problem [37]. Indeed, recent work on synthetic video-based reasoning datasets, like CLEVR, CLEVRER, or CATER [32; 43; 93], suggests that state-of-the-art systems [16; 89; 92] still struggle to model fundamental physical object properties, such as hollowness, blockage, or object permanence, concepts that children learn to master in the first few months of their lives [2; 63].

In a comprehensive review on the binding problem in the context of neural networks, Greff et al. [37] define three main challenges for solving the problem: representation, segregation, and composition. Representation refers to the challenge to effectively represent the essential properties of an object, including its appearance and potential dynamics. We will refer to these properties as the 'Gestalt' of an object [48; 86; 87]. Moreover, the individual objects' locations and motion dynamics should be disentangled from their Gestalt to enable compositional recombinations. Meanwhile, the representations should share a common format to enable general-purpose reasoning. Segregation describes the challenge to extract particular objects from a perceived scene. This extraction should be done context- and task-dependently to identify the currently relevant entities. As a result, a good segregation should enable effective dynamic predictions of the whole, rather than only the parts.
Finally, composition characterizes the challenge to develop object representations that enable meaningful re-combinations of object properties, particularly those that facilitate the prediction of object interaction dynamics. As a result, compositional representations enable conceptual reasoning about object properties as well as relations and interactions between objects.

We introduce Loci, a novel Location and Identity tracking system. While observing videos, Loci disentangles object identities ('what') from their spatial properties ('where') in a fully unsupervised, autoregressive manner. It is motivated by our brain's ventral and dorsal processing pathways [67; 80]. Loci's key contribution lies in how object-specific information is disentangled and recombined: (i) Loci fosters slot-respective object persistence over time via a novel combination of slot-specific input channels, temporal slot-interactive predictions via self-attention [83] followed by a GateL0RD RNN [39], and object permanence-oriented loss regularization. (ii) Our slot-decoding strategy combines object-specific Gestalt codes with parameterized Gaussians in a, to the best of our knowledge, novel manner. This combination fosters the emergent explication of an object's size, its position, and current occlusions. (iii) We improve sample and memory efficiency by training Loci's recurrent modules by means of time-local backpropagation combined with forward propagation of eligibility traces. As a main result, we observe superior performance on the CATER benchmark: Loci outperforms previous methods by a large margin with an order of magnitude fewer parameters. Additional evaluations on moving MNIST, aquarium video footage, and on the CLEVRER benchmark underline Loci's contribution towards the self-organized, disentangled identification and localization of objects as well as the effective processing of object interaction dynamics from video data.

2. RELATED WORK

Previous work by [59] has emphasized that, in general, unsupervised object representation learning is impossible because infinitely many variable models are consistent with the data. Inductive biases are thus necessary to ensure the effective learning of a system that segregates a visual stream of information into effective, compositional representations [37]. Accordingly, we review related work in the light of the binding problem and its relation to the proposed Loci system.

Representation A powerful choice of an encoding format is the formulation of 'slots', which share the encoding module but keep the codes of individual objects separate from one another. To ensure a common format between the slot-wise encodings, slot-respective encoder modules typically share their weights [8; 60; 83]. To assign individual objects to individual slots, though, the system needs to break slot symmetry. Recurrent neural networks have been used to disentangle encodings or assignments [11; 28; 29; 35; 60]. Other mechanisms explicitly separate spatial slot locations [20; 42; 54], which we also do in Loci. However, instead of treating every spatial location as a potential object, each slot has a spotlight, which is designed to approximate the object's center. To further foster a compositional object representation, Loci enforces disentanglement of 'what' from 'where' by separating an object's Gestalt code, mainly representing shape and surface pattern, from its location, size (visual extent), and priority (current visibility with respect to other objects). This stronger disentanglement and more complex 'where' representation is related to work that models selective visual attention, realizing partially size-invariant tracking of one particular entity, such as a pedestrian [22; 45; 72].
Advancing this work, Loci tracks multiple objects in parallel, imposes interactive, object-specific spotlights, and enables more compressed, object-specific appearance representations due to its novel way of combining 'what' and 'where' for decoding.

Segregation Segregating object instances from images is traditionally solved via bounding box detection [58; 74], where more advanced techniques extract additional masks for instance segmentation [15; 21; 40]. Through slot-attention mechanisms, recent unsupervised approaches partition image regions into separate slots to represent distinct objects [36; 61]. To segregate objects from images, Burgess et al. [10] and von Kügelgen et al. [85] combine a soft attention (mask) approach with encoder, recurrent attention, and decoder modules, where slots compete for encoding individual objects. Loci pursues a similar approach but segregates objects even further to encourage object-respective 'what' and 'where' representations, where the latter additionally disentangle location, size, and approximate depth. Slot-respective masks compete via softmax attention to actively minimize the prediction error of the visual content. Moreover, a pre-trained background model separates potentially interesting from uninteresting regions. While we keep the background modeling rather simple in this work, more advanced techniques may certainly be applied in the near future [26; 68; 69; 82].

Composition Compositional reasoning in our model builds on two modules, which process object-to-object interactions and object dynamics. Object-to-object interactions are modelled using Multi-Head Self-Attention (MHSA) [83], in close relation to [27; 47]. Others have employed message passing neural networks to simulate object interactions [6; 18; 41]. Another promising approach uses Neural Production Systems to dynamically select a learnable rule to model object interactions [23]. Object dynamics are modelled using a recurrent GateL0RD module [39].
GateL0RD is designed to apply latent state updates sparsely, which encourages the maintenance of stable object encodings over long periods of time. Previous approaches have also employed recurrent structures to propagate slot-wise encodings forward in time [42; 49]. Although some previous works have combined recurrent structures with attention [33; 34], recent slot-attention frameworks tend to adopt fully auto-regressive designs based on transformers without explicit internal state maintenance [27; 47; 64].

Tracking models While our primary goal is to separate object location and Gestalt representations, the extracted object locations naturally facilitate object tracking, a wide area of research in its own right [91]. State-of-the-art object tracking methods rely on features extracted via attention modules, which are typically applied autoregressively on individual frames [19; 64]. Again, these approaches do not explicitly foster a separation of 'what' from 'where', which may limit their applicability and accuracy in distractive environments, where humans still maintain high tracking skills, as shown in [55]. Loci maintains separate object encodings and thus copes with the presence of distractors more readily. The tracking model proposed in [55] is similar in spirit to Loci, extracting object encodings and propagating them with a competition mechanism and a recurrent module. However, to model complex objects, their model relies on additional supervision. Disentangling movement and appearance is a common principle in video models with a notion of optical flow [57]. However, optical-flow-based methods are not designed to deal with occlusions, as they only represent parts of the scene that are currently visible. In contrast, Loci is able to maintain the encoding of an object even under occlusion.

The CATER benchmark We evaluate Loci mainly on the CATER challenge [32].
Previous SOTA methods on this challenge, like Multi-Hopper or OpNet [76; 96], have reused well-established neural network building blocks. Others have built their systems on top of a supervised, pre-trained bounding-box-based object detector [15; 97]. Most recently, the best results were reported by the video-level TFCNet [95], which makes efficient use of spatio-temporal convolutions on long videos. In contrast, Loci effectively combines an entity-oriented representational format with several kinds of interactive neural processing modules. While Loci disentangles location from identity when trained in a fully self-supervised manner, it can be further fine-tuned with supervision to minimize a location-specific prediction error. Due to the employed inductive biases, Loci yields superior performance with a fraction of the number of parameters used in other models.

3. METHODS

The core idea of Loci involves two major aspects: First, distributing the visual input across recruitable slot-modules, which try to explain the ongoing scene dynamics in a compositional, mutually exclusive manner. A similar principle was shown to be a powerful inductive bias for emergent time series decomposition [70] . Second, splitting the encoding of a moving entity within a slot-module into its 'what' and 'where' components. These components are represented as a latent Gestalt vector ('what'), which can intuitively be understood as a symbolic encoding of an object's appearance, including its shape, color, size, texture, and other kinds of visual properties, and a latent spatial code ('where'), which explicates location, size, and relative priority [67; 80] . Moreover, we use the Siamese network approach implemented in slot-attention [60] , sharing weights across slots. In fact, Loci employs the same encoding and same decoding network for every slot starting from the raw image. Each encoder slot k generates output Gestalt vectors G k and position codes P k . Entity dynamics and interactions are then processed based on these pairs of latent Gestalt and location encodings. We predict the future states of latent encodings in a transition module and allow cross-slot interactions via multi-head self-attention. Figure 1 shows the main components of Loci. Information processing in one iteration unfolds from left to right. A detailed algorithmic description is provided in Appendix B. Here, we provide an overview and then detail the loss computation and training procedure.

3.1. LOCI OVERVIEW

The slot-wise encoder is based on a conventional ResNet architecture. The input consists of eight main features, which encode the image frame I^t at the current iteration t, the prediction error map E^t, a common background mask M^t_bg as well as, specifically for every slot k, a content mask M^t_k, a position map Q^t_k, and maps aggregating the other slots' masks M^{t,s}_k and positions Q^{t,s}_k (cf. Figure 1). The computation path of Gestalt and position for an individual slot is tree-shaped, starting with a shared ResNet trunk, after which the path splits into four pathways, which compute Gestalt, location, size, and priority. This encoder design encourages that separately moving entities are encoded in individual ResNet activity dynamics and that each slot encodes the 'what' (Gestalt) and 'where' (location, size, and priority) of the slot-encoded moving entity separately but conjointly. Thereby, the error map E^t encourages active error minimization, following the principle of predictive coding fostering attractor dynamics [44; 73]. Appendix B specifies further details. The transition module then predicts location P^t_k and Gestalt G^t_k dynamics as well as interactions between slots. The module is loosely based on the architecture of a transformer-like encoder [83], where several multi-head attention layers are stacked together with residual feed-forward layers in between. We replace the residual feed-forward layers with residual GateL0RD [39] layers. GateL0RD is a recent gated recurrent neural network module, which is very well-suited to learn and distinctively encode interaction events, which have been described as "structured, describable, memorable units of experience" [3], cf. [94]. In accordance with our entity-focused processing approach, we apply a Siamese GateL0RD version, operating on the individual slot level while receiving information from other slots via the attention layers.
At the end of the transition module, the Gestalt code is pushed through a binarization layer, inspired by the principle of vector quantization [81]. This layer enforces an information bottleneck and thus contributes to the development of disentangled entity codes. The slot-wise decoder recombines the 'what' and 'where' output from the transition module. To do so, first the potential influence of each Gestalt code G_k is computed over the output range by means of a 2D isotropic Gaussian parameterized by the position code P_k, yielding density maps Q_k. Next, an attention mechanism based on the priority codes ẑ is applied to account for the fact that only one slot-object can be visible at any location (transparent objects are left for future work; see Algorithm 2 in Appendix B). As a result, when two slots cover the same location, the one with the lower priority will have its feature maps set to zero. The rest of the decoder is based on a conventional ResNet, which increases the resolution back to the video size. Like the encoder, the decoder shares weights across all slots. Eventually, the decoder outputs RGB entity reconstructions R^{t+1}_k and individual mask predictions M^{t+1}_k for each slot k. The masks from all slots and from the background are then combined to construct the final output image prediction R^{t+1} and the competitive mask predictions M^{t+1}_k. To fully reconstruct the image, Loci uses its simple background module, which generates a background image estimate R_bg and a background mask M_bg. In the case of a static background, a Gaussian mixture model over the whole training set is used and a flat background mask is set to the bias value θ_bg. For more dynamic backgrounds, we employ a simple autoencoder.
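The Gaussian-based decoding step can be illustrated with a small sketch: a density map is rendered from a position code, and overlapping slots compete via a priority-weighted softmax. The function names, the normalized coordinate convention, and the sharpness parameter `beta` are illustrative assumptions, not Loci's exact formulation:

```python
import numpy as np

def gaussian_map(pos_x, pos_y, sigma, H, W):
    """Render a 2D isotropic Gaussian density map from a position code:
    a center in [0, 1]^2 plus a size/std parameter, as sketched in the text."""
    ys, xs = np.mgrid[0:H, 0:W]
    ys = ys / (H - 1)
    xs = xs / (W - 1)
    d2 = (xs - pos_x) ** 2 + (ys - pos_y) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def priority_attention(density_maps, priorities, beta=10.0):
    """Softmax competition over slot priorities: where several slots overlap,
    the highest-priority slot dominates and the others are suppressed
    (a hypothetical soft version of the hard zeroing described in the text)."""
    maps = np.stack(density_maps)                           # K x H x W
    logits = maps * np.asarray(priorities)[:, None, None] * beta
    w = np.exp(logits - logits.max(axis=0, keepdims=True))  # stable softmax
    return maps * (w / w.sum(axis=0, keepdims=True))
```

With two slots at the same location, the low-priority slot's density is driven towards zero, matching the winner-takes-most behavior described above.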

3.2. TRAINING

Loci is trained by applying a binary cross-entropy loss (L_BCE) pixel-wise to the frame prediction, using rectified Adam optimization [RAdam, cf. 56]. Several regularizing losses are added to foster object permanence. Additionally, to speed up learning, we use truncated backpropagation through time and enhance the gradients with an e-prop-like accumulation of previous neural activities [7]. Empirical evaluations showed that backpropagating the gradients through the networks' inputs creates instabilities in the training process. Thus, we detach the gradients for the latent states. As a result, the only parts of the network that need to backpropagate gradients through time are the GateL0RD layers. Here, we found that the described combination of e-prop and backpropagation is not only comparable in terms of network accuracy, it also greatly decreases training wall-clock time, as it allows the use of truncated backpropagation with length 1, effectively updating the weights after each input frame (see supplementary material for details). Another important aspect for successful training is the use of a warmup phase, where we mask the target of the network with a foreground mask computed with a threshold τ: M = (τ < mean_rgb((I^t - R_bg)^2)), where the mean is taken over the RGB channels and R_bg is the background model estimate detailed above. A black background is used instead of R_bg to construct the next frame prediction during this phase. The foreground mask together with the zeroed background forces the network to focus on reconstructing the foreground objects only. This encourages the usage of the position-constrained encoding and decoding mechanism. After about 30 000 updates, when the network has sufficiently learned to use the position encodings, we switch from the masked foreground reconstruction to the full reconstruction.
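The warm-up foreground mask described above can be sketched as follows; the threshold value for `tau` is a hypothetical choice:

```python
import numpy as np

def foreground_mask(frame, bg_estimate, tau=0.02):
    """Warm-up foreground mask: threshold the mean squared RGB deviation
    of the frame from the background estimate. Pixels whose deviation
    exceeds tau are treated as foreground."""
    err = np.mean((frame - bg_estimate) ** 2, axis=-1)  # mean over RGB channels
    return (err > tau).astype(np.float32)
```

During warm-up, the reconstruction target is multiplied by this mask (with a black background), so only foreground pixels contribute to the loss.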
Object Permanence Loss To encourage object permanence, an additional loss (Equation 2) favors a slot that keeps encoding the same object, even if the object is temporarily occluded and thus invisible:

L_o = Σ_k || D_k(P^t_k, Ĝ^t_k) - D_k(P^t_k, G^t_k) ||²,  with  Ĝ^t_k = Ĝ^{t-1}_k (1 - max(M^t_k)) + G^t_k max(M^t_k),

where P^t_k and G^t_k denote the location and Gestalt encodings of slot k, D_k refers to the RGB part of the decoder network, Ĝ^t_k denotes the Gestalt code averaged around the last time step in which the entity was visible, and M^t_k denotes the mask of slot k at time step t. As a result, L_o is only applied when the object becomes invisible, which is the case when max(M^t_k) approaches zero.

Time Persistence Loss A second mechanism to enforce object permanence, and to also regularize the network towards temporally consistent slot encodings, is a time persistence regularization loss:

L_t = 0.1 Σ_k || D_k(p_0, G^{t-1}_k) - D_k(p_0, G^t_k) ||²,

where again D_k refers to the RGB part of the decoder network and p_0 is the center position of the image. L_t essentially penalizes large visual changes in the decoded object between two consecutive time steps.

Position Change Loss In order to encourage the network to predict small slot-position changes over time, a simple L_2 regularization loss penalizes the position change between two time steps:

L_p = 0.01 Σ_k || P^{t-1}_k - P^t_k ||².

Supervision Loss Finally, for the experiments that use a supervised target object location signal for fine-tuning (Equation 6), we added a gating network Φ that operates on the latent Gestalt codes before binarization and predicts softmax probabilities, which are used to decide which entity corresponds to the Snitch in the CATER dataset. The location of the selected entity is then used in an L_2 loss to foster regression to the target location provided in the dataset:
L_s = µ_s || Σ_k P^t_k Φ(Ĝ^t_k) - p^t_snitch ||²  (Equation 6)

The final losses for the network result from adding the individual loss components, for the unsupervised and supervised settings, respectively:

L_unsup = L_BCE + L_o + L_t + L_p,  L_sup = L_BCE + L_o + L_t + L_p + L_s.
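The Gestalt-memory update used by the object permanence loss and the two simpler regularization losses can be sketched as follows. The `decode` argument is a stand-in for the RGB decoder D_k; this is an illustrative sketch, not the actual training code:

```python
import numpy as np

def update_gestalt_memory(g_hat_prev, g_t, mask_max):
    """Running Gestalt memory Ĝ^t_k: frozen while the slot's mask vanishes
    (object invisible), tracking G^t_k while the object is visible."""
    return g_hat_prev * (1.0 - mask_max) + g_t * mask_max

def position_change_loss(p_prev, p_t, weight=0.01):
    """L_p: penalizes large slot-position changes between two time steps."""
    return weight * np.sum((p_prev - p_t) ** 2)

def time_persistence_loss(decode, g_prev, g_t, p0, weight=0.1):
    """L_t: penalizes large visual changes of the decoded object, rendered
    at the fixed center position p0, between consecutive time steps."""
    return weight * np.sum((decode(p0, g_prev) - decode(p0, g_t)) ** 2)
```

Note how `update_gestalt_memory` realizes the gating in Equation 2: with max(M^t_k) near zero the memory is held constant, so the permanence loss compares the current decoding against the last visible appearance.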

4. EXPERIMENTS & RESULTS

We performed unsupervised training in all experiments. Only for the CATER challenge did we additionally evaluate a version in which the network was fine-tuned via supervision using Equation 6.

4.1. CATER-TRACKING CHALLENGE

In the CATER tracking challenge (cf. Figure 2), the task is to locate a unique yellow object, called the Snitch, at the end of a video sequence. During the video, different geometric objects rotate, move, and hop in a scripted random way.

Figure 2: The CATER Snitch localization challenge: The task is to locate the yellow spherical object called Snitch within the last frame. The challenge is that the Snitch might be contained and moved by a cone. So its location has to be inferred by recognizing and remembering a containment event and then tracking the position of the container. Image adapted from [32].

During these movements, cone objects can engulf other objects and move them to another location before releasing them again, which can lead to situations where the Snitch remains hidden in the last frame of the video. Therefore, the challenge is not only to recognize containment events, but also to track the specific cone that contains the Snitch. For classification purposes, the 3D Snitch position is partitioned into a 6 × 6 grid, resulting in a total of 36 classes. To account for the fact that a small location error could lead to misclassification, a commonly reported metric is the L_1 grid distance. We additionally report the continuous L_2 distance with grid length 1, such that the distance is comparable to, but more informative than, the L_1 grid distance. As described in Girdhar & Ramanan [32], we split the dataset with a 70:30 train/test ratio. In order to produce labels from the purely unsupervisedly trained network, we train a separate small classifier with around 17k parameters, which operates only on the latent states of the trained Loci network via a correction and a gating network. The correction network computes a residual for the location of each entity. The gating network computes softmax probabilities, which are used to select the location belonging to the Snitch object, similar to Equation 6.
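The grid-based evaluation described above can be sketched as follows. The scene half-width `extent` and the coordinate convention are hypothetical; only the 6 × 6 partition and the L_1 grid distance come from the text:

```python
def snitch_grid_cell(x, y, grid=6, extent=3.0):
    """Map a continuous 2D Snitch position in [-extent, extent]^2 onto the
    6x6 classification grid (extent is a hypothetical scene half-width)."""
    def to_idx(v):
        # clamp to valid cell indices 0..grid-1
        return min(grid - 1, max(0, int((v + extent) / (2 * extent) * grid)))
    return to_idx(x), to_idx(y)

def l1_grid_distance(cell_a, cell_b):
    """L1 distance between two grid cells, the commonly reported metric."""
    return abs(cell_a[0] - cell_b[0]) + abs(cell_a[1] - cell_b[1])
```

The 36 classes then correspond to the flattened cell indices, and a prediction one cell off incurs an L_1 grid distance of 1 rather than a full misclassification penalty.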
To prevent supervised gradient flow into the main network, the classifier module is trained only once the unsupervised training has finished. Eventually, the location and Gestalt codes are computed for the whole dataset and extracted into a separate data file, from which the classifier module is trained, again using a 70:30 train/test split. As shown in Table 1, Loci not only surpasses all previous methods by a large margin, achieving a top-1 accuracy of 90.7% and a top-5 accuracy of 98.5% with an L_1 distance of 0.14, it also surpasses most of the existing work when the main architecture is trained purely unsupervised. Loci learns object permanence merely by means of inductive bias and regularization. This is shown in Figure 3, where Loci keeps the representation of the contained Snitch in memory. While the shape details blur over time, importantly, the location is correctly tracked to reconstruct the Snitch once it is revealed. Figure 4 also demonstrates the tracking performance in case of several co-occurring occlusions and double-containment situations, where two cones cover the Snitch. Here, the location output of the supervised gating network is marked in the video frames. In additional studies (detailed in Appendix A.3.5), we investigate Loci's object permanence and also demonstrate the effective disentanglement of position, Gestalt, size, and priority codes, which underlines Loci's ability to separate position and Gestalt of objects from a scene without supervision.

4.2. MOVING MNIST

The moving MNIST (MMNIST) challenge is a video prediction dataset [77]. The task is to predict the motion of two digits of up to 28 × 28 pixels, which move independently inside a 64 × 64 pixel window and bounce off the walls in a predictable manner. While the dataset is usually implemented by simply bouncing the entire 28 × 28 MNIST sub-image within the 64 × 64 window, we first crop each digit to its actual appearance in order to obtain a realistic bounding box and thus more realistic bouncing effects (a bounce occurs once the digit itself touches the border, instead of its surrounding sub-image). This removes an undesired bias in the dataset: with full sub-images, remembering each digit's individual bounding box is the only way to correctly predict a bounce off a wall; with tight crops, the network can predict the bounce based on the pixel information alone. While the task was originally formulated as predicting the next 10 frames after receiving the first 10 frames, we also evaluate the ability to generate temporal closed-loop predictions for up to 100 frames after being trained to predict only 10 frames. We compare Loci to the state-of-the-art approach PhyDNet, which is also designed to disentangle 'what' from 'where', more specifically, physical dynamics from unknown factors such as the digit's Gestalt code. While using the original code provided by the authors, we could not reach the reported performance of PhyDNet with our unbiased MMNIST data loader. Nevertheless, PhyDNet reaches a high accuracy for the 10-frame prediction, which Loci only slightly surpasses. For extended temporal predictions, though, PhyDNet quickly dissolves the digits, while Loci preserves the Gestalt of each digit over the 100 predicted frames, as shown in Figure 5.
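The digit-cropping modification can be sketched as follows (a minimal version; the actual data loader presumably also handles batching and canvas placement):

```python
import numpy as np

def crop_to_content(digit, threshold=0):
    """Crop a 28x28 MNIST digit to its tight bounding box so that bounces
    occur when the visible strokes, not the full sub-image, hit a wall."""
    rows = np.any(digit > threshold, axis=1)
    cols = np.any(digit > threshold, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]   # first and last non-empty row
    c0, c1 = np.where(cols)[0][[0, -1]]   # first and last non-empty column
    return digit[r0:r1 + 1, c0:c1 + 1]
```

The cropped digit's extent then defines the collision boundary when it is animated inside the 64 × 64 window.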
Table 2 shows a quantitative comparison between Loci and PhyDNet: while Loci consistently outperforms PhyDNet on the structural similarity index measure (SSIM, [90]), PhyDNet has a lower MSE than Loci after 30 frames. This might be due to PhyDNet's blurring, which accounts for the uncertainty in the position estimate at the cost of the digits' shape.

4.3. OTHER DATASETS & SUPPLEMENTARY MATERIAL

Apart from video footage of the main experiments and further algorithmic details on Loci, we provide additional insights and tests in the supplementary material: We evaluate the real-world tracking performance of Loci on ten hours of aquarium footage found on YouTube. Additionally, we examine Gestalt preservation and indicators of intuitive physics in closed-loop predictions on the CLEVRER dataset [93]. Furthermore, several ablation studies are provided.

5. CONCLUSION AND FUTURE WORK

We presented Loci, a novel location and identity disentangling artificial neural network, which learns to process object-distinct position and Gestalt encodings. Loci significantly exceeds current state-of-the-art architectures on the CATER challenge, while requiring fewer network parameters and less supervision. Loci thus offers a critical step towards solving the binding problem [37; 79; 84]. We particularly hope that the mechanisms proposed in this work bear the potential to enrich the field of object representation learning and highlight the importance of well-chosen inductive biases. Currently, Loci operates on static background estimates and cameras, which we consider its main limitation. We intend to extend the background model to incorporate rich and varying backgrounds more flexibly in future work. Potential avenues for this extension include an explicit, separate encoding of ego-motion and depth [52]. Furthermore, image reconstructions may be fused with subsequent image inputs in an information-driven manner. We also expect to create even more compact object-interaction encodings following theories of event-predictive cognition and related conceptions and models in computational neuroscience [12; 14; 30; 78]. Moreover, we are excited to explore the potential of Loci to address problems of intuitive physics and causality [66; 71; 75], seeing that Loci offers a suitable compositional structure [37; 51] to enable symbolic reasoning. Finally, we hope that Loci will also combine well with reinforcement learning and model-predictive planning in system control setups to pursue active, goal-directed environmental interactions.

A APPENDIX

As additional content, we first provide details on further evaluations conducted with Loci. To evaluate the tracking performance of Loci in a real-world example, we trained it on ten hours of aquarium footage found on YouTube. The task poses the additional challenge of coping with backgrounds that are not fully stationary. The results are shown in Figure 6 and demonstrate that Loci is able to track 15+ objects in a complex real-world environment. The Gestalt-preserving performance of Loci for closed-loop predictions is also demonstrated exemplarily on the CLEVRER dataset in Figure 7. Here, the effects of collisions between different geometric objects are predicted into the future. While the location deviates visibly over time, Loci is able to preserve the Gestalt code even for more complex objects in a closed-loop setup, which is considered particularly challenging for RNNs.

A.1 CATER EVALUATION DETAILS

In order to produce the results in Table 1, we trained five networks with different initial seeds and then evaluated each network five times, again with different seeds. In Table 1 we report the mean results over the five evaluation runs of our best-performing network (Network 3 in Table 3).

A.2 BACKGROUND MODEL

In order to compute a background model for datasets with a simple static background, like CATER, CLEVR, or CLEVRER, we use a simple Gaussian mixture model. Specifically, we use the function createBackgroundSubtractorMOG2 from OpenCV [9] with a learning rate of 0.00001 to compute a background image based on the training set. For more complex backgrounds, as in the aquarium example, where the camera is still static, we use an autoencoder ResNet that encodes the input image into a latent vector with the same size as the Gestalt code. This latent background code is then run through a residual GateL0RD layer in order to capture temporal dynamics, like the changing water level, and is then passed through a ResNet decoder. The background autoencoder is pre-trained using an L1 reconstruction loss in order to focus on the most dominant features of the background while largely ignoring foreground objects.

In Figure 11, we show the network's performance in terms of the L2 object loss when trained without certain regularization losses.
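The static-background estimate relies on OpenCV's createBackgroundSubtractorMOG2. As a dependency-free illustration of the underlying idea, a plain exponential running average with the stated learning rate produces a comparable static-background image; this is a simplified stand-in, not the actual MOG2 mixture model:

```python
import numpy as np

def running_background(frames, learning_rate=1e-5):
    """Exponential running-average background estimate over training frames
    (a simplified stand-in for OpenCV's MOG2 background subtractor)."""
    bg = frames[0].astype(np.float64)
    for frame in frames[1:]:
        # blend each new frame into the estimate with a small learning rate,
        # so transient foreground objects are averaged away
        bg = (1.0 - learning_rate) * bg + learning_rate * frame
    return bg
```

With the tiny learning rate of 1e-5 used in the paper, only pixels that stay constant over very many frames fully imprint themselves on the background image.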

A.3.4 E-PROP VS BACK-PROPAGATION THROUGH TIME

In Figure 9, we compare the L2 object loss for truncated backpropagation through time with different sequence lengths against a version using backpropagation together with e-prop. In all experiments, the networks are trained on the full CATER sequences of 300 frames, which are fed into the network sequentially. We updated and afterwards detached the gradients using truncated backpropagation through time with different time intervals. Using e-prop-informed gradients not only drastically sped up training time, it was also more sample-efficient while achieving the same performance as truncated backpropagation through time with an interval of three time steps. Especially interesting is the experiment in which we tested truncated backpropagation with a sequence length of one without e-prop's informed gradients: in this case, the performance is significantly worse than with e-prop's informed gradients.
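The relation between forward eligibility accumulation and backpropagation through time can be illustrated on a toy linear recurrent unit h_t = a·h_{t-1} + w·x_t: the gradient dh_T/dw can be accumulated forward in time instead of being computed by backward unrolling. This is a minimal sketch of the principle only, not Loci's GateL0RD-specific e-prop variant:

```python
def forward_eligibility(a, xs):
    """Accumulate the eligibility trace e_t = a * e_{t-1} + x_t forward in
    time; for h_t = a*h_{t-1} + w*x_t, e_T equals the gradient d h_T / d w."""
    e = 0.0
    for x in xs:
        e = a * e + x
    return e

def bptt_gradient(a, xs):
    """The same gradient obtained by explicit backward unrolling:
    d h_T / d w = sum_t a^(T-1-t) * x_t."""
    T = len(xs)
    return sum(a ** (T - 1 - t) * x for t, x in enumerate(xs))
```

Because the trace is carried forward with the network state, weight updates can be applied after every frame (truncation length 1) without losing the temporal credit assignment that full BPTT would provide for this unit.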

A.3.5 LESION STUDIES

To quantify the effect of the Gestalt and position codes, we conducted an ablation by adding normally distributed noise with standard deviations σ ∈ {0, 0.1, ..., 1.0} to the Gestalt and position codes provided by the encoder, right before they were forwarded to the transition module, and calculated the resulting accuracies and errors, which are reported in Figure 12 and Figure 13 for the CATER and moving MNIST benchmarks, respectively. Furthermore, exemplary images of corresponding position, priority, size, and Gestalt code manipulations are presented in Figure 14 and Figure 5, again for the CATER and moving MNIST benchmarks. Manipulating a particular code changes only the corresponding position, priority, size, or Gestalt, demonstrating the disentangled encoding of object features in the respective latent codes. In particular, while the Gestalt-code manipulation changes the actual digits in the moving MNIST experiment under conservation of positions (first vs. third row of Figure 5), the position-code manipulation alters the position of the digits in space without modifying the object shape (first vs. fourth row of Figure 5).

Figure 17: Comparison between unsupervised (left) and supervised (right) tracking while the Snitch is contained progressively longer. The x-axis depicts the number of frames the Snitch has been contained. The left y-axis shows the L2 distance to the target position, while the right y-axis shows how often a Snitch was contained that long within the test set (green shaded area). The thick line represents the median L2 distance, while the thin line represents the mean L2 distance. The weaker colored area shows the 90-10 quantile range, while the stronger one corresponds to the 75-25 quantile range.

A.4 GESTALT CODE INVESTIGATION

Here we investigate the latent landscape of the Gestalt codes learned by Loci when trained on the CATER challenge. To do so, we first selected a fixed number (450) of Gestalt codes created by Loci on the CATER dataset. We then clustered these Gestalt codes using a Gaussian Mixture model and performed a dimensionality reduction using t-SNE. In some of the found clusters, mainly the color of the objects differed, while in others both color and shape varied. The images in Figure 16 were created by using a principal component analysis (PCA) to calculate the main axes of variance within different clusters. A random Gestalt code was chosen from a cluster and modified by subtracting or adding a portion of these axes; in other words, the Gestalt code was used as a starting point and new Gestalt codes were sampled by walking down and up the selected axes. These new Gestalt codes are no longer binary but were still clipped to [0, 1] to create meaningful inputs to the decoder. We were able to identify one axis that varies only the color of the objects, while the shape stays the same, and another axis that varies only the shape, while the color stays the same. Since these axes are disentangled, they can be combined additively to simultaneously change the color and shape of objects.
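The traversal procedure above can be sketched as follows. This is a minimal numpy sketch under toy assumptions: the cluster codes are synthetic stand-ins for Loci's Gestalt codes, the code size D_g is illustrative, and the PCA is computed directly via SVD rather than a library routine.

```python
import numpy as np

rng = np.random.default_rng(42)
D_g = 16                                 # assumed Gestalt-code size (illustrative)
cluster = rng.uniform(0, 1, (450, D_g))  # stand-in for the codes of one cluster

# Principal axes of variance via SVD of the centered codes.
mean = cluster.mean(axis=0)
_, _, vt = np.linalg.svd(cluster - mean, full_matrices=False)
axis0 = vt[0]                            # main axis of variance

# Walk a randomly chosen seed code down and up the axis; the resulting codes
# are no longer binary but are clipped to [0, 1] as valid decoder inputs.
seed = cluster[rng.integers(len(cluster))]
steps = np.linspace(-1.0, 1.0, 9)
traversal = np.clip(seed[None, :] + steps[:, None] * axis0[None, :], 0.0, 1.0)
```

Each row of `traversal` is a new Gestalt code along the chosen axis; in the paper, decoding such codes reveals disentangled color and shape axes.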

A.5 OBJECT PERMANENCE EVALUATION

To analyze Loci's object permanence abilities, we evaluated how the number of time steps the Snitch was hidden affects Loci's ability to correctly locate it. Figure 17 plots the tracking accuracy for progressively extended periods of Snitch containment. For both supervised and unsupervised tracking, the location error remains constant after an initial increase, even over extended time spans.

A.6 E-PROP FOR GATEL0RD

GateL0RD is defined in Equations 10 to 15, following Gumbsch et al. [39], with the gating network g, the candidate network r, and the cell state c. Next, we detail how we applied e-prop in GateL0RD.

A.6.1 GATEL0RD MODEL

g_j^t = Λ( Σ_i θ^{rec,g}_{j,i} c_i^{t−1} + Σ_i θ^{in,g}_{j,i} x_i^t + b_j^g + N(0, Σ) )    (gating network state) (10)

r_j^t = φ( Σ_i θ^{rec,r}_{j,i} c_i^{t−1} + Σ_i θ^{in,r}_{j,i} x_i^t + b_j^r )    (candidate network state) (11)

c_j^t = (1 − g_j^t) c_j^{t−1} + g_j^t r_j^t    (cell state update) (12)

Gating network: core derivative and eligibility determination:

Λ′(x) = ∂Λ(x)/∂x = 0 if x ≤ 0, and 1 − Λ(x)² otherwise

∂E/∂θ_{j,i} = (∂E/∂c_j^t) ϵ^t_{j,i}    (actual update signal) (18)

ϵ^{rec,g,t}_{j,i} = (1 − g_j^t) ϵ^{rec,g,t−1}_{j,i} + Λ′ c_i^{t−1} (r_j^t − c_j^{t−1})    (recurrent weights-specific elig.)

ϵ^{in,g,t}_{j,i} = (1 − g_j^t) ϵ^{in,g,t−1}_{j,i} + Λ′ x_i^t (r_j^t − c_j^{t−1})    (input weights-specific elig.)

ϵ^{b,g,t}_j = (1 − g_j^t) ϵ^{b,g,t−1}_j + Λ′ (r_j^t − c_j^{t−1})    (bias weight-specific elig.) (21)

Candidate network forward eligibility propagation:

∂E/∂θ_{j,i} = (∂E/∂c_j^t) ϵ^t_{j,i}    (actual update signal) (22)

ϵ^{rec,r,t}_{j,i} = (1 − g_j^t) ϵ^{rec,r,t−1}_{j,i} + (1 − (r_j^t)²) c_i^{t−1} g_j^t    (recurrent weights-specific elig.)

ϵ^{in,r,t}_{j,i} = (1 − g_j^t) ϵ^{in,r,t−1}_{j,i} + (1 − (r_j^t)²) x_i^t g_j^t    (input weights-specific elig.)

ϵ^{b,r,t}_j = (1 − g_j^t) ϵ^{b,r,t−1}_j + (1 − (r_j^t)²) g_j^t    (bias weight-specific elig.) (25)

Note that the partial derivatives ∂E/∂θ_{j,i} in Equation 18 and Equation 22 address the current time step t only. Also note that ∂E/∂c_j^t and ∂E/∂r_j^t carry true gradient information of the current time step, backpropagated through the feedforward connections of the architecture. In contrast, original e-prop uses local approximations of the learning signal only, which does not allow stacking multiple layers without losing the exact local gradient.

Table 5: Evaluations of Loci's foreground segmentation masks. Networks trained on the CATER challenge are evaluated on CLEVR with masks by running single CLEVR images through the network for 30 iterations and then comparing the masks using the Intersection over Union (IoU) metric.
We compare a mean per-mask accuracy (mask avg) and a size-weighted average that represents how many pixels were segmented correctly (pixel avg).
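The forward eligibility accumulation of Section A.6 can be sketched for a single GateL0RD unit. This is a toy numpy sketch under explicit assumptions: scalar weights, random inputs, and only the recurrent gating weight's eligibility trace, not the trained model. Λ is the ReTanh gating nonlinearity implied by the derivative above.

```python
import numpy as np

def retanh(x):
    """ReTanh gating nonlinearity: max(0, tanh(x))."""
    return np.maximum(0.0, np.tanh(x))

def retanh_deriv(x):
    """Its derivative: 0 for x <= 0, else 1 - retanh(x)^2."""
    return np.where(x <= 0.0, 0.0, 1.0 - retanh(x) ** 2)

rng = np.random.default_rng(1)
T = 20
x = rng.normal(size=T)                   # toy scalar input sequence
th_rec_g, th_in_g, b_g = 0.5, 0.8, 0.1   # gating-network weights (assumed)
th_rec_r, th_in_r, b_r = 0.3, 1.0, 0.0   # candidate-network weights (assumed)

c, eps = 0.0, 0.0                        # cell state and eligibility trace
for t in range(T):
    a_g = th_rec_g * c + th_in_g * x[t] + b_g            # gating pre-activation
    g = retanh(a_g)
    r = np.tanh(th_rec_r * c + th_in_r * x[t] + b_r)     # candidate state
    # Eligibility for the recurrent gating weight (forward accumulation):
    eps = (1.0 - g) * eps + retanh_deriv(a_g) * c * (r - c)
    c = (1.0 - g) * c + g * r                            # cell state update

# With a loss E at the final step, the update is dE/dtheta = (dE/dc_T) * eps_T,
# e.g. for E = (c_T - 1)^2:
grad = 2.0 * (c - 1.0) * eps
```

The trace is updated strictly forward in time, so no activations need to be stored across the sequence, which is what yields the memory savings reported in the main text.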

B LOCI ALGORITHM

Loci processes a sequence of RGB video frames I ∈ R^{T×H×W×3}. Processing is mostly done slot-wise, whereby the system is initialized with a variable number of K processing slots. Its main components are a slot-wise encoder, a transition module, and a slot-wise decoder. Moreover, a background processing module is implemented. The slot-wise encoder is implemented by a tree-structured, ResNet-based encoder (see Figure 18). The transition module processes slot-wise temporal dynamics and between-slot interaction dynamics (see Figure 19). The slot-wise decoder is again implemented by a ResNet (see Figure 20). For simple backgrounds, we use a Gaussian Mixture Model to obtain a default background estimate R̂_bg, which is used for the whole training set. For more complex backgrounds, we use an additional autoencoder module. In the remainder, we denote scalar values by lower-case letters, tensors by upper-case letters, and vectors by bold letters. Moreover, we denote slot-specific activities with a subscript k ∈ {1, .., K} and time by the superscript t. We drop t for temporary values. We first define the data and neural encoding sizes and types used throughout Loci's processing pipeline. We then specify the neural activity initialization. Finally, we detail the overall processing loop.

B.1 SLOT-WISE ENCODER

Inputs The encoder inputs at each time step t consist of:
• RGB input image I^t ∈ R^{H×W×3},
• MSE map E^t ∈ R^{H×W×1} (pixel-wise mean squared error between I^t and R̂^t),
• Slot-specific RGB image reconstructions R̂^t_k ∈ R^{H×W×3},
• Slot-specific mask predictions M̂^t_k ∈ R^{H×W×1},
• Slot-specific mask complements M̂^{t,s}_k = Σ_{k′∈{1,..,K}\k} M̂^t_{k′},
• Slot-specific isotropic Gaussian position map predictions Q̂^t_k ∈ R^{H×W},
• Slot-specific Gaussian position map complements Q̂^{t,s}_k = Σ_{k′∈{1,..,K}\k} Q̂^t_{k′},
• Background mask M̂^t_bg ∈ R^{H×W}, which is equivalent to 1 − Σ_{k∈{1,..,K}} M̂^t_k.

Outputs Based on these inputs, the slot-wise encoder network generates latent codes, which are forwarded to the transition module:
• Slot-specific Gestalt codes G^t_k ∈ R^{D_g},
• Slot-specific position codes P^t_k ∈ R^4, encoding an isotropic Gaussian (µ^t_k, σ^t_k) and a slot-priority code z^t_k, where D_g denotes the size of the Gestalt code and µ^t_k ∈ R^2.
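The complement maps and background mask listed above can be sketched directly. This is a minimal numpy sketch with toy shapes: the complement of slot k sums the masks of all other slots, and the background mask is one minus the sum over all slot masks.

```python
import numpy as np

K, H, W = 4, 8, 8
rng = np.random.default_rng(3)
masks = rng.uniform(0, 0.25, (K, H, W))   # toy slot masks M_k

mask_sum = masks.sum(axis=0)
complements = mask_sum[None] - masks      # M_k^s = sum over k' != k of M_k'
background_mask = 1.0 - mask_sum          # M_bg = 1 - sum over k of M_k
```

Subtracting each slot's own mask from the total avoids an explicit loop over the K−1 other slots.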

B.2 TRANSITION MODULE

A transition module processes interaction dynamics within and between the slot-respective codes and creates a prediction for the next state, which is fed into the decoder. The input to the transition module consists of G^t and P^t. It is processed across slots and per slot in the respective layers: multi-head attention predicts slot interactions (across slots), while GateL0RD predicts slot-specific dynamics (per slot). In our main CATER implementation, we use two attention layers with two heads each, with GateL0RD layers in between.

Outputs The outputs of the transition module, Ĝ^{t+1}_k and P̂^{t+1}_k, have the same size as its inputs. Additionally, recurrent, slot-respective hidden states Ĥ^t_k are maintained in the time-recurrent GateL0RD layers:
• Slot-specific position codes P̂^{t+1}_k ∈ R^4,
• Slot-specific Gestalt codes Ĝ^{t+1}_k ∈ R^{D_g},
• Slot-specific GateL0RD-layer-respective hidden states Ĥ^t_k ∈ R^{D_h}, where D_h denotes the size of the recurrent latent states.

B.3 SLOT-WISE DECODER

Inputs The outputs of all slots from the transition module, P̂^{t+1} and Ĝ^{t+1}, act as the input to the decoder.

Outputs The output of the decoder consists of the slot-respective masks M̂^{t+1}_k and RGB reconstructions R̂^{t+1}_k.

B.4 SEQUENCE INITIALIZATION

At time step t = 1, we determine the network's inputs based on randomly generated, fictive position and Gestalt estimates for each slot k, which are fed through the Loci decoder module. Initial position and Gestalt codes P̂^1_k and Ĝ^1_k are sampled from an isotropic Gaussian distribution N_position(µ_p, σ_p) and a factorized Gaussian distribution N_Gestalt(µ_g, σ_g) with learnable parameters µ_p ∈ R^3, σ_p ∈ R and µ_g ∈ R^{D_g}, σ_g ∈ R^{D_g}, respectively. The third position code value of P̂^1_k, that is, the Gaussian standard deviation σ̂^1_k, is set to 1/width, where width denotes the number of pixels in a row, effectively setting σ̂^1_k to one pixel distance. The fourth position code value, that is, the priority value ẑ^1_k, is set to its index value ẑ^1_k ← k, inducing an ordered priority, which biases initial random slot assignments and thus bootstraps initial slot-assignment progress. Based on the initial codes, we generate estimates of the output mask M̂^1_k, reconstruction R̂^1_k, and Gaussian positions Q̂^1_k by calling the slot-wise decoder (see Algorithm 1, line 19 and following): M̂^1_k, R̂^1_k, Q̂^1_k ← SlotWiseDecoder(P̂^1_k, Ĝ^1_k). We finally determine the first RGB image reconstruction R̂^1. The hidden states of the recurrent neural network GateL0RD are initialized to zero, that is, H^1_k ← 0.
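The slot initialization can be sketched as follows. This is a minimal numpy sketch with illustrative values: the learnable Gaussian parameters, slot count, code size, and hidden-state size are all toy assumptions, and only the two mean components of the position code are sampled.

```python
import numpy as np

K, D_g, width = 8, 16, 64                 # slots, Gestalt size, image width (assumed)
rng = np.random.default_rng(7)
mu_p, sigma_p = np.zeros(2), 0.1          # stand-ins for learnable position params
mu_g, sigma_g = np.zeros(D_g), 1.0        # stand-ins for learnable Gestalt params

P = np.zeros((K, 4))
P[:, :2] = mu_p + sigma_p * rng.normal(size=(K, 2))  # sampled Gaussian mean mu_k
P[:, 2] = 1.0 / width                                # sigma_k = one pixel distance
P[:, 3] = np.arange(K)                               # priority z_k = slot index
G = mu_g + sigma_g * rng.normal(size=(K, D_g))       # sampled Gestalt codes
H = np.zeros((K, 16))                                # GateL0RD hidden states = 0
```

The ordered priority column is what breaks the symmetry between otherwise identically initialized slots.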

B.5 MAIN PROCESSING LOOP

Loci first runs the main processing loop for ten time steps with the first video frame of a particular image sequence. It thereby bootstraps the objects into individual slots, somewhat similar to previous slot-attention approaches [60]. After the ten initial time steps, Loci keeps re-initializing the position P̂^t_k of a slot k to random values (as specified above) given that max(M̂^t_k) has been smaller than 0.5 for all time points until t. This induces an initial active search process. For longer video sequences, such as the aquarium footage, this search process was also used for invisible slots, which increased the number of tracked objects, but negatively influenced object permanence.

Note that the prediction error is calculated across the three RGB channels. It is determined as the MSE between the input I^t and the static background R̂_bg, multiplied per Hadamard product with the fourth root of the MSE between the input I^t and the predicted input R̂^t. The fourth root emphasizes small differences, encouraging accurate encodings of individual entities.

In the transition module, we apply a single trainable parametric bias neuron alpha, as proposed in Bachlechner et al. [1], instead of layer normalization. Alpha is initialized to zero. Its current value is multiplied with the output vector before the residual parts of the transition module. These alpha-residuals enforce the predictor to initially compute the identity function. Before decoding, the predicted Gestalt code is stochastically binarized (Algorithm 1, lines 17-18), Ĝ^{t+1} ← Sigmoid(Ĝ^{t+1}) followed by Ĝ^{t+1} ← Ĝ^{t+1} + Ĝ^{t+1}(1 − Ĝ^{t+1})N(0, 1), before the slot-wise decoder is called (line 19).
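The prediction-error map described above can be sketched directly. This is a minimal numpy sketch with random toy tensors: the per-pixel MSE against the static background, multiplied elementwise with the fourth root of the per-pixel MSE against the prediction.

```python
import numpy as np

H_px, W_px = 16, 16
rng = np.random.default_rng(5)
I = rng.uniform(0, 1, (H_px, W_px, 3))       # input frame (toy)
R_bg = rng.uniform(0, 1, (H_px, W_px, 3))    # static background estimate (toy)
R_hat = rng.uniform(0, 1, (H_px, W_px, 3))   # predicted reconstruction (toy)

mse_bg = ((I - R_bg) ** 2).mean(axis=-1)     # per-pixel MSE vs background
mse_pred = ((I - R_hat) ** 2).mean(axis=-1)  # per-pixel MSE vs prediction
E = mse_bg * mse_pred ** 0.25                # Hadamard product with fourth root
```

Because x**0.25 > x for x in (0, 1), the fourth root inflates small prediction errors relative to large ones, which is exactly the emphasis on subtle mismatches described in the text.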



Source Code: https://github.com/CognitiveModeling/Loci



Figure 1: Loci's main processing loop has three components: Individual entities k are processed by a slot-wise encoder module to generate the disentangled latent Gestalt code G_k and position code P_k. The transition module consists of L alternating blocks of multi-head self-attention (modeling slot-to-slot interactions) and GateL0RD (predicting within-slot dynamics). The resulting Gestalt and position codes are finally combined by the slot-wise decoder into entity-respective RGB reconstructions R̂_k and masks M̂_k. Tensors colored in red (M̂^{t,s}_k, Q̂^{t,s}_k) are calculated as complements from M̂^t_k and Q̂^t_k.

Figure 3: Object permanence shown on a CATER tracking example. While the network (after completely unsupervised training) struggles to keep the shape of the Snitch when contained for a longer time span, color and, importantly, the position of the Snitch are preserved during containment. Top row: tracked objects visualized through colored masks. Bottom row: RGB representation of the Snitch.

We split the dataset 70/30 into a training and test set and further put aside 20% of the training set as a validation set, leaving 56% of the original data for training. To blend in the supervision loss, we first set the supervision factor µ_s = 0.01 for the first 4 epochs, fostering mostly unsupervised training, then set µ_s = 0.1 for most of the remaining training process, and finally µ_s = 0.3 during the last epochs, to give the Snitch location a weak pull towards the target location.

Figure 4: Two challenging CATER tracking examples with several co-occurring containments / occlusions: Video frames overlaid with the predicted Snitch location (network trained with supervision).

Figure 5: Comparison between ground truth (top row), PhyDNet (center row), and Loci (bottom row) for a prediction of up to 90 frames. Both PhyDNet and Loci were trained on 10-frame prediction. While in PhyDNet the appearance of the digits dissolves after a few frames beyond the initial training distribution, Loci manages to keep the Gestalt code and the location accurate until the fourth collision at around frame 42. The Gestalt codes remain stable until the end of the considered 100 time steps.

Figure 6: Fully unsupervised real-world tracking example, trained on 10 hours of aquarium footage.

Figure 8: Comparison between GateL0RD and LSTM modules within the predictor part of the network.

Figure 9: Comparison between different sequence lengths for truncated back-propagation through time vs. back-propagation with e-prop.

Figure 10: Importance of different input channels.

Figure 11: Importance of regularization losses.

Figure 16: Traversing the Gestalt code manifold: Using a Gaussian Mixture model together with t-SNE, we find distinct clusters in the latent Gestalt code. Traversing the main axes of variance given by a PCA, we found a color axis (top row) and a shape axis (middle row). Since these axes are disentangled, they can be used in an additive manner to simultaneously change the color and shape (bottom row).

• Slot-specific mask outputs M̂^{t+1}_k ∈ R^{H×W},
• Slot-specific RGB image reconstructions R̂^{t+1}_k ∈ R^{H×W×3},
which are then used as part of the input at the next iteration, as specified above. We generate the combined reconstructed image R̂^{t+1} by summing all slot estimates R̂^{t+1}_k and the background estimate R̂_bg, weighted with their corresponding masks M̂^{t+1}_k and M̂^{t+1}_bg, as specified further in Algorithm 1.
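The mask-weighted combination can be sketched as follows. This is a minimal numpy sketch with toy shapes and random values: the K slot masks and the background mask are normalized with a softmax over the slot axis (as in Algorithm 1), and the reconstructions are summed with these weights.

```python
import numpy as np

K, H, W = 3, 8, 8
rng = np.random.default_rng(11)
mask_logits = rng.normal(size=(K + 1, H, W))      # K slot masks + background
recons = rng.uniform(0, 1, (K + 1, H, W, 3))      # R_k reconstructions + R_bg

# Numerically stable softmax over the slot/background axis.
e = np.exp(mask_logits - mask_logits.max(axis=0))
masks = e / e.sum(axis=0)

# Mask-weighted sum yields the combined reconstructed image.
R = (masks[..., None] * recons).sum(axis=0)
```

Since the softmax weights sum to one per pixel, the combined image is a convex combination of the slot and background reconstructions.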

p encodes all pixel positions normalized to [−1, 1]; width denotes the number of pixels in a row.

Algorithm 1 (continued):
25: [M̂^{t+1}_1, . . . , M̂^{t+1}_K, M̂^{t+1}_bg] ← softmax(concat[M̂^{t+1}_1, . . . , M̂^{t+1}_K, M̂_bg], axis = 'K')
26: R̂^{t+1} ← sum(concat[R̂^{t+1}_1, .., R̂^{t+1}_K, R̂_bg] • M̂^{t+1}, axis = 'K')
27: end for
28: return [R̂^1 . . . R̂^T]

Algorithm 2 Priority-based Attention
1: Inputs: Gestalt G_k ∈ R^{1×D_g}, Gaussian 2D position Q_k ∈ R^{H′×W′×1}, priority z ∈ R^K
2: Additional parameters: the values of the learnable θ_w ∈ R^K are initially set to 25, while θ_b ∈ R^K = {0, 1, . . . , (K − 1)} induces a default slot-order bias.
3: z′ ← (z • K + N(0, 0.1) + θ_b) • θ_w   # scale priorities and add noise
4: # subtract Gaussian attention from other slots, scaled by priority (σ denotes the sigmoid)
5: Q′_k ← max(0, Q_k − Σ_{k′∈{1,...,K}\k} σ(z′_k − z′_{k′}) • Q_{k′})
6: combine_k ← Q′_k × G_k   # combine_k ∈ R^{H′×W′×D_g}
7: return combine_k
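The priority-based attention of Algorithm 2 can be sketched as follows. This is a minimal numpy sketch with toy shapes and random priorities; the parameter values mirror the algorithm (θ_w initialized to 25, θ_b as a slot-order bias), but the input maps are synthetic stand-ins.

```python
import numpy as np

K, Hc, Wc = 4, 8, 8
rng = np.random.default_rng(2)
Q = rng.uniform(0, 1, (K, Hc, Wc))    # toy Gaussian position maps Q_k
z = rng.uniform(0, 1, K)              # toy slot priorities
theta_w = np.full(K, 25.0)            # learnable scale, initialized to 25
theta_b = np.arange(K, dtype=float)   # default slot-order bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Line 3: scale priorities and add noise.
z_s = (z * K + rng.normal(0, 0.1, K) + theta_b) * theta_w

# Lines 4-5: subtract the other slots' maps, scaled by priority differences.
Q_prime = np.empty_like(Q)
for k in range(K):
    sub = sum(sigmoid(z_s[k] - z_s[kp]) * Q[kp] for kp in range(K) if kp != k)
    Q_prime[k] = np.maximum(0.0, Q[k] - sub)
```

Line 6 of the algorithm would then broadcast-multiply each suppressed map `Q_prime[k]` with the slot's Gestalt code to form the decoder input.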

Figure 18: Encoder module diagram (identical for all slots k)

Quantitative results of the CATER Snitch localization challenge. Referenced results are from[96] and from the other system-respective papers.

Moving MNIST prediction accuracy, PhyDNet vs. Loci. Both networks were trained using the same dataloader to predict the next 90 frames given an input sequence of 10 frames.

Evaluations on the CATER Snitch challenge for supervised training. Table shows five networks trained with different initial seeds, each evaluated five times with different initial seeds.

Evaluations on the CATER Snitch challenge for supervised training: statistics of the 25 evaluations presented in Table 3.

Algorithm 1 Loci (main processing loop)
1: Inputs: input video I ∈ R^{T×H×W×3}, static background R̂_bg ∈ R^{H×W×3}
2: Network parameters: Θ_encoder, Θ_transition, Θ_decoder
3: Additional parameters: initialization parameters Θ_init; background threshold θ_bg, which is encoded as a uniform offset mask M̂_bg ← θ_bg; number of slots K; processing steps T
4: # sequence initialization (see Section B.4 for details)
5: for t = 1 . . . T do
Ĝ^{t+1}, P̂^{t+1}, H^{t+1} ← TransitionModule(G^t, P^t, H^t)

The mechanism bootstraps Loci's learning progress by initially focusing it on developing decoder-suitable Gestalt encodings.

6. ACKNOWLEDGEMENT

This work received funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy -EXC number 2064/1 -Project number 390727645 as well as from the Cyber Valley in Tübingen, CyVy-RF-2020-15. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Manuel Traub and Matthias Karlbauer, and the Alexander von Humboldt Foundation for supporting Martin Butz and Sebastian Otte. We also thank Simon Frank for the Gestalt-Code analysis in Figure 16 .

A.3 ABLATIONS

Using the binary cross-entropy with non-binary targets, while producing valid gradients, gives little insight into the network's actual performance. To better compare different designs and to put the objects into focus, we use a modified L2 loss for our ablation studies, computed based on Equation 9. Here, the MSE is masked with the error between the input and the background. As a result, we get a much higher error when the network produces a tracking error than when it produces a background reconstruction error. The L2 object loss takes into account instances where, for example, the network overlooks an object or makes a mistake in the prediction of an object's movement. It thus offers itself as a good metric for comparing the performance of different architectures during training.
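The masked L2 object loss can be sketched as follows. This is a minimal numpy sketch that paraphrases the weighting idea of Equation 9 rather than reproducing it exactly: the per-pixel squared prediction error is weighted by the per-pixel error between the input and the static background, so mistakes on foreground objects dominate the metric. Shapes and inputs are toy assumptions.

```python
import numpy as np

def l2_object_loss(I, R_hat, R_bg):
    """Per-pixel prediction error, weighted by foreground evidence."""
    err_pred = ((I - R_hat) ** 2).mean(axis=-1)  # prediction error
    err_bg = ((I - R_bg) ** 2).mean(axis=-1)     # input-vs-background weighting
    return (err_bg * err_pred).mean()

rng = np.random.default_rng(9)
I = rng.uniform(0, 1, (16, 16, 3))       # toy input frame
R_bg = rng.uniform(0, 1, (16, 16, 3))    # toy static background

loss_perfect = l2_object_loss(I, I, R_bg)     # perfect reconstruction
loss_bg_only = l2_object_loss(I, R_bg, R_bg)  # network predicts only background
```

A network that reproduces only the background pays the full foreground-weighted penalty, while background reconstruction errors on pixels that match the background contribute nothing.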

A.3.1 GATEL0RD VS LSTM

As shown in Figure 8, using GateL0RD instead of an LSTM within the predictor network significantly decreases the L2 object loss. Thus, it appears that GateL0RD's piece-wise constant latent state regularization mechanism indeed suitably biases the network towards assuming object permanence.

A.3.2 INPUT CHANNELS

In Figure 10, we show the network's performance in terms of L2 object when we zero out different input channels.

C POTENTIAL NEGATIVE SOCIETAL IMPACTS

Here we take potential negative societal impacts of Loci into consideration. Following the NeurIPS Ethics Guidelines, we identified two topics where Loci could potentially be used in a harmful way.

Could Loci directly facilitate injury to living beings? At its current stage, it is unlikely that Loci could directly be used in any weapon system. However, when developing Loci further, its unsupervised nature would allow training on huge amounts of unlabeled data, after which a weapon system could be fine-tuned to identify and track a specific target.

Could Loci help develop or extend harmful forms of surveillance? Potentially yes; Loci already works well with static cameras and backgrounds. While we did not evaluate Loci's performance on pedestrians, it is likely that Loci could be trained to track pedestrians in a surveillance setting. Supervised fine-tuning could then potentially be used to identify a specific target group of people.

While these potential misuses of Loci are concerning, they are an unavoidable byproduct of advancing the field of (unsupervised) object tracking as a whole. Seeing that current object tracking systems are mostly black boxes, particularly when it comes to deciphering how and why they track certain entities, Loci may enable better control over what is tracked and where exactly tracking may be applied. Hopefully, Loci can thus be used to mitigate unwanted tracking biases, or, at least, to facilitate the identification of such tracking biases.

