NEURAL SPATIO-TEMPORAL REASONING WITH OBJECT-CENTRIC SELF-SUPERVISED LEARNING

Abstract

Transformer-based language models have proved capable of rudimentary symbolic reasoning, underlining the effectiveness of applying self-attention computations to sets of discrete entities. In this work, we apply this lesson to videos of physical interactions between objects. We show that self-attention-based models operating on discrete, learned, object-centric representations perform well on spatio-temporal reasoning tasks that were expressly designed to trouble traditional neural network models and to require higher-level cognitive processes such as causal reasoning and an understanding of intuitive physics and narrative structure. We achieve state-of-the-art results on two datasets, CLEVRER and CATER, significantly outperforming leading hybrid neuro-symbolic models. Moreover, we find that techniques from language modelling, such as BERT-style self-supervised predictive losses, allow our model to surpass neuro-symbolic approaches while using 40% less labelled data. Our results corroborate the idea that neural networks can reason about the causal, dynamic structure of visual data and attain an understanding of intuitive physics, countering the popular claim that they are effective only at perceptual pattern recognition and not at reasoning per se.

1. INTRODUCTION

Artificial intelligence research has long been divided into rule-based approaches and statistical models. Neural networks, a classic example of the statistical approach, have clear limitations despite their massive popularity and success. For example, experiments with two recently released video question-answering datasets, CLEVRER (Yi et al., 2020) and CATER (Girdhar & Ramanan, 2020), demonstrate that neural networks fail to adequately reason about spatio-temporal and compositional structure in visual scenes. While the networks perform adequately when asked to describe their inputs, they tend to fail when asked to predict, explain, or consider counterfactual possibilities. By contrast, a neuro-symbolic model called NS-DR (Yi et al., 2020) handles these prediction, explanation, and counterfactual questions much better. The model leverages independent neural networks to detect objects, infer dynamics, and syntactically parse the question; a hand-coded symbolic executor then interprets the questions, grounded in the outputs of these networks. The fact that hybrid models employing both distributed (neural) representations and symbolic logic can sometimes perform better has led some to consider neuro-symbolic hybrids a more promising model class than end-to-end neural networks (Andreas et al., 2016; Yi et al., 2018; Marcus, 2020). There is evidence from other domains, however, that neural networks can indeed adequately model higher-level cognitive processes. For example, in some symbolic domains (such as language), neural networks outperform hybrid neuro-symbolic approaches when tasked to classify or predict (Devlin et al., 2018). Neural models have also had some success in mathematics, a domain that, intuitively, would seem to require the execution of formal rules and the manipulation of symbols (Lample & Charton, 2020).
Somewhat surprisingly, large-scale neural language models such as GPT-3 (Brown et al., 2020) can acquire a propensity for arithmetic reasoning and analogy-making without being trained explicitly for such tasks, suggesting that current neural network limitations are ameliorated by scaling to more data and using larger, more efficient architectures (Brown et al., 2020; Mitchell, 2020). A key motivation of our work, therefore, is to reconcile existing neural network limitations in video domains with their (perhaps surprising) successes in symbolic domains. One common element of these latter results is the repeated application of self-attention (Vaswani et al., 2017) to sequences of discrete 'entities'. Here, we apply this insight to videos of physical interactions between sets of objects, where the input data are continuously-valued pixel arrays at multiple timesteps (together with symbolic questions in certain cases). A key design decision is the appropriate level of granularity for the discrete units underlying the self-attention computation. What is the visual analogue of a word in language, or a symbol in mathematics? We hypothesize that the discrete entities acted upon by self-attention should correspond to semantic entities relevant to the task. For tasks based on visual data derived from physical interactions, these entities are often objects (van Steenkiste et al., 2019; Battaglia et al., 2018). To extract representations of these entities, we use MONet, an unsupervised object segmentation model (Burgess et al., 2019), though we leave open the possibility that other object-estimation algorithms might work better. We propose that a sufficiently expressive self-attention model acting on entities corresponding to physical objects will exhibit, on video datasets, the level of higher-level cognition and 'reasoning' seen when such models are applied to language or mathematics.
Altogether, our results demonstrate that self-attention-based neural networks can outperform hybrid neuro-symbolic models on visual tasks that require high-level cognitive processes, such as causal reasoning and physical understanding. We show that choosing the right level of discretization is critical for successfully learning these higher-order capabilities: pixels and local features are too fine, and entire scenes are too coarse. Moreover, we identify the value of self-supervised tasks, especially in low-data regimes. These tasks ask the model to infer future arrangements of objects given the past, or to infer what must have happened for objects to look as they do in the present. We verify these conclusions on two video datasets, one in which the input is exclusively visual (CATER) and one that requires combining language (questions) and vision (CLEVRER).

2. METHODS

Our principal motivation is the converging evidence for the value of self-attention mechanisms operating on finite sequences of discrete entities. Written language is inherently discrete and hence well-suited to self-attention-based approaches. In other domains, such as raw audio or vision, it is less clear how to leverage self-attention. We hypothesize that the application of self-attention-based models to visual tasks could benefit from an approximate 'discretization' process analogous to the segmentation of speech into words or morphemes, and that determining the appropriate level of discretization is an important choice that can significantly affect model performance. At the finest level, data could simply be discretized into pixels (as is already the case for most machine-processed visual data). But since pixels are too fine-grained, some work treats the downsampled "hyper-pixel" outputs of a convolutional network as the set of discrete units (e.g. Zambaldi et al. (2019); Lu et al. (2019)). In the case of videos, an even coarser discretization scheme is often used: representations of frames or subclips (Sun et al., 2019b). The neuroscience literature, however, suggests that biological visual systems infer and exploit the existence of objects, rather than relying on spatial or temporal blocks with artificial boundaries (Roelfsema et al., 1998; Spelke, 2000; Chen, 2012). Because objects are the atomic units on which the tasks we consider here focus, it makes sense to discretize at the level of objects. Numerous object segmentation algorithms have been proposed (Ren et al., 2015; He et al., 2017; Greff et al., 2019). We chose MONet, an unsupervised object segmentation algorithm that produces object representations with disentangled features (Burgess et al., 2019). Because MONet is unsupervised, we can train it directly on our domain of interest without the need for object segmentation labels.
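To make the object-level discretization concrete, the sketch below pools per-pixel features into one vector per object using soft segmentation masks. This is only a simplified stand-in for MONet's full recurrent-attention-plus-VAE pipeline; the array shapes and the `masked_pool` helper are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def masked_pool(features, masks):
    """Pool per-pixel features into one vector per object.

    features: (H, W, D) array of per-pixel features.
    masks:    (N_o, H, W) soft masks; masks[i, y, x] is the
              probability that pixel (y, x) belongs to object i.
    Returns an (N_o, D) array, one pooled vector per object.
    """
    n_obj = masks.shape[0]
    flat_feats = features.reshape(-1, features.shape[-1])  # (H*W, D)
    flat_masks = masks.reshape(n_obj, -1)                  # (N_o, H*W)
    # Normalize each mask so it forms a weighting over pixels.
    weights = flat_masks / (flat_masks.sum(axis=1, keepdims=True) + 1e-8)
    return weights @ flat_feats                            # (N_o, D)

# Example: 4 objects, an 8x8 feature map with 16 channels.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 16))
masks = rng.uniform(size=(4, 8, 8))
slots = masked_pool(feats, masks)
print(slots.shape)  # (4, 16)
```

In MONet proper, the masked pixels are passed through a variational encoder rather than averaged, yielding the latent means used downstream; the pooling above only illustrates how soft masks turn a dense feature map into a discrete set of object vectors.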
To segment each frame into object representations, MONet first uses a recurrent attention network to obtain a set of N_o "object attention masks" (N_o is a fixed parameter). Each attention mask represents the probability that any given pixel belongs to that mask's object. The pixels assigned to the mask are encoded into latent variables with means µ_ti ∈ R^d, where i indexes the object and t the frame. These means are used as the object representations in our model. More details are provided in Appendix A.1. The self-attention component is a transformer model (Vaswani et al., 2017) over the sequence µ_ti. In addition to this sequence of vectors, we include a trainable vector CLS ∈ R^d that is used to generate classification results; this plays a similar role to the CLS token in BERT (Devlin et al., 2018). Finally, for our CLEVRER experiments, where the inputs include a question and potentially several
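The read-out over object latents can be sketched as self-attention applied to the flattened µ_ti sequence with a learned CLS vector prepended. The sketch below is a single attention head with random weights; the actual model stacks transformer layers with feed-forward blocks and trains end-to-end, so treat the dimensions and single-head form here as simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

d = 16                    # latent dimension of each object representation
n_frames, n_obj = 6, 4    # T frames, N_o objects per frame
rng = np.random.default_rng(0)

mu = rng.normal(size=(n_frames * n_obj, d))  # flattened mu_ti sequence
cls = rng.normal(size=(1, d))                # trainable CLS vector
x = np.concatenate([cls, mu], axis=0)        # prepend CLS, as in BERT

w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)

# The output at the CLS position attends over every object in every
# frame and would be fed to a classifier head.
print(out[0].shape)  # (16,)
```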

