NEURAL SPATIO-TEMPORAL REASONING WITH OBJECT-CENTRIC SELF-SUPERVISED LEARNING

Abstract

Transformer-based language models have proved capable of rudimentary symbolic reasoning, underlining the effectiveness of applying self-attention computations to sets of discrete entities. In this work, we apply this lesson to videos of physical interactions between objects. We show that self-attention-based models operating on discrete, learned, object-centric representations perform well on spatio-temporal reasoning tasks that were expressly designed to trouble traditional neural network models and to require higher-level cognitive processes such as causal reasoning and an understanding of intuitive physics and narrative structure. We achieve state-of-the-art results on two datasets, CLEVRER and CATER, significantly outperforming leading hybrid neuro-symbolic models. Moreover, we find that techniques from language modelling, such as BERT-style self-supervised predictive losses, allow our model to surpass neuro-symbolic approaches while using 40% less labelled data. Our results corroborate the idea that neural networks can reason about the causal, dynamic structure of visual data and attain an understanding of intuitive physics, countering the popular claim that they are effective only at perceptual pattern recognition and not at reasoning per se.

1. INTRODUCTION

Artificial intelligence research has long been divided between rule-based approaches and statistical models. Neural networks, the canonical statistical approach, have clear limitations despite their massive popularity and success. For example, experiments with two recently released video question-answering datasets, CLEVRER (Yi et al., 2020) and CATER (Girdhar & Ramanan, 2020), demonstrate that neural networks fail to adequately reason about the spatio-temporal and compositional structure of visual scenes. While the networks perform adequately when asked to describe their inputs, they tend to fail when asked to predict, explain, or consider counterfactual possibilities. By contrast, a neuro-symbolic model called NS-DR (Yi et al., 2020) appears to be much better suited to predicting, explaining, and considering counterfactuals on this data. The model leverages independent neural networks to detect objects, infer dynamics, and syntactically parse the question; a hand-coded symbolic executor then interprets the questions, grounded in the outputs of these networks. The fact that hybrid models employing both distributed (neural) representations and symbolic logic can sometimes perform better has led some to consider neuro-symbolic hybrids a more promising model class than end-to-end neural networks (Andreas et al., 2016; Yi et al., 2018; Marcus, 2020).

There is evidence from other domains, however, that neural networks can indeed adequately model higher-level cognitive processes. For example, in some symbolic domains (such as language), neural networks outperform hybrid neuro-symbolic approaches when tasked with classification or prediction (Devlin et al., 2018). Neural models have also had some success in mathematics, a domain that, intuitively, would seem to require the execution of formal rules and the manipulation of symbols (Lample & Charton, 2020).
Somewhat surprisingly, large-scale neural language models such as GPT-3 (Brown et al., 2020) can acquire a propensity for arithmetic reasoning and analogy-making without being trained explicitly for such tasks, suggesting that current neural network limitations can be ameliorated by scaling to more data and using larger, more efficient architectures (Brown et al., 2020; Mitchell, 2020). A key motivation of our work, therefore, is to reconcile neural networks' existing limitations in video domains with their (perhaps surprising) successes in symbolic domains.
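The core operation the abstract alludes to, self-attention applied to a set of discrete, object-centric representations, can be sketched in a few lines. The following is a minimal illustrative example only, not the model used in this work: single-head scaled dot-product attention over a set of per-object "slot" vectors, with all names, dimensions, and weight matrices chosen purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(slots, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over object slots.

    slots: (num_objects, d) array, one learned vector per detected object.
    Returns an array of the same shape, where each slot's output is a
    weighted aggregate over all slots (pairwise object-object interactions).
    """
    q, k, v = slots @ Wq, slots @ Wk, slots @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # object-object affinities
    weights = softmax(scores, axis=-1)        # rows sum to 1
    return weights @ v

# Illustrative usage: 5 hypothetical object slots of dimension 8.
rng = np.random.default_rng(0)
d = 8
slots = rng.standard_normal((5, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(slots, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one updated vector per object
```

Because the operation is over an unordered *set* of entities, the same mechanism applies whether the entities are word tokens (as in language models) or learned object representations extracted from video frames, which is precisely the transfer of lessons the text describes.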

