HOPPER: MULTI-HOP TRANSFORMER FOR SPATIOTEMPORAL REASONING

Abstract

This paper considers the problem of spatiotemporal object-centric reasoning in videos. Central to our approach is the notion of object permanence, i.e., the ability to reason about the location of objects as they move through the video while being occluded, contained, or carried by other objects. Existing deep learning based approaches often suffer from spatiotemporal biases when applied to video reasoning problems. We propose Hopper, which uses a Multi-hop Transformer to reason about object permanence in videos. Given a video and a localization query, Hopper reasons over image and object tracks to automatically hop over critical frames in an iterative fashion and predict the final position of the object of interest. We demonstrate the effectiveness of using a contrastive loss to reduce spatiotemporal biases. We evaluate on the CATER dataset and find that Hopper achieves 73.2% Top-1 accuracy at just 1 FPS by hopping through only a few critical frames. We also demonstrate that Hopper can perform long-term reasoning by building the CATER-h dataset,1 which requires multi-step reasoning to correctly localize objects of interest.

1. INTRODUCTION

In this paper, we address the problem of spatiotemporal object-centric reasoning in videos. Specifically, we focus on the problem of object permanence, which is the ability to represent the existence and trajectory of hidden moving objects (Baillargeon, 1986). Object permanence is essential to understanding videos in domains such as: (1) sports like soccer, where one needs to reason "which player initiated the pass that resulted in a goal?"; (2) activities like shopping, where one needs to infer "what items should the shopper be billed for?"; and (3) driving, where one needs to infer "is there a car next to me in the right lane?". Answering these questions requires the ability to detect and understand the motion of objects in the scene, which in turn requires detecting the temporal order of one or more object actions. It also requires learning object permanence, since one must predict the location of non-visible objects as they are occluded, contained, or carried by other objects (Shamsian et al., 2020). Hence, solving this task demands compositional, multi-step spatiotemporal reasoning, which has been difficult to achieve with existing deep learning models (Bottou, 2014; Lake et al., 2017). Existing models have been found lacking when applied to video reasoning and object permanence tasks (Girdhar & Ramanan, 2020). Despite rapid progress on video understanding benchmarks such as action recognition over large datasets, deep learning based models often suffer from spatial and temporal biases and are easily fooled by spurious statistical patterns and undesirable dataset biases (Johnson et al., 2017b). For example, researchers have found that models can recognize the action "swimming" even when the actor is masked out, because the models rely on the swimming pool (a scene bias) instead of the dynamics of the actor (Choi et al., 2019). Hence, we propose Hopper to address debiased video reasoning.
Hopper uses multi-hop reasoning over videos to reason about object permanence. Humans realize object permanence by identifying key frames where objects become hidden (Bremner et al., 2015) and reasoning forward to predict the motion and final location of objects in the video. Given a video and a localization query, Hopper uses a Multi-hop Transformer (MHT) over image and object tracks to automatically identify and hop over critical frames in an iterative fashion and predict the final position of the object of interest. Additionally, Hopper uses a contrastive debiasing loss that enforces consistency between attended objects and correct predictions, improving model robustness and generalization. We also build a new dataset, CATER-h, that reduces the temporal bias in CATER and requires long-term reasoning. We demonstrate the effectiveness of Hopper on the recently proposed CATER 'Snitch Localization' task (Girdhar & Ramanan, 2020) (Figure 1). Hopper achieves 73.2% Top-1 accuracy on this task at just 1 FPS. More importantly, Hopper identifies the critical frames where objects become invisible or reappear, providing an interpretable summary of the reasoning performed by the model. To summarize, the contributions of our paper are as follows. First, we introduce Hopper, which provides a framework for multi-step compositional reasoning in videos and achieves state-of-the-art accuracy on the CATER object permanence task. Second, we describe how to perform interpretable reasoning in videos via iterative reasoning over critical frames. Third, we perform extensive studies to understand the effectiveness of the multi-step reasoning and debiasing methods used by Hopper. Based on our results, we also propose a new dataset, CATER-h, which requires longer reasoning hops and demonstrates the gaps of existing deep learning models.
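The iterative hopping idea above can be illustrated schematically. The following is a minimal sketch, not the authors' implementation: a query state repeatedly attends over per-frame features, commits to one critical frame per hop, and refines itself with the attended summary. The function name `multi_hop_localize` and the simple averaging update rule are hypothetical choices for illustration only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_hop_localize(frame_feats, query, num_hops=3):
    """Iteratively attend over frame features, hopping to one
    critical frame per step and refining the query state.

    frame_feats: (T, d) array of per-frame features.
    query:       (d,) initial localization query.
    Returns the final query state and the list of hopped frame indices.
    """
    hops = []
    state = query
    for _ in range(num_hops):
        scores = frame_feats @ state            # (T,) attention logits
        attn = softmax(scores)                  # soft attention over frames
        hops.append(int(np.argmax(attn)))       # hard "hop": pick one critical frame
        # refine the query with the attention-weighted frame summary
        state = 0.5 * state + 0.5 * (attn[:, None] * frame_feats).sum(axis=0)
    return state, hops
```

The sequence of hopped frame indices plays the role of the discrete intermediate output: it can be inspected to see which frames the model treated as critical, independently of the final localization accuracy.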

2. RELATED WORK

Video understanding. Video models have matured quickly in recent years (Hara et al., 2018); approaches have migrated from 2D or 3D ConvNets (Ji et al., 2012) to two-stream networks (Simonyan & Zisserman, 2014), inflated designs (Carreira & Zisserman, 2017), models with additional emphasis on capturing temporal structure (Zhou et al., 2018), and, more recently, models that better capture spatiotemporal interactions (Wang et al., 2018; Girdhar et al., 2019). Despite this progress, these models often suffer from undesirable dataset biases and are easily confused by background objects in new environments as well as by varying temporal scales (Choi et al., 2019). Furthermore, they are unable to capture reasoning-based constructs such as causal relationships (Fire & Zhu, 2017) or long-term video understanding (Girdhar & Ramanan, 2020).

Visual and video reasoning. Visual and video reasoning have been well studied recently, but existing research has largely focused on the task of question answering (Johnson et al., 2017a; Hudson & Manning, 2018; 2019a; Yi et al., 2020). CATER, a recently proposed diagnostic video recognition dataset, focuses on spatial and temporal reasoning as well as on localizing a particular object of interest. There has also been significant research in object tracking, often with an emphasis on occlusions and the goal of providing object permanence (Wojke et al., 2017; Wang et al., 2019b). Traditional object tracking approaches often require expensive supervision of object locations in every frame. In contrast, we address object permanence and video recognition on CATER with a model that performs tracking-integrated object-centric reasoning without this strong supervision.

Multi-hop reasoning. Reasoning systems vary in expressive power and predictive abilities, and include symbolic reasoning, probabilistic reasoning, causal reasoning, etc. (Bottou, 2014).
Among them, multi-hop reasoning is the ability to reason over information collected from multiple passages to derive an answer (Wang et al., 2019a); it yields a discrete intermediate output of the reasoning process, which can help gauge a model's behavior beyond just the final task accuracy (Chen et al., 2019). Several multi-hop datasets and models have been proposed for the reading comprehension



Figure 1: Snitch Localization in CATER (Girdhar & Ramanan, 2020) is an object permanence task where the goal is to classify the final location of the snitch object within a 2D grid space.


* Work done as a NEC Labs intern.
1 https://github.com/necla-ml/cater-h

