HOPPER: MULTI-HOP TRANSFORMER FOR SPATIOTEMPORAL REASONING

Abstract

This paper considers the problem of spatiotemporal object-centric reasoning in videos. Central to our approach is the notion of object permanence, i.e., the ability to reason about the location of objects as they move through a video while being occluded, contained, or carried by other objects. Existing deep learning approaches often suffer from spatiotemporal biases when applied to video reasoning problems. We propose Hopper, which uses a Multi-hop Transformer to reason about object permanence in videos. Given a video and a localization query, Hopper reasons over image frames and object tracks, automatically hopping over critical frames in an iterative fashion to predict the final position of the object of interest. We demonstrate the effectiveness of a contrastive loss in reducing spatiotemporal biases. Evaluated on the CATER dataset, Hopper achieves 73.2% Top-1 accuracy at just 1 FPS by hopping through only a few critical frames. We also demonstrate that Hopper can perform long-term reasoning by building a CATER-h dataset 1 that requires multi-step reasoning to correctly localize objects of interest.

1. INTRODUCTION

In this paper, we address the problem of spatiotemporal object-centric reasoning in videos. Specifically, we focus on object permanence, the ability to represent the existence and trajectory of hidden moving objects (Baillargeon, 1986). Object permanence can be essential to understanding videos in domains such as: (1) sports like soccer, where one needs to reason, "which player initiated the pass that resulted in a goal?"; (2) activities like shopping, where one needs to infer "which items should the shopper be billed for?"; and (3) driving, where one needs to infer "is there a car next to me in the right lane?". Answering these questions requires the ability to detect and understand the motion of objects in the scene, including the temporal order of one or more of their actions. Furthermore, it requires learning object permanence: the ability to predict the location of non-visible objects as they become occluded, contained, or carried by other objects.
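To make the notion of iterative "hopping" concrete, the following is a purely illustrative sketch (not the authors' architecture): at each hop, a query vector attends over per-frame features, the most-attended frame is selected as that hop's critical frame, and its feature refines the query for the next hop. The feature shapes, the additive query update, and the argmax selection are all assumptions made for illustration.

```python
# Illustrative sketch of iterative multi-hop attention over frame features.
# NOT the authors' implementation: shapes, the additive query update, and
# hard argmax frame selection are simplifying assumptions.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_hop(frame_feats, query, num_hops=3):
    """frame_feats: (T, d) per-frame features; query: (d,) localization query."""
    hops = []
    for _ in range(num_hops):
        attn = softmax(frame_feats @ query)  # attention weights over T frames
        t = int(attn.argmax())               # this hop's "critical" frame
        hops.append(t)
        query = query + frame_feats[t]       # refine query with that frame's feature
    return hops, query

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))  # toy example: 16 frames, 8-dim features
q = rng.normal(size=8)
hops, q_out = multi_hop(feats, q)
print(hops)  # indices of the frames visited across the hops
```

In the sketch, each hop conditions on the frame chosen by the previous hop, which is the essential property that lets a model chain evidence across time rather than attend to all frames uniformly.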



Figure 1: Snitch Localization in CATER (Girdhar & Ramanan, 2020) is an object permanence task where the goal is to classify the final location of the snitch object within a 2D grid space.


* Work done as a NEC Labs intern.
1 https://github.com/necla-ml/

