TEMPORAL AND OBJECT QUANTIFICATION NETS

Abstract

We aim to learn generalizable representations for complex activities by quantifying over both entities and time, as in "the kicker is behind all the other players," or "the player controls the ball until it moves toward the goal." Such a structural inductive bias toward object relations, object quantification, and temporal orders will enable the learned representation to generalize to situations with varying numbers of agents, objects, and time courses. In this paper, we present Temporal and Object Quantification Nets (TOQ-Nets), which provide such a structural inductive bias for learning composable action concepts from time sequences that describe the properties and relations of multiple entities. We evaluate TOQ-Nets on two benchmarks: trajectory-based soccer event detection, and 6D pose-based manipulation concept learning. We demonstrate that TOQ-Nets can generalize from small amounts of data to scenarios where there are more agents and objects than were present during training. The learned concepts are also robust to temporally warped sequences, and transfer easily to other prediction tasks in a similar domain.

1. INTRODUCTION

When watching a soccer match (Fig. 1), we see more than just players and a ball moving around. Rather, we see events and actions in terms of high-level concepts, including relations between agents and objects. For example, people can easily recognize when one player has control of the ball, or when a player passes the ball to another player. This cognitive act is effortless, intuitive, and fast. Machines can recognize actions, too, but generally only within limited windows of space and time, and with weak generalization. Consider the variety of complicated spatio-temporal trajectories that passing can refer to (Fig. 1), and how an intelligent agent could learn this. The act of passing does not seem to be about the pixel-level specifics of the spatio-temporal trajectory. Rather, a pass is a high-level action concept that is composed of other concepts, such as possession (a pass begins with one player in possession of the ball and ends with another player in possession) or kick. The concept of passing can itself be re-used in a compositional way, for example to distinguish between a short pass and a long pass.

We propose Temporal and Object Quantification Nets (TOQ-Nets), structured neural networks that learn to describe complex activities by quantifying over both entities and time. TOQ-Nets are motivated by the way in which humans perceive actions in terms of the properties of and relations between agents and objects, and the sequential structure of events (Zacks et al., 2007; Stränger & Hommel, 1996). A TOQ-Net is a multi-layer neural network whose inputs are the properties of agents and objects and their relationships in a scene, which may change over time. In a soccer game, these inputs might be the 3D position of each player and the ball.
Each layer in the TOQ-Net performs either object or temporal quantification. Object quantification layers can realize disjunctive and conjunctive quantification over the properties of and relationships between entities, as in "the kicker is behind all the other players"; temporal quantification layers can realize quantification over time, as in "the player controls the ball until it moves fast toward the goal." The key idea of TOQ-Nets is to use tensors to represent the relational features between agents and objects (e.g., the player controls the ball, and the ball is moving fast), and to use tensor pooling operations over different dimensions to realize temporal and object quantifiers (e.g., all and until). Thus, by stacking these object and temporal quantification operations, TOQ-Nets can learn to construct higher-level concepts of actions based on the relations between entities over time, starting from low-level position and velocity input and supervised with only high-level class labels.
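
The pooling idea can be illustrated with a small NumPy sketch. This is not the TOQ-Net implementation: it assumes that truth values are soft scores in [0, 1], that "exists" is realized by max pooling and "for all" by min pooling, and that tensors are laid out as [time, entity, entity]. The helper names (`exists`, `forall`, `until`) and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def exists(x, axis):
    """Soft 'there exists' along one tensor dimension (assumed: max pooling)."""
    return x.max(axis=axis)

def forall(x, axis):
    """Soft 'for all' along one tensor dimension (assumed: min pooling)."""
    return x.min(axis=axis)

def until(p, q):
    """Soft 'p until q' along time axis 0: q holds at some step t,
    and p holds at every step strictly before t."""
    running_min = np.minimum.accumulate(p, axis=0)
    # p_before[t] = min of p over steps earlier than t (vacuously 1 at t = 0).
    p_before = np.concatenate([np.ones_like(p[:1]), running_min[:-1]], axis=0)
    return exists(np.minimum(q, p_before), axis=0)

# Toy relation: behind[t, i, j] = 1 if entity i is behind entity j at time t.
T, N = 3, 3
behind = np.zeros((T, N, N))
behind[0, 0, 1] = behind[0, 0, 2] = 1.0  # t=0: entity 0 is behind both others
behind[1, 0, 1] = 1.0                    # t=1: entity 0 is behind entity 1 only

# Ignore self-pairs by treating "i is behind i" as true before min pooling.
behind_others = np.maximum(behind, np.eye(N))

# "Entity i is behind all the other entities": pool over j.      Shape [T, N].
behind_all = forall(behind_others, axis=2)
# "Some entity is behind all the others": pool over i.           Shape [T].
some_behind_all = exists(behind_all, axis=1)
# Temporal quantification: pool over the time dimension.
ever_behind_all = exists(some_behind_all, axis=0)
always_behind_all = forall(some_behind_all, axis=0)

# "The player controls the ball until it moves toward the goal."
controls = np.array([1.0, 1.0, 0.0])
toward_goal = np.array([0.0, 0.0, 1.0])
controls_until_goal = until(controls, toward_goal)

# Control is lost at t=1, before the ball moves toward the goal at t=2.
lost_control = np.array([1.0, 0.0, 0.0])
lost_until_goal = until(lost_control, toward_goal)
```

Because every quantifier is a pooling operation over one dimension, the same functions apply unchanged when the number of entities N or the sequence length T differs from the toy values used here.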

