TEMPORAL AND OBJECT QUANTIFICATION NETS

Abstract

We aim to learn generalizable representations for complex activities by quantifying over both entities and time, as in "the kicker is behind all the other players," or "the player controls the ball until it moves toward the goal." Such a structural inductive bias of object relations, object quantification, and temporal orders will enable the learned representation to generalize to situations with varying numbers of agents, objects, and time courses. In this paper, we present Temporal and Object Quantification Nets (TOQ-Nets), which provide such structural inductive bias for learning composable action concepts from time sequences that describe the properties and relations of multiple entities. We evaluate TOQ-Nets on two benchmarks: trajectory-based soccer event detection, and 6D pose-based manipulation concept learning. We demonstrate that TOQ-Nets can generalize from small amounts of data to scenarios where there are more agents and objects than were present during training. The learned concepts are also robust with respect to temporally warped sequences, and easily transfer to other prediction tasks in a similar domain.

1. INTRODUCTION

When watching a soccer match (Fig. 1), we see more than just players and a ball moving around. Rather, we see events and actions in terms of high-level concepts, including relations between agents and objects. For example, people can easily recognize when one player has control of the ball, or when a player passes the ball to another player. This cognitive act is effortless, intuitive, and fast. Machines can recognize actions, too, but generally based on limited windows of space and time, and with weak generalization. Consider the variety of complicated spatio-temporal trajectories that passing can refer to (Fig. 1), and how an intelligent agent could learn this concept. The act of passing does not seem to be about the pixel-level specifics of the spatio-temporal trajectory. Rather, a pass is a high-level action concept that is composed of other concepts, such as possession (a pass begins with one player in possession of the ball and ends with another player in possession) or kick. The concept of passing can itself be re-used compositionally, for example to distinguish between a short pass and a long pass.

We propose Temporal and Object Quantification Nets (TOQ-Nets), structured neural networks that learn to describe complex activities by quantifying over both entities and time. TOQ-Nets are motivated by the way in which humans perceive actions in terms of the properties of and relations between agents and objects, and the sequential structure of events (Zacks et al., 2007; Stränger & Hommel, 1996). A TOQ-Net is a multi-layer neural network whose inputs are the properties of agents and objects and their relationships in a scene, which may change over time. In a soccer game, these inputs might be the 3D positions of each player and the ball.
Each layer in the TOQ-Net performs either object or temporal quantification, emulating disjunctive and conjunctive quantification over the properties of and relationships between entities, as in "the kicker is behind all the other players," as well as quantification over time, as in "the player controls the ball until it moves fast towards the goal." The key idea of TOQ-Nets is to use tensors to represent the relational features between agents and objects (e.g., the player controls the ball, and the ball is moving fast), and to use tensor pooling operations over different dimensions to realize temporal and object quantifiers (all and until). Thus, by stacking these object and temporal quantification operations, TOQ-Nets can learn to construct higher-level action concepts from the relations between entities over time, starting from low-level position and velocity inputs and supervised with only high-level class labels.

We evaluate TOQ-Nets on two perceptually and conceptually different benchmarks for action recognition: trajectory-based soccer event detection and 6D pose-based manipulation concept learning. We show that TOQ-Nets make several important contributions. First, TOQ-Nets outperform both convolutional and recurrent baselines for modeling relational-temporal concepts across both benchmarks. Second, by exploiting the temporal-relational features learned through supervised learning, TOQ-Nets achieve strong few-shot generalization to novel actions. Finally, TOQ-Nets generalize well to scenarios with more agents and objects than were present during training, are robust to time-warped input trajectories, and learn concepts that transfer easily to other prediction tasks in a similar domain.
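The pooling-as-quantification idea above can be sketched concretely. The following is a minimal illustrative sketch, not the authors' implementation: relational features are stored in a tensor feat[t, o, c], the soft truth value of predicate c for object o at time t, and object quantifiers become min/max pooling over the object axis, while a soft "until" reduces over the time axis. All function names here are hypothetical.

```python
import numpy as np

def forall_objects(feat):
    """'For all objects' realized as min-pooling over the object axis."""
    return feat.min(axis=1)

def exists_object(feat):
    """'There exists an object' realized as max-pooling over the object axis."""
    return feat.max(axis=1)

def soft_until(p, q):
    """One soft reading of 'p until q': there is a time t at which q holds
    and p held at every earlier step, scored as max_t min(q[t], min_{t'<t} p[t'])."""
    best, prefix = 0.0, 1.0
    for pt, qt in zip(p, q):
        best = max(best, min(qt, prefix))
        prefix = min(prefix, pt)
    return best

# Toy input: T=2 time steps, N=2 players, one predicate
# ("this player is behind the kicker").
feat = np.array([[[0.9], [0.8]],
                 [[0.2], [0.7]]])
print(forall_objects(feat))   # per-time "all players are behind the kicker"
print(exists_object(feat))    # per-time "some player is behind the kicker"

# "the player controls the ball until it moves fast towards the goal"
controls = [0.9, 0.8, 0.1]
moves_fast = [0.0, 0.2, 0.95]
print(soft_until(controls, moves_fast))
```

Because min/max pooling is differentiable almost everywhere, stacking such layers lets gradients flow from high-level class labels down to the input features, which is what permits end-to-end training with only action-class supervision.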

2. RELATED WORK

Action concept representations and learning. TOQ-Nets impose a structural inductive bias that describes actions with quantification over objects (every entity ..., there exists an entity ...) and quantification over time (an event happens until some time and then some other event begins), motivated by first-order and linear temporal logics (Pnueli, 1977). Such representations have been studied for analyzing sporting events (Intille & Bobick, 1999; 2001) and daily actions (Tran & Davis, 2008; Brendel et al., 2011) using logic-based reasoning frameworks. However, these frameworks require extra knowledge to annotate the relationships between low-level, primitive actions and complex ones. By contrast, TOQ-Nets enable end-to-end learning of complex action descriptions with only high-level action-class labels. Cognitive science has long recognized the importance of structural representations of actions and events in human reasoning, including in language, memory, perception, and development (see e.g. Stränger & Hommel, 1996; Zacks et al., 2001; 2007; Pinker, 2007; Baldwin et al., 2001). In this paper, we focus on the role of object and temporal quantification in learning action representations.

Temporal and relational reasoning. This paper is also related to work on using neural networks and other data-driven models for modeling temporal structure. Early work includes ADL description languages (Intille & Bobick, 1999; Zhuo et al., 2019), hidden Markov models (Tang et al., 2012), and and-or graphs (Gupta et al., 2009; Tang et al., 2013). These models need human-annotated action descriptions (e.g., pick up x means a state transition from not holding x to holding x) and special-purpose inference algorithms such as graph structure learning algorithms. In contrast, TOQ-Nets have an end-to-end design and can be integrated with arbitrary differentiable modules such as convolutional neural networks.
People have also used structural representations to model object-centric temporal concepts with temporal graph convolution networks (Yan et al., 2018; Materzynska et al., 2020; Wang & Gupta, 2018; Ji et al., 2020). Meanwhile, Deng et al. (2016) proposed to integrate graph neural networks with RNNs by replacing the message aggregation step in GNN propagation with RNNs to capture temporal information. Bialkowski et al. (2014) and Ibrahim et al. (2016) have proposed to build a person-level and a group-level feature extractor for group activity recognition. The high-level idea is to use RNNs to encode per-person features across the video and use another RNN to combine the features of individual persons into a group feature for every frame. StagNet (Qi et al., 2018) and Spatial-Temporal Interaction Networks (Materzynska et al., 2020) combine graph neural networks with RNNs and spatial-temporal attention models to capture temporal information. TOQ-Nets use a similar object-centric relational representation, but different models for temporal structures.

Figure 1: Illustrative example of action recognition in a soccer simulator. The action concept short pass is composed of other concepts, such as possession and kick.

