MOVEMENT-TO-ACTION TRANSFORMER NETWORKS FOR TEMPORAL ACTION PROPOSAL GENERATION

Abstract

Temporal action proposal generation aims to identify temporal intervals that contain human actions in untrimmed videos. For arbitrary actions, this requires learning long-range interactions. We propose an end-to-end Movement-to-Action Transformer Network (MatNet) that uses results of human movement studies to encode actions ranging from localized, atomic, body-part movements to longer-range, semantic movements involving subsets of body parts. In particular, we make direct use of the results of Laban Movement Analysis (LMA), using LMA-based measures of movement as computational definitions of actions. From the input of RGB + Flow (I3D) features and 3D pose, we compute LMA-based low-to-high-level movement features, and learn action proposals by applying two heads on the boundary Transformer and three heads on the proposal Transformer, trained with five types of losses. We visualize and explain relations between the movement descriptors and the attention maps of the action proposals. We report results from a number of experiments on the Thumos14, ActivityNet and PKU-MMD datasets, showing that MatNet matches or exceeds state-of-the-art performance on the temporal action proposal generation task.¹

¹ We will make all of the datasets, resources and programs publicly available.



Figure 1: Overview of our MatNet architecture. It contains two main components: (1) Movement descriptors, shown in the middle column, and (2) Transformer networks for action proposal generation, shown in the right column. Given a sequence of untrimmed video frames, MatNet uses Laban Movement Analysis constructs to generate body-part (atomic) level and subset-of-parts (semantic) level descriptors of human movements, and feeds them as movement representations to action-boundary-sensitive Transformer networks to generate action proposals.

With advances in the understanding of trimmed human action videos, the focus has begun to shift to longer, untrimmed videos. This has increased the need to segment videos into action clips, that is, to identify the temporal intervals that contain actions. This is the goal of temporal action proposal generation (TAPG) for human action understanding. Many factors make the problem challenging: (1) Location: An action can start at any time. (2) Duration: The time taken by an action can vary greatly. (3) Background: Irrelevant content can

