MOVEMENT-TO-ACTION TRANSFORMER NETWORKS FOR TEMPORAL ACTION PROPOSAL GENERATION

Abstract

The task of temporal action proposal generation aims to identify temporal intervals containing human actions in untrimmed videos. For arbitrary actions, this requires learning long-range interactions. We propose an end-to-end Movement-to-Action Transformer Network (MatNet) that uses results of human movement studies to encode actions ranging from localized, atomic, body-part movements to longer-range, semantic movements involving subsets of body parts. In particular, we make direct use of the results of Laban Movement Analysis (LMA), using LMA-based measures of movements as computational definitions of actions. From the input of RGB + Flow (I3D) features and 3D pose, we compute LMA-based low-to-high-level movement features, and learn action proposals by applying two heads on the boundary Transformer and three heads on the proposal Transformer, trained with five types of losses. We visualize and explain the relations between the movement descriptors and the attention maps of the action proposals. We report results from a number of experiments on the Thumos14, ActivityNet and PKU-MMD datasets, showing that MatNet matches or exceeds state-of-the-art performance on the temporal action proposal generation task.
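At a high level, the pipeline summarized in the abstract can be sketched as follows. All function names, the descriptor definition, and the random linear "heads" are illustrative placeholders for shape-level intuition only, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def movement_descriptors(i3d_feats, pose3d):
    """Illustrative stand-in for LMA-based movement features:
    concatenate video features with simple pose dynamics
    (per-joint speed). Not the paper's actual descriptors."""
    vel = np.diff(pose3d, axis=0, prepend=pose3d[:1])   # (T, J, 3) joint velocities
    speed = np.linalg.norm(vel, axis=-1)                # (T, J) per-joint speed
    return np.concatenate([i3d_feats, speed], axis=-1)  # (T, D + J)

def boundary_heads(feats):
    """Two heads on the boundary branch: per-frame start / end scores
    (here just random linear projections through a sigmoid)."""
    start = 1 / (1 + np.exp(-(feats @ rng.normal(size=feats.shape[1]))))
    end = 1 / (1 + np.exp(-(feats @ rng.normal(size=feats.shape[1]))))
    return start, end                                   # each (T,)

T, D, J = 100, 32, 17                                   # frames, I3D dim, joints
i3d = rng.normal(size=(T, D))
pose = rng.normal(size=(T, J, 3))
feats = movement_descriptors(i3d, pose)
start, end = boundary_heads(feats)
# pair confident starts with later confident ends (simplified)
proposals = [(s, e) for s in np.flatnonzero(start > 0.9)
             for e in np.flatnonzero(end > 0.9) if e > s]
```

The sketch only fixes the data flow (features + pose -> movement descriptors -> boundary scores -> proposals); the Transformer backbones and the five losses are omitted.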



For example, the four Effort Factors are Space c1 ∈ R^{T×2} (cognitive category Attention, extremes {Direct, Indirect}), Weight c2 ∈ R^{T×1} (Intention, {Strong, Light}), Time c3 ∈ R^{T×1} (Decision, {Sudden, Sustained}) and Flow c4 ∈ R^{T×1}. Each Factor is also associated with a movement's underlying (cognitive) category, and with a geometric structure and space (plane) that capture the position, direction, rotation, velocity, acceleration, distance, curvature and volume associated with the movement; Factor values lie between two extremes (Col 7). Human movement analysts point out that although each individual may combine the Factors in ways specific to their cultural, personal and artistic preferences, {c_i} (i = 1, ..., 8) remain valid for all movements Bartenieff & Lewis (1980) and human activities Santos (2014), and can be used to describe human movement at the semantic level. In this paper, we use these Factors as bases for obtaining temporal action proposals through MatNet, which automatically determines the combinations of Factors best suited for action detection and localization.

Top-down methods employ a fixed-size sliding window or anchors to first predict action proposals, and then refine their boundaries based on the proposals' estimated confidence scores. Bottom-up methods first estimate the probability that each frame lies in the interior or at a boundary of an action, and then select proposals based on their confidence scores. However, these confidence scores are based on local information and do not make full use of long-range (global) context. Although various techniques have been proposed to model local and global contextual information, the video information they use is low-level: they do not incorporate multilevel representations, e.g., from low-level video features to higher-level models of human body structure and dynamics. In this paper, we incorporate such knowledge using the Laban theory of human movement Guest (2005).
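The bottom-up pipeline just described can be sketched as a toy illustration; the peak definition, threshold and scoring rule below are assumptions for exposition, not the cited methods' exact procedures:

```python
import numpy as np

def bottom_up_proposals(start_prob, end_prob, actionness, thr=0.5, max_len=50):
    """Boundary-based (bottom-up) proposal generation sketch:
    pick frames whose start/end probability is a local peak above `thr`,
    pair each start with later ends, and score the pair by the boundary
    probabilities times the mean interior actionness."""
    def peaks(p):
        return [t for t in range(1, len(p) - 1)
                if p[t] > thr and p[t] >= p[t - 1] and p[t] >= p[t + 1]]
    props = []
    for s in peaks(start_prob):
        for e in peaks(end_prob):
            if s < e <= s + max_len:
                score = start_prob[s] * end_prob[e] * actionness[s:e + 1].mean()
                props.append((s, e, score))
    return sorted(props, key=lambda x: -x[2])

# synthetic example: one confident action spanning frames 10..30
T = 60
start_p = np.zeros(T); start_p[10] = 0.9
end_p = np.zeros(T); end_p[30] = 0.8
act = np.zeros(T); act[10:31] = 1.0
print(bottom_up_proposals(start_p, end_p, act)[0][:2])  # → (10, 30)
```

Local boundary scores like these are exactly what lack global context; MatNet's motivation is to inject long-range, body-level structure into this scoring.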
Laban Movement Analysis (LMA) is a widely used framework that captures qualitative aspects of movement important for the expression and communication of actions, emotions, etc. LMA characterizes movement using five components: Body, Effort, Shape, Space and Relationship (Table 1). Each addresses specific properties of movement and can be represented in Labanotation Guest (2005), a movement notation tool. LMA is a good representation because it integrates high-level semantic features with low-level kinematic features. Li et al. (2019) analyze different kinds of dance movements and generate Labanotation scores; however, they work with manually trimmed videos. Some early works segment dance movements using LMA. Bouchard & Badler (2007) detect movement boundaries from large changes in a series of weighted LMA components. Sonoda et al. (2008) present a method for segmenting whole-body dance movement into "unit movements", which are defined using the LMA components of Effort, Space and Shape, constructed based on the judgments of dance novices, and used as primitives. However, these methods do not use hierarchical movement representations.

We will make all of the datasets, resources and programs publicly available.
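A toy sketch of the change-based segmentation idea attributed above to Bouchard & Badler (2007): form a weighted combination of per-frame LMA component values and mark outlier frame-to-frame changes as boundaries. The weighting scheme and the z-score threshold below are assumptions, not the original method's exact procedure:

```python
import numpy as np

def segment_boundaries(components, weights, z_thr=2.0):
    """components: (T, K) per-frame values of K LMA components.
    Returns frame indices where the weighted signal changes abruptly
    (frame-to-frame change more than z_thr std. dev. above the mean)."""
    signal = components @ weights                   # (T,) weighted LMA signal
    change = np.abs(np.diff(signal))                # (T-1,) frame-to-frame change
    z = (change - change.mean()) / (change.std() + 1e-8)
    return np.flatnonzero(z > z_thr) + 1            # boundary frame indices

T, K = 200, 8                                       # frames, LMA components
comps = np.zeros((T, K))
comps[120:] += 3.0                                  # abrupt movement change at t=120
w = np.ones(K) / K                                  # uniform weighting (assumed)
print(segment_boundaries(comps, w))                 # → [120]
```

Such thresholded, hand-weighted rules contrast with MatNet, which learns which combinations of Factors signal action boundaries.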



Figure 1: Overview of our MatNet architecture. It contains two main components: (1) movement descriptors, shown in the middle column, and (2) Transformer networks for action proposal generation, shown in the right column. Given a sequence of untrimmed video frames, MatNet uses Laban Movement Analysis constructs to generate body-part (atomic) level and subset-of-parts (semantic) level descriptors of human movements, which serve as movement representations input to action-boundary-sensitive Transformer networks that generate action proposals.

With advances in the understanding of trimmed human action videos, the focus has begun to shift to longer, untrimmed videos. This has increased the need to segment videos into action clips, i.e., to identify the temporal intervals containing actions. This is the goal of the task of temporal action proposal generation (TAPG) for human action understanding. Several factors make the problem challenging: (1) Location: an action can start at any time. (2) Duration: the time taken by an action can vary greatly. (3) Background: irrelevant content can be highly diverse. (4) Number of actions: unknown and unlimited. (5) Action set: unknown. (6) Ordering: unknown. Many problems can benefit from accurate localization of human activities, such as activity recognition, video captioning and action retrieval Ryoo et al. (2020); Deng et al. (2021). Recent approaches can be divided into top-down (anchor-based) Gao et al. (2017a); Liu et al. (2019a); Gao et al. (2020) and bottom-up (boundary-based) Lin et al. (2019); Su et al. (2021); Islam et al. (2021).
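To make the top-down family concrete, here is a minimal sketch of anchor-based candidate generation. The window sizes and stride ratio are illustrative choices, not the configurations used by the cited methods:

```python
import numpy as np

def anchor_proposals(T, window_sizes=(8, 16, 32), stride_ratio=0.5):
    """Top-down (anchor-based) candidate generation sketch: slide
    fixed-size windows of several scales over a T-frame video; each
    (start, end) anchor is later scored and boundary-refined by a model."""
    anchors = []
    for w in window_sizes:
        stride = max(1, int(w * stride_ratio))
        for s in range(0, T - w + 1, stride):
            anchors.append((s, s + w))
    return anchors

print(len(anchor_proposals(64)))  # → 25 anchors over a 64-frame video
```

The fixed scales are exactly why top-down methods struggle with the large duration variability noted in challenge (2).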

LMA classifies movement into three main categories (Col 1) Santos (2014): (1) Non-Kinematic and (2) Kinematic, each represented by two Components (Col 2): Effort and Shape for (1), and Body and Space for (2); category (3) concerns Relationships between (1) and (2) (Cols 1, 2). The Components describe the categories using eight Factors, denoted by multidimensional variables {c_i} (i = 1, ..., 8) (Col 4), each having a different dimension (Col 5).
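As a concrete, hypothetical illustration of Factors as per-frame multidimensional time series c_i ∈ R^{T×d_i}, the sketch below derives simple kinematic proxies with the shapes listed in Table 1 from a 3D pose sequence. These proxy definitions are assumptions for exposition; they are not the definitions used by MatNet:

```python
import numpy as np

def effort_factors(pose3d):
    """Illustrative proxies for the four Effort Factors as per-frame
    time series matching the tabulated shapes:
    Space c1 in R^{T x 2}; Weight c2, Time c3, Flow c4 in R^{T x 1}."""
    vel = np.gradient(pose3d, axis=0)                # (T, J, 3) joint velocities
    acc = np.gradient(vel, axis=0)                   # (T, J, 3) accelerations
    speed = np.linalg.norm(vel, axis=-1)             # (T, J) per-joint speed
    c2 = speed.mean(axis=1, keepdims=True) ** 2      # Weight ~ kinetic-energy proxy
    c3 = np.linalg.norm(acc, axis=-1).mean(axis=1, keepdims=True)  # Time ~ accel.
    c4 = np.abs(np.gradient(c3, axis=0))             # Flow ~ change of effort
    # Space: two channels, e.g. horizontal vs vertical motion extent (assumed)
    c1 = np.stack([np.abs(vel[..., 0]).mean(axis=1),
                   np.abs(vel[..., 2]).mean(axis=1)], axis=1)      # (T, 2)
    return c1, c2, c3, c4

pose = np.random.default_rng(1).normal(size=(100, 17, 3))  # T=100 frames, 17 joints
c1, c2, c3, c4 = effort_factors(pose)
print(c1.shape, c2.shape, c3.shape, c4.shape)  # → (100, 2) (100, 1) (100, 1) (100, 1)
```

The point of the sketch is only the interface: each Factor is a temporal signal with its own dimensionality, which MatNet can weight and combine per action.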

