MOVEMENT-TO-ACTION TRANSFORMER NETWORKS FOR TEMPORAL ACTION PROPOSAL GENERATION

Abstract

The task of temporal action proposal generation aims to identify temporal intervals containing human actions in untrimmed videos. For arbitrary actions, this requires learning long-range interactions. We propose an end-to-end Movement-to-Action Transformer Network (MatNet) that uses results of human movement studies to encode actions ranging from localized, atomic, body-part movements to longer-range, semantic movements involving subsets of body parts. In particular, we make direct use of the results of Laban Movement Analysis (LMA), using LMA-based measures of movements as computational definitions of actions. From input RGB + Flow (I3D) features and 3D pose, we compute LMA-based low-to-high-level movement features, and learn action proposals by applying two heads on the boundary Transformer and three heads on the proposal Transformer, trained with five types of losses. We visualize and explain relations between the movement descriptors and the attention maps of the action proposals. We report results of experiments on the THUMOS14, ActivityNet and PKU-MMD datasets, showing that MatNet achieves SOTA or better performance on the temporal action proposal generation task.



1. INTRODUCTION

With advances in the understanding of trimmed human action videos, the focus has begun to shift to longer, untrimmed videos. This has increased the need for segmentation of videos into action clips, namely, identification of the temporal intervals containing actions. This is the goal of temporal action proposal generation (TAPG) for human action understanding. Many factors make the problem challenging: (1) Location: an action can start at any time. (2) Duration: the time taken by an action can vary greatly. (3) Background: irrelevant content can be highly diverse. (4) Number of actions: unknown and unlimited. (5) Action set: unknown. (6) Ordering: unknown. Many problems can benefit from accurate localization of human activities, such as activity recognition, video captioning and action retrieval Ryoo et al. (2020); Deng et al. (2021).

Table 1 summarizes the LMA components (Col 1, 2). The components describe the movement categories using eight Factors, denoted by multidimensional variables $\{c_i\}_{i=1}^{8}$ (Col 4), each having a different dimension (Col 5). Each Factor is also associated with a movement's underlying (cognitive) category, and with a geometric structure and space (plane) that capture the position, direction, rotation, velocity, acceleration, distance, curvature and volume associated with the movement, taking values between two extremes (Col 7). Human movement analysts point out that although each individual may combine the Factors in ways specific to the individual and their cultural, personal and artistic preferences, $\{c_i\}_{i=1}^{8}$ remain valid for all movements Bartenieff & Lewis (1980) and human activities Santos (2014), and can be used to describe human movement at the semantic level. In this paper, we use these Factors as bases to obtain temporal action proposals through MatNet, which automatically determines the combinations of Factors best suited for action detection and localization.
Recent approaches can be divided into top-down (anchor-based) Gao et al. (2017a); Liu et al. (2019a); Gao et al. (2020) and bottom-up (boundary-based) Lin et al. (2019); Su et al. (2021); Islam et al. (2021). The former employ fixed-size sliding windows or anchors to first predict action proposals, and then refine their boundaries based on the estimated confidence scores of the proposals. The latter first generate probabilities that each frame is in the middle or at a boundary of an action, and then obtain optimal proposals based on the confidence scores of the proposals. However, these confidence scores are based on local information, without making full use of long-range (global) context. Although different techniques have been proposed to model local and global contextual information, the video information they use is low-level. They do not incorporate multilevel representations, e.g., from low-level video features to higher-level models of human body structure and dynamics. In this paper, we incorporate such knowledge using the Laban theory of human movement Guest (2005). Laban Movement Analysis (LMA) is a widely used framework that captures qualitative aspects of movement important for expression and communication of actions, emotions, etc. LMA characterizes movement using five components: Body, Effort, Shape, Space and Relationship (Table 1). Each addresses specific properties of movement and can be represented in Labanotation Guest (2005), a movement notation tool. LMA is a good representation, integrating high-level semantic features and low-level kinematic features. Li et al. (2019) analyze different kinds of dance movements and generate Labanotation scores; however, they work with manually trimmed videos. Some early works segment dance movements using LMA. Bouchard & Badler (2007) detect movement boundaries from large changes in a series of weighted LMA components. Sonoda et al. (2008) present a method of segmenting whole-body dance movement in terms of "unit movements", defined using the LMA components of Effort, Space and Shape, constructed based on the judgments of dance novices, and used as primitives. However, these works do not use hierarchical motion patterns, which are central to human descriptions of dances (e.g., patterns like hop-step left-hop-step right), and are therefore limited to only basic movements.

Under review as a conference paper at ICLR 2023

In this paper, we propose the end-to-end Movement-to-Action Transformer Network (MatNet) for temporal action proposal generation. As shown in Figure 1, our TAPG Transformer consists of two main types of modules: movement descriptors and Transformer networks. The former include an atomic movement descriptor $F_a$ (Sec. 4.1), which recognizes the movements of each body part, and a semantic movement descriptor $F_s$ (Sec. 4.2), which quantitatively describes human movement. The Transformer networks comprise a boundary Transformer $\Phi_b$ (Sec. 4.3) and a proposal Transformer $\Phi_p$ (Sec. 4.4); they enable capturing long-range contextual information.

The main contributions of this paper are as follows:
• We propose an end-to-end Movement-to-Action Transformer Network that uses a range of low-level (atomic) to high-level (semantic) human movements for temporal action proposal generation.
• Our high-level features are based on movement concepts evolved by human movement (e.g., dance) experts.
• Our method is robust to occlusions of humans by other humans or objects because our underlying human pose detector (LCRNet) has such robustness.

2. RELATED WORK

Temporal Action Segmentation aims to segment an untrimmed video and label each segment with one of a set of pre-defined action labels. Some network-based methods capture both short- and long-term dependencies Farha & Gall (2019); Gao et al. (2021). Some methods are weakly-supervised Li et al. (2021); Fayyaz & Gall (2020). Others propose unsupervised segmentation of complex activities without any additional input Sarfraz et al. (2021); Li & Todorovic (2021).

Temporal Action Proposal Generation. Unlike temporal action segmentation, which involves both localization and recognition of actions from a given set, temporal action proposal generation (TAPG) addresses the more general localization of actions without knowledge of the names, number, and order of the actions, which requires long-range global information.

3. LMA AS A MOVEMENT DESCRIPTOR

Given an untrimmed video $\{I_t\}_{t=1}^{T}$, our goal is to generate a set of $N_g$ proposals $\hat{\Psi} = \{(\hat{t}^n_s, \hat{t}^n_e)\}_{n=1}^{N_g}$ (where $t_s$ and $t_e$ are the starting and ending times) that are close to the ground-truth action proposals $\Psi = \{(t^n_s, t^n_e)\}_{n=1}^{N}$. As shown in Fig. 1, the proposed MatNet uses a range of features to capture the movements of body parts (limbs, joints, torso, head). The most primitive of these are their individual instantaneous displacements and temporal trajectories, which we call atomic movements; combinations of atomic movements form higher-level constructs we call semantic movements. While the atomic movements are basic primitives, for semantic features we do not use our own constructs; instead, we use descriptors evolved by dance experts. These descriptors use the dance vocabulary given by the Laban Movement Analysis (LMA) system Santos (2014), which is well studied and defined in terms of kinematic and dynamic equations. Table 1 presents details of the LMA representation. In the rest of this section, we give a brief overview of the five LMA components (Table 1, Col 2) that are central to our proposed methods. (1) Non-Kinematic-Effort captures dynamic characteristics with respect to "inner intention". It involves four Factors: Space $c_1$ describes the person's attention to the environment when moving, with values ranging from direct (single-focused) to indirect (multi-focused). It can be formulated as the moving direction $(\theta_x, \theta_y)$ of the person in the horizontal plane; a stable direction represents a direct movement and an unstable direction represents the opposite. Weight $c_2$ describes the strength of the movement, with intention on the person's own body, with values ranging from strong (fast and powerful) to light (slow and fragile). It can be formulated as the average velocity of body joints; a larger velocity represents a stronger movement.
Time $c_3$ indicates whether the person has decided and knows the right moment to move, ranging from sustained (leisurely) to sudden (instantaneous, in a hurry). It can be formulated as the average acceleration of body joints; a larger acceleration represents a more sudden movement. Flow $c_4$ describes control of the movement, ranging from free (uncontrollable) to bound (controlled). It can be formulated as the sum of the jerks of the body joints; a larger jerk represents a freer movement. (2) Non-Kinematic-Shape studies the way the body changes shape during movement and involves three Factors: Shaping $c_5$ describes shape changes with respect to the environment as seen from the $(x, y, z)$ directions, i.e., projections on the vertical, horizontal and sagittal planes. It can be formulated as the area of the convex hull of the body joint locations projected on each of the three planes; a large area on each plane represents rising, spreading and advancing, whereas a small area represents sinking, enclosing and retreating. Directional $c_6$ again describes shape changes, but in terms of joint movements, ranging from spoke-like (a body joint moves in a straight line) to arc-like (a body joint moves in an arc). It can be formulated as the curvature of the movement trajectory of each limb's end joint with the root joint as origin; a large curvature represents an arc-like movement and a small curvature a more spoke-like movement. Shape Flow $c_7$ describes self-motivated growing and shrinking of the "internal Kinesphere". It can be formulated as the volume of the convex hull of the 3D body joint locations; a large volume represents a growing Kinesphere. (3) Kinematic-Body studies structural interrelationships within the body while moving, describing which body parts are moving, connected, and influenced by others. It has only one Factor, Body $c_8$, which can be formulated as the average rotation angles of joints. (4) Kinematic-Space is about the level and direction of a body part's movement.
We skip this component, since it is already captured by our low-level "atomic" body part movements and we want to capture only non-local, semantic movements. (5) Relationship describes the relations between a person and their surroundings. We also skip this component, since it is less well defined and our current architecture focuses on the movement without modeling the surroundings and interactions with them.

4.1. ATOMIC MOVEMENT DESCRIPTOR

Since ground truth for body part movements is not available for many of the large human action datasets, we use unsupervised methods for body part movement recognition, as in Hu & Ahuja (2021). Given a sequence of 3D poses $\{\{P^j_t\}_{j\in J_e}\}_{t=0}^{T-1}$ of all the joints $j \in J_e$ connected to a body part $e$, we classify the body part's movements $m_e = \{m^e_t\}_{t=0}^{T-1}$ using movement labels (left, right, etc.), characterizing homogeneity of motion direction and level as in Hu & Ahuja (2021), and representing all limb movements in a coordinate frame centered on the torso. We first calculate the velocity of the end joint of each limb $e$ (e.g., wrist for the lower arm, elbow for the upper arm, ankle for the lower leg and knee for the upper leg) as $v^{ij}_t = \overrightarrow{P^i_t P^j_t} - \overrightarrow{P^i_{t-1} P^j_{t-1}}$, where $i$ and $j$ are the root and end joints of the limb, respectively. The velocity vector $v^{ij}_t$ is then transformed into the torso coordinate system: $\tilde{v}^{ij}_t = v^{ij}_t \cdot \frac{v^{torso}_t}{\|v^{torso}_t\|}$, where $v^{torso}_t$ is the 3D torso vector. To extract major sustained movements in a direction, we identify salient peaks and valleys in the velocity profile $\Lambda = \{\tilde{v}^{ij}_t\}_{t=1}^{T}$. To suppress noise, we first smooth the velocity profile $\Lambda$ with a low-pass filter while retaining a majority of the power. We then identify the $k$ peaks and valleys with the largest area, and save their timestamps together with the labels of the corresponding movements. In experiments, we estimate the movements $m_e = \{m^e_t\}_{t=0}^{T-1} \in \mathbb{R}^{T\times 45}$ of 14 body parts (arms, legs, torso, hip, shoulders, head), each having 2-4 binary movement labels (e.g., move up vs. move down, extension vs. flexion).
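The smoothing and salient peak/valley extraction described above can be sketched as follows. This is a minimal illustration: the Butterworth filter order, the cutoff, the prominence floor, and the function name `salient_extrema` are our assumptions, and peak prominence is used as a stand-in for the paper's "largest area" criterion.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def salient_extrema(velocity, k=3, cutoff=0.1, min_prom=1e-3):
    """Low-pass filter a 1-D velocity profile, then return the k most
    prominent peaks and valleys (timestamps as sample indices)."""
    b, a = butter(2, cutoff)                  # 2nd-order low-pass Butterworth
    smoothed = filtfilt(b, a, velocity)       # zero-phase filtering
    peaks, pp = find_peaks(smoothed, prominence=min_prom)
    valleys, vp = find_peaks(-smoothed, prominence=min_prom)
    top_p = peaks[np.argsort(pp["prominences"])[::-1][:k]]
    top_v = valleys[np.argsort(vp["prominences"])[::-1][:k]]
    return np.sort(top_p), np.sort(top_v)
```

On a clean sinusoidal profile this recovers the maxima and minima; on real joint velocities the cutoff would need to be tuned to the frame rate.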

4.2. SEMANTIC MOVEMENT DESCRIPTOR

We now discuss how we compute the eight Factors $\{c_i\}_{i=1}^{8}$ (Table 1).

(1) $c_1$-Space Effort: $c_1$ ranges from direct (moving straight to the target) to indirect (not moving straight) Cui et al. (2019). Considering that the person mostly moves in a horizontal plane, $c_1$ at time $t$ is defined as the heading direction in the x-y plane:

Space: $c_1 = [\theta^t_x, \theta^t_y]^T$.

(2) $c_2$-Weight Effort: $c_2$ ranges from strong to light, and is estimated from the sum of the kinetic energy of the torso and distal body limbs (e.g., head, hands, feet). The higher the peak kinetic energy, the stronger the Weight. $c_2$ at time $t$ is defined as:

Weight: $c_2 = \sum_{j\in J} \alpha_j E_j(t) = \sum_{j\in J} \alpha_j v_j(t)^2$,

where $J$ is the set of body joints, $\alpha_j$ is the mass coefficient for each joint, and $v_j(t)^2$ is the square of the speed of joint $j$ at time $t$. Since $c_2$ of a body joint is influenced mainly by its speed Samadani et al. (2020), we set the mass coefficients to 1 for all body joints, as in Hachimura et al. (2005); Samadani et al. (2020).

(3) $c_3$-Time Effort: $c_3$ ranges from sudden to sustained. Sudden movements are characterized by large values in the acceleration sequence, compared to sustained movements characterized by zero acceleration. The acceleration of the $j$th body part at time $t$ is defined as the change in velocity per unit time ($\Delta t = 1$): $a_j(t) = v_j(t) - v_j(t-1)$. The sum of the accelerations of the torso and end-effectors is used to estimate $c_3$ for full-body movements:

Time: $c_3 = \sum_{j\in J} a_j(t)$.

(4) $c_4$-Flow Effort: $c_4$ ranges from free to bound, and is computed as the aggregated jerk, the third-order derivative of position, over a given time period $\Delta t$ (1 in our case) for the torso and end-effectors:

Flow: $c_4 = \sum_{j\in J} a_j(t) - a_j(t - \Delta t)$,

where $a_j(t)$ is the Cartesian acceleration of the $j$th body part at time $t$.
(5) $c_5$-Shape Shaping: $c_5$ primarily describes the concavity and convexity of the torso in the (i) vertical, (ii) horizontal, and (iii) sagittal planes Lamb (1965), capturing Rising/Sinking (vertical plane), Widening/Narrowing (horizontal plane), and Advancing/Retreating (sagittal plane) Lamb (1965). (i) is due to the torso's upward-downward displacement Dell (1977), and is quantified by its maximum value. (ii) is due to the torso's forward-backward displacement Dell (1977), and is quantified by its maximum value. (iii) is mainly sideward over the body. As in Dell (1977), we estimate $c_5$ as the area of the convex hull of the body's projection on the horizontal plane.

(6) $c_6$-Shape Directional: $c_6$ ranges from spoke-like to arc-like, describes the transverse behavior of the limb movements Dell (1977), and is captured as the curvature of the movement of the end joint of the limb in the 2D plane within which the largest displacement of the limb occurs. We estimate it as the 2D curvature within the extracted 2D (x-y) plane at time $t$:

Directional: $c_6 = \frac{\ddot{y}(t)\,\dot{x}(t) - \ddot{x}(t)\,\dot{y}(t)}{\left(\dot{x}^2(t) + \dot{y}^2(t)\right)^{3/2}}$,

where $\dot{x}(t)$ and $\ddot{x}(t)$ denote the first and second derivatives of the x trajectory at time $t$, respectively.

(7) $c_7$-Shape Flow: $c_7$ ranges from growing to shrinking. Dell (1977) suggests the use of the "reach space" for estimating $c_7$. The three areas of reach are: 1) near (knitting), 2) intermediate (gesturing), and 3) far (space reached by the whole arm when extended out of the body without locomotion). The limits of far reach are thus the limits of LMA's personal Kinesphere, the space around the body which can be reached without taking a step Dell (1977). We estimate $c_7$ as the maximum volume of the convex hull (bounding box) containing the stretched body and limbs.

(8) $c_8$-Body: $c_8$ captures the bending of joints throughout the body.
For the example of an arm, the bending degree $c_8$ is calculated from the shoulder (S), elbow (E), and wrist (W) joints, in terms of the two limb vectors $\overrightarrow{ES}$ and $\overrightarrow{EW}$:

$c_8 = \arccos \frac{\overrightarrow{ES} \cdot \overrightarrow{EW}}{|\overrightarrow{ES}|\,|\overrightarrow{EW}|}$.
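As a concrete sketch, the Effort factors $c_2$-$c_4$, the Shape factors $c_5$/$c_7$, and the Body factor $c_8$ can be computed from a joint-position array with finite differences and convex hulls. The function names, the array shapes, and the plane-to-axis conventions are our illustrative assumptions; the unit mass coefficients and $\Delta t = 1$ follow the text above.

```python
import numpy as np
from scipy.spatial import ConvexHull

def effort_factors(joints):
    """Weight (c2), Time (c3) and Flow (c4) from 3D joint trajectories.
    joints: (T, J, 3) positions; unit mass coefficients, Δt = 1."""
    vel = np.diff(joints, axis=0)                       # v_j(t)
    acc = np.diff(vel, axis=0)                          # a_j(t) = v_j(t) - v_j(t-1)
    jerk = np.diff(acc, axis=0)                         # a_j(t) - a_j(t-1)
    c2 = (np.linalg.norm(vel, axis=2) ** 2).sum(axis=1)   # sum of v_j(t)^2
    c3 = np.linalg.norm(acc, axis=2).sum(axis=1)          # summed acceleration magnitude
    c4 = np.linalg.norm(jerk, axis=2).sum(axis=1)         # aggregated jerk
    return c2, c3, c4

def shape_factors(pose):
    """Shaping (c5) as hull areas of the pose projected on three planes,
    and Shape Flow (c7) as the 3D hull volume. pose: (J, 3), one frame."""
    c5 = [ConvexHull(pose[:, dims]).volume              # 2-D hull .volume is its area
          for dims in [[0, 1], [0, 2], [1, 2]]]         # assumed plane conventions
    c7 = ConvexHull(pose).volume                        # 3-D Kinesphere volume
    return np.array(c5), c7

def body_factor(S, E, W):
    """Bending angle c8 at joint E between limb vectors E->S and E->W."""
    es, ew = np.subtract(S, E), np.subtract(W, E)
    cos = es.dot(ew) / (np.linalg.norm(es) * np.linalg.norm(ew))
    return np.arccos(np.clip(cos, -1.0, 1.0))           # clip guards rounding errors
```

For a body moving at constant velocity, `effort_factors` yields zero Time and Flow Effort, as the definitions require.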

4.3. BOUNDARY TRANSFORMER

The standard Transformer model is composed of an encoder and a decoder, with several feed-forward and multi-head attention layers. The multi-head self-attention layer models the interactions between the current frame and all other frames of a video sequence. To retain the low-level information of the video, we extract the I3D (Inflated 3D ConvNet) Carreira & Zisserman (2017) representation as an additional feature. I3D is a widely adopted 3D convolutional network trained on the Kinetics dataset, and its representation captures spatiotemporal information directly from videos. The I3D representations of all frames $x = \{x_t\}_{t=0}^{T-1} \in \mathbb{R}^{T\times 2048}$ Carreira & Zisserman (2017) are stacked with the 3D poses $P$ and the atomic and semantic movement representations $m$ and $c$, denoted as $h \in \mathbb{R}^{T\times d}$, and given as input to the boundary Transformer $\Phi_b$ as well as to the proposal Transformer $\Phi_p$. To encode the proposal-level long-term temporal dependency for boundary regression, and then the frame-level short-term dependency for proposal generation, the boundary Transformer $\Phi_b(h)$ generates the starting and ending boundary probability sequences $(p_s, p_e) = \{(p^t_s, p^t_e)\}_{t=1}^{T} \in \mathbb{R}^{T\times 2}$, which are further multiplied with $h$ to form the weighted input to $\Phi_p$. We construct $\Phi_b$ using the standard Transformer architecture described above. The encoder of $\Phi_b$ maps $h$, combined with a positional encoding Vaswani et al. (2017), to a hidden representation $h^b_{enc} \in \mathbb{R}^{T\times d}$. The decoder of $\Phi_b$ takes the hidden representation $h^b_{enc}$ as the query $Q$ and key $K$, takes a value $V \in \mathbb{R}^{d_b\times d}$ initialized with zeros, and outputs a global representation of the boundaries $h^b_{dec} \in \mathbb{R}^{d_b\times d}$, where $d$ is the feature vector size and $d_b$ is the number of queries. In $\Phi_b$, $d_b$ is the length of the sequence.
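The boundary-attentive weighting can be sketched in a few lines. The shapes are dummies, and combining $p_s$ and $p_e$ by addition before multiplying into $h$ is our illustrative assumption; the text states only that the probabilities are multiplied with $h$.

```python
import numpy as np

# Boundary probabilities re-weight the stacked input features before the
# proposal Transformer. All shapes and values below are dummies.
T, d = 6, 4
h = np.ones((T, d))                       # stacked features m ⌢ c ⌢ P ⌢ x
p_s = np.linspace(0.0, 1.0, T)            # starting-boundary probabilities
p_e = np.linspace(1.0, 0.0, T)            # ending-boundary probabilities
h_bar = h * (p_s + p_e)[:, None]          # boundary-attentive representation
```

Frames with high boundary probability keep large feature magnitudes, steering the proposal Transformer's attention toward likely action boundaries.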
Finally, a starting boundary head and an ending boundary head, each consisting of a multi-layer perceptron and a Sigmoid layer, are appended to generate the starting and ending probabilities $(p_s, p_e) \in \mathbb{R}^{T\times 2}$. The loss function of the boundary Transformer is defined as:

$L_b = -\frac{1}{T}\sum_{t=1}^{T}\left[ y^t_s \log p^t_s + (1-y^t_s)\log(1-p^t_s) \right] - \frac{1}{T}\sum_{t=1}^{T}\left[ y^t_e \log p^t_e + (1-y^t_e)\log(1-p^t_e) \right]$,

where $y^t_s$ and $y^t_e$ are the ground-truth boundary labels.

4.4. PROPOSAL TRANSFORMER

Unlike Tan et al. (2021), which uses an additional backbone network to generate and save boundary scores that are not trainable when training the proposal Transformer, we multiply the starting and ending probabilities $(p_s, p_e)$ from the boundary Transformer $\Phi_b$ with the input feature $h$ to generate boundary-attentive representations $\bar{h}$ as input to the proposal Transformer $\Phi_p$. Similar in architecture to the boundary Transformer $\Phi_b$, the proposal Transformer $\Phi_p$ takes the boundary-attentive representations $\bar{h}$ and a set of proposal queries $\in \mathbb{R}^{N_g\times d}$ as input, and outputs the proposal representations $h^p_{dec} \in \mathbb{R}^{N_g\times d}$, where $N_g$ is the number of proposal queries; the proposal queries are themselves learned during training, initialized randomly. Finally, a proposal head, a classification head and an IoU head are appended to generate a set of proposals $\hat{\Psi} = \{(\hat{t}^n_s, \hat{t}^n_e)\}_{n=1}^{N_g}$. The binary classification loss is defined as:

$L_{cls} = -\frac{1}{N_g}\sum_{n=1}^{N_g}\left[ y^n_{cls}\log p^n_{cls} + (1-y^n_{cls})\log(1-p^n_{cls}) \right]$,

where $y^n_{cls}$ denotes the ground-truth proposal classification label. An L1 regression loss refines the boundaries: $L_{loc} = \sum_{n=1}^{N_g} |\hat{t}^n_s - t^n_s| + |\hat{t}^n_e - t^n_e|$; a tIoU loss measures the overlap: $L_{tIoU} = \sum_{n=1}^{N_g} \left(1 - \mathrm{tIoU}(\psi^n, \hat{\psi}^n)\right)$; and an IoU prediction loss, as in Tan et al. (2021), is defined as $L_{IoU} = \sum_{n=1}^{N_g} \|p^n_{iou} - y^n_{iou}\|$. We follow Tan et al. (2021) to iteratively train the different heads.
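The four proposal-level losses can be sketched numerically as below. The pairing of predictions to ground truth is assumed to have been done already, the function names are ours, and the classification term uses the standard binary cross-entropy form.

```python
import numpy as np

def tiou(a, b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def proposal_losses(pred, gt, p_cls, y_cls, p_iou):
    """Classification, L1 location, tIoU and IoU-prediction losses for
    already-matched (N, 2) proposal/ground-truth segment pairs."""
    eps = 1e-7                                            # numerical safety in logs
    l_cls = -np.mean(y_cls * np.log(p_cls + eps)
                     + (1 - y_cls) * np.log(1 - p_cls + eps))
    l_loc = np.abs(pred - gt).sum()                       # L1 on start/end boundaries
    ious = np.array([tiou(p, g) for p, g in zip(pred, gt)])
    l_tiou = (1.0 - ious).sum()                           # overlap loss
    l_iou = np.abs(p_iou - ious).sum()                    # predicted vs. actual tIoU
    return l_cls, l_loc, l_tiou, l_iou
```

When predictions coincide with the ground truth and the predicted IoUs equal 1, the location, tIoU and IoU-prediction terms all vanish, leaving only the classification term.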

5. EXPERIMENTS

We use the PKU-MMD Chunhui et al. (2017), ActivityNet Fabian Caba Heilbron & Niebles (2015) and THUMOS14 Jiang et al. (2014) datasets. We use the Adam optimizer to train MatNet for 50 epochs, with a learning rate of 1e-4 and a batch size of 32. We use 3 encoding layers and 6 decoding layers for both the boundary and proposal Transformers. The sequence length $d_b$ is set to 100 and the number of proposals $N_g$ to be generated is 32. We use a step size of 8 to extract frame sequences. We set the weights $\alpha$, $\beta$, $\lambda$, $\omega$ and $\iota$ of the overall objective function to 1, 1, 5, 2 and 100, respectively.

5.1. EVALUATION METRICS

We use the metric AR@AN (Average Recall (AR) at Average Number of proposals (AN)) under specified temporal Intersection over Union (tIoU) thresholds, which are set to [0.5:0.05:1] for THUMOS14 and PKU-MMD, and [0.5:0.05:0.95] for ActivityNet. We also use the metric mAP (mean Average Precision under multiple tIoU thresholds). A predicted temporal segment is considered a true positive if it satisfies the tIoU threshold with a ground-truth segment of the same action label. The tIoU thresholds are set to {0.5, 0.75, 0.95} for ActivityNet, {0.3, 0.4, 0.5, 0.6, 0.7} for THUMOS14, and {0.1, 0.3, 0.5} for PKU-MMD. We use the classifier of UntrimmedNet Wang et al. (2017) to compute the mAP scores.
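The recall computation underlying AR@AN can be sketched as follows; the function name and the simplified one-to-many matching (any proposal may recall any ground truth) are our assumptions.

```python
import numpy as np

def average_recall(proposals, gts, thresholds):
    """Fraction of ground-truth segments recalled at each tIoU threshold,
    averaged over the thresholds (a simplified AR sketch)."""
    def tiou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0
    recalls = []
    for th in thresholds:
        # A ground truth counts as recalled if any proposal overlaps enough
        hit = sum(any(tiou(p, g) >= th for p in proposals) for g in gts)
        recalls.append(hit / len(gts))
    return float(np.mean(recalls))
```

AR@AN then averages this quantity while restricting `proposals` to the top-AN candidates per video.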

5.2. QUANTITATIVE RESULTS

5.3. ABLATION STUDIES

Tables 5, 6 and 7 show the results of ablation studies on the effectiveness of the 3D pose representation P, the atomic movement representation m and the semantic movement representation c on all the datasets, measured by AR@AN and mAP. The baseline takes only the I3D features x as input. The results show that using 3D pose improves performance significantly, possibly because the pose helps differentiate the frames containing human movement from background frames not containing any human. When P, c and m are added cumulatively, the performance improves steadily.

5.4. QUALITATIVE VISUALIZATIONS

Figure 2 shows the visualization of two sample results. During inference, we use bipartite matching, as in Tan et al. (2021), to select the top-1 proposals from all the generated candidate proposals; the matching is order-independent. Blue regions in the semantic movement features indicate recognized body part movements, and high values of the atomic movement features likewise represent large human movements. We can see that the generated proposals are well aligned with these input features.
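Order-independent selection of this kind can be implemented with the Hungarian algorithm on a cost matrix over proposal pairs. The segments and the plain L1 boundary cost below are illustrative stand-ins, not the exact matching cost of Tan et al. (2021).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

pred = np.array([[0.0, 0.4], [0.5, 0.9]])    # predicted (start, end) pairs, dummy values
gt = np.array([[0.55, 0.9], [0.0, 0.45]])    # ground-truth segments, dummy values
# Pairwise L1 distance between predicted and ground-truth boundaries
cost = np.abs(pred[:, None, :] - gt[None, :, :]).sum(-1)
rows, cols = linear_sum_assignment(cost)     # optimal one-to-one matching
```

Here the first prediction is matched to the second ground-truth segment and vice versa, regardless of the order in which proposals were emitted.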

6. LIMITATIONS AND FUTURE WORK

Our performance will suffer when pose information is missing, e.g., in many everyday videos containing complex human activities such as applying makeup, where only the face is visible and the activities involve subtle (e.g., finger or within-face) movements, or surfing, where the person occupies a small number of pixels. In the future, we plan to consider additional, action-appropriate modalities, such as audio, to enhance performance for the actions involved.



We will make all of the data sets, resources and programs publicly available.



Figure 1: Overview of our MatNet architecture. It contains two main components: (1) movement descriptors, shown in the middle column, and (2) Transformer networks for action proposal generation, shown in the right column. Given a sequence of untrimmed video frames, MatNet uses Laban Movement Analysis constructs to generate body-part (atomic) level and subset-of-parts (semantic) level descriptors of human movements, and inputs them as movement representations to action-boundary-sensitive Transformer networks to generate action proposals.

Many methods have been proposed to model local and global contextual information in videos Chéron et al. (2015); Gu et al. (2018); Zolfaghari et al. (2017); Choutas et al. (2018); Zhang et al. (2018); Asghari-Esfeden et al. (2020); Hsieh et al. (2022); Qing et al. (2021a). In addition, Xu et al. (2020); Chen et al. (2021) formulate the action detection problem as a sub-graph localization problem using a graph convolutional network (GCN). By providing more flexible and precise action proposals, TAPG can help to correct the results of temporal action segmentation or action recognition, and provides a foundation for other applications.

Transformer-Based Methods. RTD-Net Tan et al. (2021) uses a Transformer for proposal generation, weighing the input of the proposal Transformer by pre-calculated boundary scores. TAPGT Wang et al. (2021) adopts two Transformers to generate the boundaries and proposals in parallel. TadTR Liu et al. (2022b) uses a Transformer encoder-decoder architecture to adaptively map a small set of learned action query embeddings to corresponding action predictions. E2E-TAD Liu et al. (2022a) attaches a detection head to the last layer of the Transformer encoder, and optimizes the head and the video encoder simultaneously. ActionFormer Zhang et al. (2022) combines a multi-scale feature representation with local self-attention, and uses a lightweight decoder to classify every moment in time and estimate the corresponding action boundaries. Our MatNet integrates the boundary-attention strategy from Wang et al. (2021) and the boundary and proposal Transformers from Tan et al. (2021), directly multiplying the output of the boundary Transformer $\Phi_b$ with the input features of the proposal Transformer $\Phi_p$; together with our movement descriptors, this provides an end-to-end Movement-to-Action architecture with better modeling of long- and short-term dependencies.

The central theme of our approach is to represent movement in terms of the aforementioned eight quantitative, semantic descriptors, $c = \{c_i\}_{i=1}^{8}$, shown as different icons in the blue block (Semantic Movement Descriptor) in Fig. 1 and derivable from the mover's position, direction, rotation, velocity, acceleration, distance, curvature and volume. As shown in Fig. 1, our LMA-based MatNet is composed of four main parts: an atomic movement descriptor $F_a$, a semantic movement descriptor $F_s$, a boundary Transformer $\Phi_b$ and the final, proposal Transformer $\Phi_p$. Since these parts depend on 3D pose $P = \{P_t\}_{t=0}^{T-1}$, we first extract pose from the given video using LCRNet Rogez et al. (2019), which detects multi-person 3D poses in natural images. To capture motion, we use the I3D representation of each frame $x = \{x_t\}_{t=0}^{T-1}$, following Carreira & Zisserman (2017). Using these 3D pose and motion results, we implement the four parts as follows. The atomic movement descriptor $F_a$ estimates the movements $m = \{\{m^e_t\}_{e\in E}\}_{t=0}^{T-1}$ of each body part $e \in E$ from the trajectories $\{\{P^j_t\}_{j\in J_e}\}_{t=0}^{T-1}$ of all joints $j \in J_e$ connected to $e$. The semantic movement descriptor $F_s$ outputs kinematics-invariant LMA representations $c = \{c_t\}_{t=0}^{T-1}$ of the entire body. The sequence of extracted movement features $m$ and $c$, together with the 3D pose $P$ and the I3D features $x$, are concatenated as $h = m \frown c \frown P \frown x$ and taken as input to the boundary Transformer $\Phi_b$. The boundary Transformer $\Phi_b$ generates the start and end boundary probabilities $\{(p^t_s, p^t_e)\}_{t=1}^{T}$ that weigh the input feature sequence $h$; the weighted feature sequence $\bar{h}$ is taken as input to the proposal Transformer $\Phi_p$, which generates the action proposals $\hat{\Psi} = \{(\hat{t}^n_s, \hat{t}^n_e)\}_{n=1}^{N_g}$. The following subsections present details of the four parts.
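The concatenation $h = m \frown c \frown P \frown x$ above can be sketched with dummy shapes. The 45-d atomic representation and 2048-d I3D feature follow the text; the flattened-pose and LMA-stack dimensions (17 joints, 12-d) are our assumptions for illustration.

```python
import numpy as np

T = 8                               # number of frames (dummy)
m = np.zeros((T, 45))               # atomic movement representation (Sec. 4.1)
c = np.zeros((T, 12))               # stacked LMA factors, assumed 12-d
P = np.zeros((T, 17 * 3))           # 17 assumed joints x 3D coordinates, flattened
x = np.zeros((T, 2048))             # per-frame I3D features
h = np.concatenate([m, c, P, x], axis=1)   # (T, d) input to the boundary Transformer
```

Each frame thus carries one row mixing pose, movement, and appearance information, which the boundary and proposal Transformers attend over jointly.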

In addition to the proposals $\hat{\Psi} = \{(\hat{t}^n_s, \hat{t}^n_e)\}_{n=1}^{N_g}$, which should be close to the ground-truth action proposals $\Psi = \{(t^n_s, t^n_e)\}_{n=1}^{N}$, the heads output a set of proposal classification scores $\{p^n_{cls}\}_{n=1}^{N_g}$ and predicted IoUs $\{p^n_{iou}\}_{n=1}^{N_g}$. The loss function of the proposal Transformer combines a binary classification loss, an L1 boundary regression loss, a tIoU loss and an IoU prediction loss, as defined in Sec. 4.4.

4.5. OBJECTIVE FUNCTION

The overall objective function of the proposed MatNet is defined as a weighted sum of the boundary and proposal Transformer losses in Secs. 4.3 and 4.4:

$L = \alpha L_b + \beta L_{cls} + \lambda L_{loc} + \omega L_{tIoU} + \iota L_{IoU}$,

where the weight values are chosen experimentally.
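With the weights α=1, β=1, λ=5, ω=2, ι=100 reported in Sec. 5, the objective reduces to a plain weighted sum; the loss values below are dummies for illustration.

```python
# Weighted combination of the five MatNet losses (dummy loss values).
losses = {"b": 0.2, "cls": 0.3, "loc": 0.1, "tiou": 0.05, "iou": 0.01}
weights = {"b": 1.0, "cls": 1.0, "loc": 5.0, "tiou": 2.0, "iou": 100.0}  # from Sec. 5
total = sum(weights[k] * losses[k] for k in losses)  # L = αL_b + βL_cls + λL_loc + ωL_tIoU + ιL_IoU
```

Note how the large ι places heavy emphasis on calibrating the predicted IoUs, which are later used to rank proposals.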

Figure 2: Qualitative visualization of the generated proposals (row 2) with corresponding poses (row 3), semantic movement features (row 4), atomic movement features (row 5), and attention map of the starting (row 6) and ending (row 7) boundaries on two samples from THUMOS14 Jiang et al. (2014) and PKU-MMD Chunhui et al. (2017) datasets.



THUMOS14 has 413 temporally untrimmed videos. ActivityNet contains about 200 activity classes, with 10k training videos, 5k validation videos and 5k test videos. PKU-MMD contains 1076 videos of 51 action categories.

Comparison of proposal generation results using AR@AN and mAP on THUMOS14.

Analogous results for the ActivityNet Dataset

Analogous results for the PKU-MMD dataset Chunhui et al. (2017).

To evaluate the quality of the generated proposals, we calculate AR@AN and mAP on THUMOS14, ActivityNet and PKU-MMD, respectively. Tables 2, 3 and 4 show AR@AN and mAP under different tIoU thresholds. For a fair comparison, we retrained Tan et al. (2021) on the PKU-MMD dataset; the AR@AN and mAP scores of the other state-of-the-art methods are taken from their papers. Our MatNet achieves superior results on THUMOS14 and PKU-MMD, and comparable performance on ActivityNet. Specifically, on the PKU-MMD dataset, MatNet outperforms the other methods, implying that MatNet works best on indoor datasets, where poses are clearer and less noisy. In addition, on THUMOS14, MatNet achieves mAP scores comparable to the state-of-the-art methods under high tIoU, indicating that the proposals generated by MatNet have more precise boundaries and are robust to occlusion and multi-person scenarios. Moreover, although the annotations of ActivityNet are sparse (about 1.41 activity instances per video), MatNet still achieves comparable results. Finally, our method improves AR@AN and mAP over our baseline Tan et al. (2021) at all AN and tIoU thresholds by a large margin, demonstrating that generating boundary scores with a jointly trained boundary Transformer, rather than a pre-trained model, boosts performance.

Ablation study of different combinations of the components in MatNet using AR@AN and mAP on THUMOS14.

Analogous results for the ActivityNet.

Analogous results for PKU-MMD Chunhui et al. (2017).

A CODE

We have made our training and inference code available to the reviewers in the submission zip file.

B DEMO VIDEOS

The evaluation presented in the main paper (Sec. 5, Fig. 2) is in terms of statistics of matches between the proposals generated by our method and the ground truth. We evaluate the generated proposals in terms of their match with the ground-truth number and locations of detected actions, and their starting and ending frames. Here we present some representative video clips identified by our method within a larger range of actions. This shows the correspondence of our generated proposals with the sequences of frames associated with the proposals. We mark each frame of a test video with all the generated and ground-truth action proposals. Viewing these videos brings out the cases where non-existing actions are incorrectly detected and existing actions are missed. The demo videos are selected from two sources: first, the publicly available datasets used in the paper (THUMOS14, ActivityNet and PKU-MMD); second, videos taken from outside these datasets, to evaluate the out-of-distribution performance of our method. These demo videos can be found in the submission zip file. All of the video clips contain at least one of the following three scenarios: (1) clips with no humans in them, (2) clips with humans without movement, and (3) clips with moving humans. From the results shown on these videos, and corresponding to the statistics of results in the main paper, we see that: (1) our model can identify the clips with no humans in them, a result due to our human movement descriptors; (2) we also detect and skip the clips where the person is stationary, very likely due to the integration of human detection with our semantic descriptors of human movement (which help distinguish human movement from a still human plus background movement); and (3) we are able to detect and segment the clips with human action, and distinguish the actions therein.
Although we do not aim at recognizing which specific action is performed, our model can distinguish between different human actions, and between human and background movements.

