FEATURE INTEGRATION AND GROUP TRANSFORMERS FOR ACTION PROPOSAL GENERATION

Paper ID: 2544 — Paper under double-blind review

Abstract

The task of temporal action proposal generation (TAPG) aims to provide high-quality video segments, i.e., proposals that potentially contain action events. Performance on the TAPG task depends heavily on two key issues: feature representation and the scoring mechanism. To account for both aspects simultaneously, we introduce an attention-based model, termed FITS, to retrieve high-quality proposals. We first propose a novel Feature-Integration (FI) module that seamlessly fuses two-stream features, modeling their interaction to yield a robust video-segment representation. We then design a group of Transformer-driven Scorers (TS) that gather temporal contextual support over these representations to estimate the starting or ending boundary of an action event. Unlike most previous work, which estimates action boundaries without considering the long-range temporal neighborhood, the action-boundary co-estimation mechanism in TS leverages bi-directional contextual support for boundary estimation, which removes several false-positive boundary predictions. We conduct experiments on two challenging datasets, ActivityNet-1.3 and THUMOS-14. The experimental results demonstrate that the proposed FITS model consistently outperforms state-of-the-art TAPG methods.

1. INTRODUCTION

Owing to the rapid development of digital cameras and online video services, the fast-growing volume of video data encourages research on video content analysis. Applications of interest include video summarization (Yao et al., 2015; 2016), captioning (Chen et al., 2019a; Chen & Jiang, 2019), grounding (Chen et al., 2019b), and temporal action detection (Gao et al., 2019; Zhang et al., 2019). Temporal action detection is an important topic related to several video content analysis methods; it aims to detect human-action instances within untrimmed long video sequences. Like image object detection, temporal action detection can be separated into a temporal action proposal generation (TAPG) stage and an action classification stage. Recent studies (Escorcia et al., 2016; Buch et al., 2017b; Lin et al., 2018; Liu et al., 2019; Lin et al., 2019; 2020) demonstrate that pursuing proposal quality clearly improves the performance of two-stage temporal action detectors. To this end, a temporal action proposal generator is expected to capture the ground-truth action instances at a high recall rate with a limited number of proposals, hence reducing the burden of the succeeding action classification stage.

One popular way to tackle the TAPG task is to generate proposals via estimates of boundary and actioness probabilities. The boundary probability is usually factorized into the starting and ending probabilities of an action instance. Rather than directly estimating the action boundaries as existing methods do, we leverage the actioness estimation together with an additional background estimation in a bi-directional temporal manner to co-estimate the action boundaries; here, background refers to the absence of action. This form of boundary estimation is derived from the observation that the features describing long-lasting actioness/background segments are more consistent along the temporal dimension than those of the short-lived starting/ending moments.
Therefore, estimating boundaries with the actioness and background features allows us to obtain proposal boundaries with far fewer false positives, yielding high-quality proposal candidates for further scoring. In practice, we mark an action starting boundary at a time of descending background probability with simultaneously ascending actioness; conversely, an action ending boundary occurs at ascending background with descending actioness. Figure 1 illustrates our action-boundary co-estimation mechanism.
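The co-estimation rule above can be sketched in a few lines. The following is a minimal illustrative implementation, not the paper's actual inference code: the function name, the symmetric finite-difference test, and the threshold `delta` are all hypothetical choices for exposition. Given per-snippet actioness and background probability sequences, it flags a starting candidate where background descends while actioness ascends, and an ending candidate in the reverse case.

```python
import numpy as np

def co_estimate_boundaries(actioness, background, delta=0.05):
    """Flag boundary candidates from per-snippet probability sequences.

    A snippet t is a starting candidate when, across the window
    [t-1, t+1], actioness rises while background falls; an ending
    candidate is the reverse. `delta` is a hypothetical minimum-change
    threshold to suppress flat, noisy regions.
    """
    actioness = np.asarray(actioness, dtype=float)
    background = np.asarray(background, dtype=float)
    starts, ends = [], []
    for t in range(1, len(actioness) - 1):
        d_act = actioness[t + 1] - actioness[t - 1]  # actioness trend
        d_bg = background[t + 1] - background[t - 1]  # background trend
        if d_act > delta and d_bg < -delta:
            starts.append(t)  # ascending actioness, descending background
        elif d_act < -delta and d_bg > delta:
            ends.append(t)    # descending actioness, ascending background
    return starts, ends
```

On a toy sequence whose actioness ramps up, plateaus, and ramps down (with background as its complement), the rising edge is flagged as starting candidates and the falling edge as ending candidates; a scoring stage would then rank the resulting start/end pairs.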

