FEATURE INTEGRATION AND GROUP TRANSFORMERS FOR ACTION PROPOSAL GENERATION

Paper ID: 2544. Paper under double-blind review.

Abstract

The task of temporal action proposal generation (TAPG) aims to provide high-quality video segments, i.e., proposals that potentially contain action events. Performance on the TAPG task heavily depends on two key issues: feature representation and the scoring mechanism. To address both aspects simultaneously, we introduce an attention-based model, termed FITS, for retrieving high-quality proposals. We first propose a novel Feature-Integration (FI) module that seamlessly fuses two-stream features by modeling their interaction, yielding a robust video-segment representation. We then design a group of Transformer-driven Scorers (TS) that gather temporal contextual support over these representations to estimate the starting and ending boundaries of an action event. Unlike most previous work, which estimates action boundaries without considering the long-range temporal neighborhood, the proposed action-boundary co-estimation mechanism in TS leverages bi-directional contextual support for boundary estimation, which helps remove many false-positive boundary predictions. We conduct experiments on two challenging datasets, ActivityNet-1.3 and THUMOS-14. The experimental results demonstrate that the proposed FITS model consistently outperforms state-of-the-art TAPG methods.

1. INTRODUCTION

Owing to the rapid development of digital cameras and online video services, the fast-growing volume of video data has spurred research on video content analysis. Applications of interest include video summarization (Yao et al., 2015; 2016), captioning (Chen et al., 2019a; Chen & Jiang, 2019), grounding (Chen et al., 2019b), and temporal action detection (Gao et al., 2019; Zhang et al., 2019). Temporal action detection is an important topic related to several video content analysis tasks; it aims to detect human-action instances within untrimmed, long video sequences. Like image object detection, temporal action detection can be separated into a temporal action proposal generation (TAPG) stage and an action classification stage. Recent studies (Escorcia et al., 2016; Buch et al., 2017b; Lin et al., 2018; Liu et al., 2019; Lin et al., 2019; 2020) demonstrate that improving proposal quality clearly improves the performance of two-stage temporal action detectors. To this end, a temporal action proposal generator is expected to capture the ground-truth action instances at a high recall rate with a limited number of proposals, hence reducing the burden on the succeeding action classification stage. One popular way to tackle the TAPG task is to generate proposals via estimates of boundary and actioness probabilities, where the boundary probability is usually factorized into starting and ending probabilities for an action instance. Rather than directly estimating the action boundaries as existing methods do, we leverage the actioness estimation and an additional background estimation, in a bi-directional temporal manner, to co-estimate the action boundaries; here, background means that no action is occurring. This form of boundary estimation is derived from the observation that features describing the long-lasting actioness/background are more temporally consistent than those describing the short-lived starting/ending moments.
Therefore, estimating the boundaries with the actioness and background features allows us to estimate proposal boundaries with far fewer false positives, hence obtaining high-quality proposal candidates for further scoring. In practice, we estimate an action starting boundary as the time at which the background probability descends while the actioness probability simultaneously ascends; conversely, an action ending boundary occurs when the background probability ascends while the actioness probability descends. Figure 1 illustrates our action-boundary co-estimation mechanism. This paper introduces an effective temporal action proposal generator, FITS, which aims to provide action proposals that precisely and exhaustively cover the human-action instances. Considering the two essential TAPG issues, namely feature representation and the scoring mechanism, together with the above-mentioned action-boundary co-estimation, our attention-based FITS model comprises a Feature-Integration (FI) module and a Transformer-driven Scorers (TS) module. Specifically, our FI module enhances the common TAPG two-stream features (Simonyan & Zisserman, 2014; Wang et al., 2016; Xiong et al., 2016) by modeling the feature interaction. Previous TAPG methods usually concatenate the appearance-stream and motion-stream features directly. In contrast, inspired by non-local attention mechanisms (Wang et al., 2018; Hsieh et al., 2019), we extend such a long-range attention mechanism to integrate the two-stream features. Our experiments show that the integrated features are more robust because their mutual feature discrepancies are reduced. More importantly, to score the temporal action proposals and discriminate high-quality ones, we devise a novel transformer-driven scoring mechanism.
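To make the non-local-style two-stream integration concrete, the following is a minimal sketch of how one stream can attend to the other over the full temporal range before the streams are combined. It is an illustrative assumption, not the paper's actual FI implementation: the function names, the scaled dot-product attention form, and the residual fusion are our choices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def integrate_two_streams(app, mot):
    """Non-local-style integration of two-stream features (sketch).

    app, mot: (T, C) appearance and motion feature sequences.
    Each temporal position of one stream attends to every position
    of the other stream; the attended context is added back
    residually, so each stream is enriched by the other before
    the two are concatenated.
    """
    scale = np.sqrt(app.shape[-1])
    attn_a = softmax(app @ mot.T / scale, axis=-1)  # (T, T) attention weights
    attn_m = softmax(mot @ app.T / scale, axis=-1)  # (T, T)
    fused_app = app + attn_a @ mot  # appearance enriched by motion context
    fused_mot = mot + attn_m @ app  # motion enriched by appearance context
    return np.concatenate([fused_app, fused_mot], axis=-1)  # (T, 2C)
```

In a learned version, the queries, keys, and values would pass through trainable projections; the sketch keeps only the long-range cross-stream attention structure.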
The TS mechanism leverages temporal contextual support over the feature representations to obtain self-attended representations, and then associates these self-attended representations to co-estimate the action boundaries. The experiments show that the retrieved action proposals contain far fewer false positives. Figure 2 overviews our temporal action proposal model, termed the FITS network. To sum up, our main contributions are threefold: i) we introduce a novel feature-integration module that integrates the two-stream features via non-local-style attention, reducing their feature discrepancies and yielding a robust representation; ii) we devise a novel transformer-driven scorers module that co-estimates action boundaries from transformer-driven self-attended representations, leveraging long-range temporal contextual support to retrieve high-quality temporal action proposals; and iii) extensive experiments demonstrate that the proposed FITS model performs significantly better than current state-of-the-art TAPG methods.
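The co-estimation rule described above (a start where background descends and actioness ascends, an end where background ascends and actioness descends) can be sketched as follows. This is an illustrative simplification under our own assumptions: temporal differences of the probability sequences stand in for the learned forward/backward estimations produced by the TS module, and the function name is hypothetical.

```python
import numpy as np

def co_estimate_boundaries(actioness, background):
    """Co-estimate starting/ending scores from per-snippet
    actioness and background probability sequences (sketch).

    A start is signaled where background descends while actioness
    ascends; an end is signaled where background ascends while
    actioness descends.
    """
    # Temporal differences as a proxy for ascending/descending trends.
    d_act = np.diff(actioness, prepend=actioness[:1])
    d_bg = np.diff(background, prepend=background[:1])
    # Start: actioness rising AND background falling (both required).
    start = np.clip(d_act, 0.0, None) * np.clip(-d_bg, 0.0, None)
    # End: actioness falling AND background rising.
    end = np.clip(-d_act, 0.0, None) * np.clip(d_bg, 0.0, None)
    return start, end
```

Because both conditions must hold at once, a fluctuation in only one of the two sequences yields a near-zero score, which mirrors how the co-estimation suppresses false-positive boundary predictions.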

2. RELATED WORK

Feature Representation. As a de facto trend, neural-network-based features have largely replaced handcrafted features for the action classification task. Popular approaches include two-stream networks (Simonyan & Zisserman, 2014; Feichtenhofer et al., 2016; Wang et al., 2016), which represent the appearance feature and the motion feature separately, and 3D networks (Tran et al., 2015; Carreira & Zisserman, 2017; Qiu et al., 2017; Xu et al., 2017), which directly represent a video sequence as a spatio-temporal feature. In this paper, we use an action recognition model (Wang et al., 2016; Xiong et al., 2016) to extract two-stream features for representing each untrimmed video sequence.



Figure 1: The proposed action-boundary co-estimation mechanism. Our transformer-driven scorers module produces four estimations: forward-actioness, backward-actioness, forward-background, and backward-background. The backward-actioness and forward-background estimations co-estimate the action starting, while the forward-actioness and backward-background estimations co-estimate the action ending. The transformer-style units enable these temporal estimations to collect temporal contextual support over each input representation. The right-most figures show the estimations of action starting and action ending without (top) and with (bottom) the TS co-estimation. The proposed action-boundary co-estimation within our TS module reduces false-positive predictions.

