CROSS-ATTENTIONAL AUDIO-VISUAL FUSION FOR WEAKLY-SUPERVISED ACTION LOCALIZATION

Abstract

Temporally localizing actions in videos is a key component of video understanding. Learning from weakly-labeled data is seen as a potential solution to avoiding expensive frame-level annotations. Unlike prior works that depend only on the visual modality, we propose to learn a richer audio-visual representation for weakly-supervised action localization. First, we propose a multi-stage cross-attention mechanism that collaboratively fuses audio and visual features while preserving the intra-modal characteristics. Second, to model both foreground and background frames, we construct an open-max classifier that treats the background class as an open set. Third, for precise action localization, we design consistency losses that enforce temporal continuity of the action-class predictions and improve the reliability of foreground predictions. Extensive experiments on two publicly available video datasets (AVE and ActivityNet 1.2) show that the proposed method effectively fuses the audio and visual modalities and achieves state-of-the-art results for weakly-supervised action localization.

1. INTRODUCTION

The goal of this paper is to temporally localize actions and events of interest in videos with weak supervision. In the weakly-supervised setting, only video-level labels are available during training, which avoids expensive and time-consuming frame-level annotation. This task is of great importance for video analytics and understanding. Several weakly-supervised methods have been developed for it (Nguyen et al., 2018; Paul et al., 2018; Narayan et al., 2019; Shi et al., 2020; Jain et al., 2020) and considerable progress has been made. However, these methods exploit only visual information; the audio modality has been mostly overlooked. Audio and visual data often depict actions from different viewpoints (Guo et al., 2019). Therefore, we propose to explore a joint audio-visual representation to improve temporal action localization in videos. A few existing works (Tian et al., 2018; Lin et al., 2019; Xuan et al., 2020) have attempted to fuse the audio and visual modalities to localize audio-visual events. These methods have shown promising results; however, such audio-visual events are essentially actions with strong audio cues, such as playing guitar or dog barking. In contrast, we aim to localize a wider range of actions related to sports, exercise, eating, etc. Such actions can have a weak audio aspect or can even be devoid of informative audio (e.g., with unrelated background music). Therefore, a key challenge is to fuse audio and visual data in a way that leverages their mutually complementary nature while maintaining modality-specific information. To address this challenge, we propose a novel multi-stage cross-attention mechanism. It progressively learns features from each modality over multiple stages. Inter-modal interaction is allowed at each stage only through cross-attention, and only at the last stage are the visually-aware audio features and the audio-aware visual features concatenated.
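The fusion scheme just described can be sketched in a few lines of numpy. This is an illustrative toy implementation, not the paper's exact formulation: the scaled dot-product attention form, the residual update, and the two-stage depth are our assumptions for the sketch, and learned projection weights are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Each query snippet attends over all context snippets of the other
    # modality; output has the same shape as `queries`.
    d = queries.shape[-1]
    attn = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return attn @ context

def multi_stage_fusion(audio, visual, num_stages=2):
    # Each stage refines both streams with cross-attended features from the
    # other modality (residual update keeps intra-modal characteristics);
    # only after the last stage are the two streams concatenated.
    a, v = audio, visual
    for _ in range(num_stages):
        a, v = a + cross_attention(a, v), v + cross_attention(v, a)
    return np.concatenate([a, v], axis=-1)  # per-snippet audio-visual feature

# Toy example: 8 snippets, 16-dim features per modality.
rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 16))
visual = rng.standard_normal((8, 16))
fused = multi_stage_fusion(audio, visual)
print(fused.shape)  # (8, 32)
```

Because the cross-attention output is added residually, each modality's own features persist through the stages, which matches the stated goal of preserving modality-specific information while allowing inter-modal interaction.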
Thus, an audio-visual feature representation is obtained for each snippet in a video. Separating background from actions/events is a common problem in temporal localization. To this end, we also propose: (a) foreground-reliability estimation and classification via an open-max classifier, and (b) temporal-continuity losses. First, for each video snippet, an open-max classifier predicts

