CROSS-ATTENTIONAL AUDIO-VISUAL FUSION FOR WEAKLY-SUPERVISED ACTION LOCALIZATION

Abstract

Temporally localizing actions in videos is one of the key components of video understanding. Learning from weakly-labeled data is seen as a potential solution towards avoiding expensive frame-level annotations. Different from other works, which depend only on the visual modality, we propose to learn a richer audio-visual representation for weakly-supervised action localization. First, we propose a multi-stage cross-attention mechanism that collaboratively fuses audio and visual features while preserving the intra-modal characteristics. Second, to model both foreground and background frames, we construct an open-max classifier that treats the background class as an open set. Third, for precise action localization, we design consistency losses that enforce temporal continuity of the action-class prediction and also improve foreground-prediction reliability. Extensive experiments on two publicly available video datasets (AVE and ActivityNet1.2) show that the proposed method effectively fuses audio and visual modalities and achieves state-of-the-art results for weakly-supervised action localization.

1. INTRODUCTION

The goal of this paper is to temporally localize actions and events of interest in videos with weak supervision. In the weakly-supervised setting, only video-level labels are available during the training phase, avoiding expensive and time-consuming frame-level annotation. This task is of great importance for video analytics and understanding. Several weakly-supervised methods have been developed for it (Nguyen et al., 2018; Paul et al., 2018; Narayan et al., 2019; Shi et al., 2020; Jain et al., 2020) and considerable progress has been made. However, only visual information has been exploited for this task, and the audio modality has been mostly overlooked. Audio and visual data often depict actions from different viewpoints (Guo et al., 2019). Therefore, we propose to exploit a joint audio-visual representation to improve temporal action localization in videos. A few existing works (Tian et al., 2018; Lin et al., 2019; Xuan et al., 2020) have attempted to fuse audio and visual modalities to localize audio-visual events. These methods have shown promising results; however, audio-visual events are essentially actions that have strong audio cues, such as playing guitar or dog barking. In contrast, we aim to localize a wider range of actions related to sports, exercise, eating, etc. Such actions can have a weak audio aspect and/or can be devoid of informative audio (e.g., with unrelated background music). Therefore, a key challenge is to fuse audio and visual data in a way that leverages their mutually complementary nature while maintaining the modality-specific information. To address this challenge, we propose a novel multi-stage cross-attention mechanism. It progressively learns features from each modality over multiple stages. Inter-modal interaction is allowed at each stage only through cross-attention, and only at the last stage are the visually-aware audio features and audio-aware visual features concatenated.
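As a rough illustration of the idea, the following NumPy sketch implements a generic multi-stage cross-attention fusion under simplifying assumptions (a shared feature dimension, no learned projections, a simple residual update); the function names and the exact update rule are illustrative, not the paper's precise formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_stage(a, v):
    """One illustrative cross-attention stage.
    a: (T, d) audio features, v: (T, d) visual features, one row per snippet.
    Each modality queries the temporal sequence of the other; a residual
    connection keeps the output in the querying modality's own feature
    space, preserving intra-modal characteristics."""
    scale = np.sqrt(a.shape[1])
    a2v = softmax(a @ v.T / scale, axis=1)  # (T, T): audio queries the visual sequence
    v2a = softmax(v @ a.T / scale, axis=1)  # (T, T): visual queries the audio sequence
    a_out = a + a2v @ v                     # visually-aware audio features
    v_out = v + v2a @ a                     # audio-aware visual features
    return a_out, v_out

def fuse(a, v, stages=3):
    """Multi-stage fusion: the modalities interact only via cross-attention
    at each stage; concatenation happens only after the final stage."""
    for _ in range(stages):
        a, v = cross_attention_stage(a, v)
    return np.concatenate([a, v], axis=1)   # (T, 2d) audio-visual snippet features
```

Note that the two streams remain separate throughout the stages, so each modality retains its own representation until the final concatenation.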
Thus, an audio-visual feature representation is obtained for each snippet of the video. Separating background from actions/events is a common problem in temporal localization. To this end, we also propose: (a) foreground reliability estimation and classification via an open-max classifier, and (b) temporal continuity losses. First, for each video snippet, an open-max classifier, composed of two parallel branches for action classification and foreground reliability estimation, predicts scores for the action and background classes. Second, for precise action localization with weak supervision, we design temporal consistency losses to enforce temporal continuity of the action-class prediction and the foreground reliability. We demonstrate the effectiveness of the proposed method for weakly-supervised localization of both audio-visual events and actions. Extensive experiments are conducted on two video datasets for localizing audio-visual events (AVE¹) and actions (ActivityNet1.2²). To the best of our knowledge, this is the first attempt to exploit audio-visual fusion for temporal localization of unconstrained actions in long videos.
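The two-branch scoring and the temporal-consistency terms can be sketched as follows; the random weights, the background score computed as `1 - fg`, and the absolute-difference penalty are illustrative stand-ins under stated assumptions, not the paper's exact open-max formulation or loss functions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n_classes = 8, 16, 5
feats = rng.standard_normal((T, d))           # one audio-visual feature per snippet

# Hypothetical linear weights for the two parallel branches.
W_cls = rng.standard_normal((d, n_classes))   # action-classification branch
w_fg = rng.standard_normal(d)                 # foreground-reliability branch

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

cls_scores = softmax(feats @ W_cls, axis=1)   # (T, C) per-snippet action probabilities
fg = sigmoid(feats @ w_fg)                    # (T,) per-snippet foreground reliability

# Open-max-style snippet scores: action-class scores scaled by foreground
# reliability, with background treated as the "open" remainder.
bg = 1.0 - fg
snippet_scores = np.concatenate([cls_scores * fg[:, None], bg[:, None]], axis=1)

# Temporal-consistency penalties: discourage abrupt changes between
# neighbouring snippets in both the class predictions and the reliability.
def tv_loss(x):
    return np.abs(np.diff(x, axis=0)).mean()

loss_consistency = tv_loss(cls_scores) + tv_loss(fg[:, None])
```

With this construction, each snippet's action and background scores form a proper distribution, and the consistency term is zero exactly when neighbouring snippets agree.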

2. RELATED WORK

Our work relates to the tasks of localizing actions and events in videos, as well as to multi-modal representation learning.

Weakly-supervised action localization:

Wang et al. (2017) and Nguyen et al. (2018) employed multiple-instance learning (Dietterich et al., 1997) along with an attention mechanism to localize actions in videos. Paul et al. (2018) introduced a co-activity similarity loss that looks for similar temporal regions in a pair of videos containing a common action class. Narayan et al. (2019) proposed a center loss for the discriminability of action categories at the global level and a counting loss for the separability of instances at the local level. To alleviate the confusion caused by background (non-action) segments, Nguyen et al. (2019) developed a top-down class-guided attention to model the background, and Yu et al. (2019) exploited temporal relations among video segments. Jain et al. (2020) segmented a video into interpretable fragments, called ActionBytes, and used them effectively for action proposals. To distinguish action from context (near-action) snippets, Shi et al. (2020) designed a class-agnostic frame-wise probability conditioned on the attention using a conditional variational auto-encoder. Luo et al. (2020) proposed an expectation-maximization multi-instance learning framework in which the key instance is modeled as a hidden variable. All these works have explored various ways to temporally differentiate action instances from the near-action background by exploiting only the visual modality, whereas we additionally utilize the audio modality for the same objective.

Audio-visual event localization:

The task of audio-visual event localization, as defined in the literature, is to classify each time-step into one of the event classes or background. This differs from action localization, where the goal is to determine the start and the end of each instance of a given action class. In (Tian et al., 2018), a network with audio-guided attention was proposed, which showed promising results for audio-visual event localization and cross-modality synchronized event localization. To utilize both global and local cues in event localization, Lin et al. (2019) conducted audio-visual fusion at both the video level and the snippet level using multiple LSTMs. Assuming single-event videos, Wu et al. (2019) detected the event-related snippets by matching the video-level feature of one modality with the snippet-level feature sequence of the other modality. In contrast, our cross-attention operates over the temporal sequences of both modalities and does not assume single-action videos. To address the temporal inconsistency between audio and visual modalities, Xuan et al. (2020) devised a modality sentinel, which filters out the event-unrelated modalities. Encouraging results have been reported; however, the localization capability of these methods has been demonstrated only on short, fixed-length videos with distinct audio cues. Differently, we aim to fuse audio and visual modalities in order to also localize actions in long, untrimmed and unconstrained videos.

Deep multi-modal representation learning:

Multi-modal representation learning methods aim to obtain a powerful representation from multiple modalities (Guo et al., 2019). With the advancement of deep learning, many deep multi-modal representation learning approaches have been developed. Several methods fused features from different modalities in a joint subspace by outer product (Zadeh et al., 2017), bilinear pooling (Fukui et al., 2016), and statistical regularization (Aytar et al., 2017). The encoder-decoder framework has also been exploited for multi-modal learning, for image-to-image translation (Huang et al., 2018) and to produce musical translations (Mor et al., 2018).

¹ https://github.com/YapengTian/AVE-ECCV18
² http://activity-net.org/download.html

