CROSS-ATTENTIONAL AUDIO-VISUAL FUSION FOR WEAKLY-SUPERVISED ACTION LOCALIZATION

Abstract

Temporally localizing actions in videos is a key component of video understanding. Learning from weakly-labeled data is seen as a potential solution to avoid expensive frame-level annotations. Unlike other works, which depend only on the visual modality, we propose to learn a richer audio-visual representation for weakly-supervised action localization. First, we propose a multi-stage cross-attention mechanism to collaboratively fuse audio and visual features, which preserves the intra-modal characteristics. Second, to model both foreground and background frames, we construct an open-max classifier that treats the background class as an open set. Third, for precise action localization, we design consistency losses that enforce temporal continuity of the action-class prediction and improve foreground-prediction reliability. Extensive experiments on two publicly available video datasets (AVE and ActivityNet1.2) show that the proposed method effectively fuses the audio and visual modalities and achieves state-of-the-art results for weakly-supervised action localization.

1. INTRODUCTION

The goal of this paper is to temporally localize actions and events of interest in videos with weak supervision. In the weakly-supervised setting, only video-level labels are available during the training phase, avoiding expensive and time-consuming frame-level annotation. This task is of great importance for video analytics and understanding. Several weakly-supervised methods have been developed for it (Nguyen et al., 2018; Paul et al., 2018; Narayan et al., 2019; Shi et al., 2020; Jain et al., 2020) and considerable progress has been made. However, only visual information has been exploited for this task, and the audio modality has been mostly overlooked, even though audio and visual data often depict actions from different viewpoints (Guo et al., 2019). Therefore, we propose to explore a joint audio-visual representation to improve temporal action localization in videos. A few existing works (Tian et al., 2018; Lin et al., 2019; Xuan et al., 2020) have attempted to fuse the audio and visual modalities to localize audio-visual events. These methods have shown promising results; however, audio-visual events are essentially actions with strong audio cues, such as playing guitar or a dog barking. In contrast, we aim to localize a wider range of actions related to sports, exercise, eating, etc. Such actions can have a weak audio aspect and/or can be devoid of informative audio (e.g. with unrelated background music). Therefore, a key challenge is to fuse audio and visual data in a way that leverages their mutually complementary nature while maintaining the modality-specific information. To address this challenge, we propose a novel multi-stage cross-attention mechanism. It progressively learns features from each modality over multiple stages. Inter-modal interaction is allowed at each stage only through cross-attention, and only at the last stage are the visually-aware audio features and the audio-aware visual features concatenated.
Thus, an audio-visual feature representation is obtained for each snippet of a video. Separating background from actions/events is a common problem in temporal localization. To this end, we also propose: (a) foreground reliability estimation and classification via an open-max classifier, and (b) temporal continuity losses. First, for each video snippet, an open-max classifier, composed of two parallel branches for action classification and foreground reliability estimation, predicts scores for the action and background classes. Second, for precise action localization with weak supervision, we design temporal consistency losses to enforce temporal continuity of the action-class prediction and the foreground reliability. We demonstrate the effectiveness of the proposed method for weakly-supervised localization of both audio-visual events and actions. Extensive experiments are conducted on two video datasets for localizing audio-visual events (AVE) and actions (ActivityNet1.2). To the best of our knowledge, this is the first attempt to exploit audio-visual fusion for temporal localization of unconstrained actions in long videos.

2. RELATED WORK

Our work relates to the tasks of localizing actions and events in videos, as well as to multi-modal representation learning. Prior weakly-supervised methods have explored various ways to temporally differentiate action instances from the near-action background by exploiting only the visual modality, whereas we additionally utilize the audio modality for the same objective.

Audio-visual event localization:

The task of audio-visual event localization, as defined in the literature, is to classify each time-step into one of the event classes or background. This differs from action localization, where the goal is to determine the start and end of each instance of a given action class. In (Tian et al., 2018), a network with audio-guided attention was proposed, which showed prototypical results for audio-visual event localization and cross-modality synchronized event localization. To utilize both global and local cues in event localization, Lin et al. (2019) conducted audio-visual fusion at both the video level and snippet level using multiple LSTMs. Assuming single-event videos, Wu et al. (2019) detected the event-related snippets by matching the video-level feature of one modality with the snippet-level feature sequence of the other. In contrast, our cross-attention operates over the temporal sequences of both modalities and does not assume single-action videos. To address the temporal inconsistency between the audio and visual modalities, Xuan et al. (2020) devised a modality sentinel, which filters out the event-unrelated modalities. Encouraging results have been reported; however, the localization capability of these methods has been shown only on short fixed-length videos with distinct audio cues. Differently, we aim to fuse the audio and visual modalities in order to also localize actions in long, untrimmed and unconstrained videos. Deep multi-modal representation learning: Multi-modal representation learning methods aim to obtain powerful representations from multiple modalities (Guo et al., 2019). With the advancement of deep learning, many deep multi-modal representation learning approaches have been developed. Several methods fused features from different modalities in a joint subspace via the outer product (Zadeh et al., 2017), bilinear pooling (Fukui et al., 2016), and statistical regularization (Aytar et al., 2017).
The encoder-decoder framework has also been exploited for multi-modal learning, for image-to-image translation (Huang et al., 2018) and to produce musical translations (Mor et al., 2019). Our approach belongs to this category and uses cross-correlation as the cross-modality constraint. Cross-correlation has been exploited to generate visual features attended by text for visual question answering (Kim et al., 2017; Yu et al., 2017). It has also been used to obtain cross-attention for few-shot learning (Hou et al., 2019) and image-text matching (Lee et al., 2018; Wei et al., 2020). In our work, we adopt cross-correlation to generate both audio and visual features attended by each other. Most similar to our cross-attention mechanism is the cross-attention module of Hou et al. (2019), which computes cross-correlation spatially between the feature maps of two images (sample and query). In contrast, our cross-attention is designed for video and is computed between two temporal sequences of different modalities.

3. METHODOLOGY

In this section, we introduce the proposed framework for weakly-supervised action and event localization. Fig. 1 illustrates the complete framework. We first present the multi-stage cross-attention mechanism to generate the audio-visual features in Sec. 3.1. Then, we explain the open-max classification to robustly distinguish the actions from the unknown background in Sec. 3.2. Finally, in Sec. 3.3, we describe the training loss, including two consistency losses designed to enforce temporal continuity of the actions and background. Problem statement: We assume that a set of videos with only the corresponding video-level labels is given for training. For each video, we uniformly sample $L$ non-overlapping snippets and extract the audio features $U = (u_l)_{l=1}^{L} \in \mathbb{R}^{d_u \times L}$ with a pre-trained network, where $u_l$ is the $d_u$-dimensional audio feature of snippet $l$. Similarly, the snippet-wise visual features $V = (v_l)_{l=1}^{L} \in \mathbb{R}^{d_v \times L}$ are extracted. The video-level label is represented as $c \in \{0, 1, \ldots, C\}$, where $C$ is the number of action classes and $0$ denotes the background class. Starting from the audio and visual features, our approach learns to categorize each snippet into $C+1$ classes and hence localizes actions in a weakly-supervised manner.

3.1. MULTI-STAGE CROSS-ATTENTION MECHANISM

While multiple modalities can provide more information than a single one, modality-specific information may be lost while fusing them. To reliably fuse the two modalities, we develop the multi-stage cross-attention mechanism, where features are learned separately for each modality under constraints from the other modality. In this way, the learned features for each modality encode the inter-modal information while preserving the exclusive and meaningful intra-modal characteristics. As illustrated in Fig. 1, we first encode the input features $U$ and $V$ into $X_u = (x_u^l)_{l=1}^{L}$ and $X_v = (x_v^l)_{l=1}^{L}$ via the modality-specific fully-connected (FC) layers $f_u$ and $f_v$, where $x_u^l, x_v^l \in \mathbb{R}^{d_x}$. We then compute the cross-correlation of $X_u$ and $X_v$ to measure inter-modal relevance. To bridge the heterogeneity gap between the two modalities, we use a learnable weight matrix $W \in \mathbb{R}^{d_x \times d_x}$ and compute the cross-correlation as $\Lambda = X_u^{\top} W X_v$, where $\Lambda \in \mathbb{R}^{L \times L}$. Note that $x_u^l$ and $x_v^l$ are $\ell_2$-normalized before computing the cross-correlation. In the cross-correlation matrix, a high coefficient means that the corresponding audio and visual snippet features are highly relevant. Accordingly, the $l$th column of $\Lambda$ gives the relevance of $x_v^l$ to the $L$ audio snippet features. Based on this, we generate the cross-attention weights $A_u$ and $A_v$ by the column-wise soft-max of $\Lambda$ and $\Lambda^{\top}$, respectively. Then, for each modality, the attention weights are used to re-weight the snippet features to make them more discriminative given the other modality. Formally, the attention-weighted features are obtained by $\hat{X}_u = X_u A_u$ and $\hat{X}_v = X_v A_v$. Note that each modality guides the other through the attention weights, which ensures that meaningful intra-modal information is well-preserved while applying the cross-attention.
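A single stage of the cross-attention above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function name, feature shapes, and the identity-initialized `W` are assumptions for demonstration.

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Xu, Xv, W):
    """One stage of cross-attention between audio (Xu) and visual (Xv)
    snippet features, each of shape (d_x, L)."""
    # l2-normalize each snippet feature (column) before correlating
    Xu_n = Xu / np.linalg.norm(Xu, axis=0, keepdims=True)
    Xv_n = Xv / np.linalg.norm(Xv, axis=0, keepdims=True)
    Lam = Xu_n.T @ W @ Xv_n        # (L, L) cross-correlation matrix
    Au = softmax(Lam, axis=0)      # column-wise softmax of Λ
    Av = softmax(Lam.T, axis=0)    # column-wise softmax of Λᵀ
    # re-weight each modality's features by the other's relevance
    return Xu @ Au, Xv @ Av        # attended features, each (d_x, L)
```

Each output column is a relevance-weighted combination of the other modality's snippet features, so the audio stream becomes visually aware and vice versa.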
To delve deeper into cross-modal information, we apply the cross-attention multiple times. However, over multiple stages, the original modality-specific characteristics may be over-suppressed. To prevent this, we adopt dense skip connections (Huang et al., 2017). More specifically, at stage $t$, we obtain the attended audio features by $X_{att,u}^{(t)} = \tanh\big(\sum_{i=0}^{t-1} X_{att,u}^{(i)} + \hat{X}_u^{(t)}\big)$, where $X_{att,u}^{(0)} = X_u$ and $\tanh(\cdot)$ denotes the hyperbolic tangent activation function. The attended visual features $X_{att,v}^{(t)}$ are generated analogously. At the last stage $t_e$, we concatenate the attended audio and visual features to yield the audio-visual features $X_{att} = [X_{att,u}^{(t_e)}; X_{att,v}^{(t_e)}]$, where $t_e$ is empirically set to 2, as discussed in the ablation studies in Section 4.3. Discussion: Applying the cross-attention brings the audio and visual embeddings closer, while the skip connections retain modality-specific information, more so with dense skip connections. By alternating the cross-attention and the dense skip connections over multiple stages, we progressively learn embeddings for fusion that balance compatibility between the two modalities with the preservation of modality-specific information, so as to optimize the training objective.
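Putting the stages together, a sketch of the multi-stage fusion with dense skip connections could look as follows (again an illustrative NumPy sketch under assumed shapes, with one weight matrix per stage; not the trained model):

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(Xu, Xv, Ws, n_stages=2):
    """Multi-stage cross-attention with dense skip connections.
    Xu, Xv: (d_x, L) audio/visual embeddings; Ws: one (d_x, d_x) matrix per stage."""
    hist_u, hist_v = [Xu], [Xv]    # X_att^(0) is the input embedding
    for t in range(n_stages):
        Au_in, Av_in = hist_u[-1], hist_v[-1]
        Un = Au_in / np.linalg.norm(Au_in, axis=0, keepdims=True)
        Vn = Av_in / np.linalg.norm(Av_in, axis=0, keepdims=True)
        Lam = Un.T @ Ws[t] @ Vn                    # stage-t cross-correlation
        Xhat_u = Au_in @ softmax(Lam, axis=0)      # attended audio
        Xhat_v = Av_in @ softmax(Lam.T, axis=0)    # attended visual
        # dense skip: sum all previous attended features before tanh
        hist_u.append(np.tanh(sum(hist_u) + Xhat_u))
        hist_v.append(np.tanh(sum(hist_v) + Xhat_v))
    # concatenate the two attended streams at the last stage
    return np.concatenate([hist_u[-1], hist_v[-1]], axis=0)  # (2*d_x, L)
```

The dense skips re-inject every earlier stage's features, which is how the modality-specific signal survives repeated attention.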

3.2. OPEN-MAX CLASSIFICATION

Video segments can be dichotomized into foreground actions and background. For precise action localization, distinguishing the background from the actions is as important as categorizing the action classes. However, unlike the action classes, the background class comprises extremely diverse types of non-actions, so it is not possible to train for the full range of backgrounds the model may confront at test time. We therefore construct an open-max classifier with two parallel branches on top of the fused features (Fig. 1). The first FC layer is applied on the attended audio-visual feature $x_{att}^l$ of snippet $l$ and produces the action-class probabilities $p_{ac}^l$ via the soft-max function. Simultaneously, the second FC layer is applied on $x_{att}^l$, followed by a sigmoid function, to estimate its foreground reliability $\mu^l \in [0, 1]$. The foreground reliability $\mu^l$ is the probability of snippet $l$ belonging to any action class; low reliability indicates that no action occurs in the snippet. Therefore, we compute the probability for the background class as the complement of $\mu^l$, i.e., $p_{bg}^l = 1 - \mu^l$. Lastly, the open-max classifier outputs the probability distribution $p^l$ over $C+1$ classes, including the background and $C$ actions, as $p^l = [\,p_{bg}^l;\; \mu^l p_{ac}^l\,]$.
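The open-max head can be sketched as follows (a minimal NumPy illustration with assumed linear branches `W_ac` and `W_fg`; the actual model uses trained FC layers):

```python
import numpy as np

def open_max(x_att, W_ac, W_fg):
    """Open-max head. x_att: fused features (d, L); W_ac: (C, d) action-class
    branch; W_fg: (1, d) foreground-reliability branch."""
    logits = W_ac @ x_att                         # (C, L) action logits
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    p_ac = e / e.sum(axis=0, keepdims=True)       # softmax over C action classes
    mu = 1.0 / (1.0 + np.exp(-(W_fg @ x_att)))    # sigmoid foreground reliability
    p_bg = 1.0 - mu                               # background = complement of mu
    # open-max distribution over C+1 classes: [p_bg ; mu * p_ac]
    return np.vstack([p_bg, mu * p_ac])           # (C+1, L)
```

Since the action probabilities sum to one, $p_{bg}^l + \mu^l \sum_c p_{ac}^l(c) = (1-\mu^l) + \mu^l = 1$, so each column is a valid distribution over $C+1$ classes.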

3.3. TRAINING LOSS

Next, we describe the loss functions used to train our model. Actions and foreground do not change abruptly over time; to impose this constraint, we devise two types of temporal continuity losses. Foreground continuity loss: Foreground continuity implies two properties for neighboring snippets: (a) similar foreground reliability in a class-agnostic way, and (b) consistent open-max probabilities for a target foreground class. The first constraint is imposed via the class-agnostic foreground continuity
$$\mu_{ag}^l = \frac{1}{B+1} \sum_{i=-B/2}^{B/2} G(i)\, \mu^{l-i},$$
where $G(i)$ is a Gaussian window of width $B+1$ that applies temporal smoothing over the foreground reliability around the $l$th snippet. For the second constraint, the same temporal Gaussian smoothing is applied over the open-max probability of the video-level ground-truth action class $\hat{c}$ to obtain the class-specific foreground continuity
$$\mu_{sp}^l = \frac{1}{B+1} \sum_{i=-B/2}^{B/2} G(i)\, p^{l-i}(\hat{c}).$$
Finally, the foreground continuity loss is defined as
$$L_{cont} = \frac{1}{L} \sum_{l=1}^{L} \big( |\mu^l - \mu_{ag}^l| + |\mu^l - \mu_{sp}^l| \big).$$
The foreground continuity loss imposes temporal continuity of the foreground and hence also helps in separating the background from the action classes. Pseudo localization loss: Here, we consider the action or background class continuity, which implies that the open-max probabilities $p^l$ should agree with the classification of neighboring snippets. This can be used to obtain a pseudo label for snippet $l$. We first average the open-max predictions of the $N$ neighboring snippets and the snippet itself, $q^l = \frac{1}{N+1} \sum_{i=l-N/2}^{l+N/2} p^i$. We set $\hat{c}^l = \arg\max_c q^l(c)$ as the pseudo label, but only retain it when the largest class probability of $q^l$ exceeds a predefined threshold $\tau$. Accordingly, the pseudo localization loss is formulated as
$$L_{pseu} = \frac{1}{L} \sum_{l=1}^{L} \mathbb{1}\big(\max(q^l) \geq \tau\big)\,\big(-\log p^l(\hat{c}^l)\big).$$
Total loss: Additionally, we employ the multiple instance learning (MIL) and co-activity similarity (CAS) losses (Paul et al., 2018).
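The two continuity losses can be sketched as below. This is an illustrative NumPy version under simplifying assumptions (a normalized Gaussian kernel with edge padding for the temporal smoothing, and clipped windows at the sequence boundaries); function and parameter names are ours, not the authors'.

```python
import numpy as np

def smooth(seq, B=4, sigma=1.0):
    """Temporal Gaussian smoothing of a 1-D sequence (edge-padded)."""
    i = np.arange(-(B // 2), B // 2 + 1)
    G = np.exp(-(i ** 2) / (2.0 * sigma ** 2))
    G /= G.sum()                       # normalized window of width B+1
    padded = np.pad(seq, B // 2, mode="edge")
    return np.convolve(padded, G, mode="valid")

def foreground_continuity_loss(mu, p, c_hat, B=4):
    """L_cont: |mu - mu_ag| + |mu - mu_sp| averaged over snippets.
    mu: (L,) foreground reliability; p: (C+1, L) open-max probs; c_hat: GT class."""
    mu_ag = smooth(mu, B)              # class-agnostic continuity target
    mu_sp = smooth(p[c_hat], B)        # class-specific continuity target
    return np.mean(np.abs(mu - mu_ag) + np.abs(mu - mu_sp))

def pseudo_localization_loss(p, N=2, tau=0.9):
    """L_pseu: cross-entropy against neighborhood-averaged pseudo labels,
    applied only where the averaged prediction q^l is confident (>= tau)."""
    _, L = p.shape
    loss = 0.0
    for l in range(L):
        lo, hi = max(0, l - N // 2), min(L, l + N // 2 + 1)
        q = p[:, lo:hi].mean(axis=1)   # average over the neighborhood
        if q.max() >= tau:             # retain pseudo label only if confident
            loss += -np.log(p[np.argmax(q), l] + 1e-12)
    return loss / L
```

For a perfectly constant sequence both continuity targets equal the input, so `foreground_continuity_loss` vanishes, which matches the intent of the losses.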
The final loss $L$ is defined by $L = L_{mil} + \alpha L_{cas} + \beta L_{cont} + \gamma L_{pseu}$, where $L_{mil}$ and $L_{cas}$ denote the MIL and CAS losses, respectively; for details see Appendix D. Figs. 2(b) and (c) compare the class activation sequences along the temporal axis for the target classes between models trained without and with the two consistency losses, respectively. The class activations are visibly more continuous in the model trained with the consistency losses.

4. EXPERIMENTS

In this section, we provide experimental analysis and comparative evaluation to show the effectiveness of the proposed method. More experiments and qualitative results are in the Appendix.

4.1. DATASETS AND EVALUATION METHOD

Datasets: We evaluate our approach on the Audio-Visual Event (AVE) and ActivityNet1.2 datasets. The AVE dataset is constructed for audio-visual event localization; it contains 3,339 training and 804 testing videos, each lasting 10 seconds with event annotations per second. There are 28 audio-visual event categories covering a wide range of domains, such as animal and human actions, vehicle sounds, and music performances. Each event category has both audio and visual aspects, e.g. church bell, baby crying, man speaking. ActivityNet1.2 is a temporal action localization dataset with 4,819 training and 2,383 validation videos; following the literature, the validation set is used for evaluation. It has 100 action classes of wider variety than the AVE dataset, with on average 1.5 instances per video. The average video length in this dataset is 115 seconds, often with weak audio cues, which makes both action localization and leveraging audio harder. Evaluation metric: We follow the standard evaluation protocol of each dataset. For the AVE dataset, we report snippet-wise event prediction accuracy. For the ActivityNet1.2 dataset, we generate the action segments (start and end times) from the snippet-wise predictions (details are described in the following section), and then measure mean average precision (mAP) at different intersection-over-union (IoU) thresholds. For action localization, we follow the two-stage thresholding scheme of (Paul et al., 2018). The first threshold filters out the classes whose video-level scores are less than the average over all classes. The second threshold is applied along the temporal axis to obtain the start and end of each action instance.
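The two-stage thresholding can be sketched as follows (an illustrative NumPy version; the function name, the fixed second threshold `t2`, and the run-extraction logic are our assumptions, not the exact scheme of Paul et al., 2018):

```python
import numpy as np

def localize(scores, video_scores, t2=0.5):
    """Two-stage thresholding sketch.
    scores: (C, L) snippet-wise class activations.
    video_scores: (C,) video-level class scores.
    Returns a list of (class, start, end) segments over snippet indices."""
    segments = []
    # Stage 1: keep only classes scoring above the mean over all classes.
    keep = np.where(video_scores > video_scores.mean())[0]
    for c in keep:
        # Stage 2: threshold the temporal activation and emit maximal runs.
        above = scores[c] > t2
        L = len(above)
        l = 0
        while l < L:
            if above[l]:
                s = l
                while l < L and above[l]:
                    l += 1
                segments.append((int(c), s, l))  # half-open interval [s, l)
            else:
                l += 1
    return segments
```

Each contiguous run of above-threshold snippets becomes one candidate action instance, whose snippet indices are then mapped back to start/end times.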

Multi-stage cross-attention:

To evaluate the effectiveness of the multi-stage cross-attention for audio-visual fusion, we compare two uni-modal methods (audio-only and visual-only) and four multi-modal methods with different numbers of stages (0-3) on the AVE and ActivityNet1.2 datasets in Table 1. The pseudo-label losses and the open-max classifiers are used in all six cases. In the uni-modal methods, the input feature is embedded using an FC layer and then fed into the open-max classifier. The 0-stage method denotes a naive fusion, where audio and visual features are fused by simple concatenation. Even this naive fusion yields higher performance than the uni-modal methods on the AVE dataset. However, that is not the case for the more challenging task of action localization on the ActivityNet1.2 dataset. Furthermore, all the later stages improve considerably over the 0-stage and uni-modal cases on both datasets. The 2-stage cross-attention achieves the best performance on both datasets (more in Appendix A). Interestingly, even with the minimal audio cue in ActivityNet1.2 (the avg. mAP of audio-only is 7.8%), the proposed audio-visual features improve the avg. mAP over the visual-only and naive fusion (0-stage) models by 4%. Fig. 3 shows the qualitative results of the proposed and visual-only models on an example from the ActivityNet1.2 dataset. At the beginning of the video, a performer is shown without any activity. The visual-only model incorrectly predicts this part as a target action, while our proposed model correctly predicts it as background. The visual-only model also misses the action at the end of the video, since the frames there are visually similar and have minimal visual activity. In contrast, our model correctly recognizes the last part as an action, owing to the effective fusion of the two modalities by the multi-stage cross-attention. More qualitative results are in Appendix E.

Consistency losses:

We show the ablation over the two proposed losses, $L_{cont}$ and $L_{pseu}$, while using the open-max classifier and the 2-stage cross-attention, in the lower part of Table 2. We denote the method with only the $L_{cont}$ loss by O-I and with only the $L_{pseu}$ loss by O-II. The proposed method (O-III), with both losses, performs the best, indicating the importance of each. Further, O-II outperforms O-I by a large margin on both datasets, implying that the pseudo localization loss is the more critical one for action localization (more in Appendix B.1). This result demonstrates that guiding temporal continuity is essential in long untrimmed videos as well as short ones.

Open-max classifier:

We compare the open-max classifier with a soft-max classifier, in which the last FC layer outputs activations for $C+1$ classes that are normalized by the soft-max function. As the background is treated as a closed set in the soft-max approach, the foreground continuity loss is not applicable there. The soft-max variant is denoted by S-I in Table 2. Both the O-II and O-III versions of the open-max classifier outperform the S-I method. The O-III method improves the accuracy by 8.6% on the AVE dataset and the avg. mAP by 2.2% on the ActivityNet1.2 dataset (for further analysis see Appendix B.2). This shows the advantage of modelling the background with the open-max classifier. Dense skip connections: We evaluate the impact of dense skip connections for the 2-stage model on ActivityNet1.2 in Table 3. Compared to no skip connections, performance improves with skip connections, and is further boosted with dense skip connections to an avg. mAP of 26.0%. This shows that preserving the modality-specific information leads to better fusion and action localization.

4.4. MODEL EFFICIENCY

Though we successfully leverage the audio modality to improve action localization performance, the added modality increases the computational cost. The trade-off between efficiency and performance due to the fusion with the audio modality is shown in Table 4. With feature dimension $d_x = 1024$, the fusion increases the computation over the visual-only method by about 52% and 74% after 1 stage and 2 stages, respectively. When we reduce $d_x$ to 512, the visual-only model degrades while the 2-stage model maintains its performance at 25.9%. Thanks to the effectiveness of the proposed fusion, even with the smaller $d_x$ its avg. mAP remains well above that of the visual-only model with $d_x = 1024$, while using about 26% less computation (1.7 MFLOPS vs. 2.3 MFLOPS).

4.5. COMPARISON WITH THE STATE-OF-THE-ART

Audio-visual event localization: In Table 5, we compare the proposed method with recent fully and weakly-supervised methods on the AVE dataset for the audio-visual event localization task. In the weakly-supervised setting, our method outperforms all existing methods by at least 1.4%. Note that, even though it is trained with weak supervision, our approach achieves an accuracy (77.1%) comparable to the fully-supervised accuracy of the state-of-the-art method (Xuan et al., 2020). Temporal action localization: In Table 6, we apply the proposed method to weakly-supervised action localization in the long videos of the ActivityNet1.2 dataset. We report results for our method as well as its efficient version from Section 4.4. The mAP scores at varying IoU thresholds are compared with the current state-of-the-art methods. Both our method and its efficient version achieve the highest mAPs for 8 out of 10 IoU thresholds, and outperform all previous methods with an avg. mAP of 26.0%. We also significantly outperform the audio-visual method of Tian et al. (2018), by 17.2% avg. mAP. Additionally, we compare with two naive fusions without the cross-attention (0-stage, soft-max), with and without the continuity losses (denoted as CL in the table); both are bettered comfortably by our method. This demonstrates that the effective fusion of audio and visual modalities is critical for action localization. Furthermore, our approach is even comparable to the fully-supervised method of Zhao et al. (2017).

5. CONCLUSION

We presented a novel approach for weakly-supervised temporal action localization in videos. In contrast to other methods, we leveraged both audio and visual modalities for this task; this is the first attempt at audio-visual localization of unconstrained actions in long videos. To collaboratively fuse audio and visual features, we developed the multi-stage cross-attention mechanism, which also preserves the characteristics specific to each modality.

In Table 9, we analyze the consistency losses for the 0-, 1- and 3-stage models as well as the chosen 2-stage model. Effect of the losses on different stage models: The impact of the continuity losses is analogous across the 1-, 2- and 3-stage models. Each of the two continuity losses helps, but the pseudo localization loss ($L_{pseu}$) is more effective, and there is a further benefit to using them together for almost all IoU thresholds and stages. In the 0-stage model, i.e. without the cross-attention, O-II shows the highest snippet-level performance on the AVE dataset, but the lowest temporal action localization performance on the ActivityNet1.2 dataset. From this, we understand that $L_{pseu}$ has difficulty enforcing continuity when the audio and visual features are overly heterogeneous; consequently, a clear benefit is observed when the cross-attention is used. Interdependence of cross-attention and pseudo localization loss: Comparing O-I across the 0-3 stage models, we see that the performance improvement from stacking the cross-attention alone is marginal, and the pseudo localization loss is critical to the performance. This follows from the definition of $L_{pseu}$, which is only activated at snippet $l$ when the classification over its neighboring snippets does not strongly agree on the action class or background. To analyze this, we check how frequently $L_{pseu}$ is activated with and without the cross-attention. For the 0-stage model, without the continuity losses, $L_{pseu}$ is activated on 11.1% of the snippets of the ActivityNet1.2 training set.
The same frequency is 38.2% for the 2-stage model, again without the continuity losses. This shows that, when the cross-attention is used, the open-max classification of a snippet more often fails to strongly agree with its neighbors; therefore, the pseudo localization loss is much needed to enforce continuity. In Fig. 4(c), the brushing sound overlaps with loud human narration lasting throughout the video. Nevertheless, the proposed method effectively extracts the crucial audio cues and fuses them with the visual ones. In Fig. 4(d), even though the early part of the action is visually occluded by large logos, our method localizes the action exactly. Also, for all of the class activation sequences, the activations of the proposed method are more consistently high for actions, meaning that our collaboration of the audio and visual modalities is more robust in distinguishing foreground from background. Fig. 5 illustrates cases where audio degrades the performance. Fig. 5(a) shows an example video for the action class 'playing violin'. The violin sounds of the soloist and the band are intermingled in the video. At the end, the sound of the violin continues, making our model predict the action, but since the camera focuses on the band, the ground truth does not include those frames. Fig. 5(b) shows an example of the action 'using parallel bars', where the repeated background music is irrelevant to the action.



Footnotes:
AVE dataset: https://github.com/YapengTian/AVE-ECCV18
ActivityNet1.2 dataset: http://activity-net.org/download.html
For brevity, we refer to both actions and events as actions.



Weakly-supervised action localization: Wang et al. (2017) and Nguyen et al. (2018) employed multiple instance learning (Dietterich et al., 1997) along with an attention mechanism to localize actions in videos. Paul et al. (2018) introduced a co-activity similarity loss that looks for similar temporal regions in a pair of videos containing a common action class. Narayan et al. (2019) proposed a center loss for the discriminability of action categories at the global level and a counting loss for the separability of instances at the local level. To alleviate the confusion due to background (non-action) segments, Nguyen et al. (2019) developed a top-down class-guided attention to model the background, and Yu et al. (2019) exploited temporal relations among video segments. Jain et al. (2020) segmented a video into interpretable fragments, called ActionBytes, and used them effectively for action proposals. To distinguish action and context (near-action) snippets, Shi et al. (2020) designed a class-agnostic frame-wise probability conditioned on the attention using a conditional variational auto-encoder. Luo et al. (2020) proposed an expectation-maximization multi-instance learning framework where the key instance is modeled as a hidden variable.

Figure 1: The proposed architecture has two parts: modality fusion and open-max classification. (a) Fusion by multi-stage cross-attention: The input audio ($U$) and visual ($V$) features are embedded by the two fully-connected layers $f_u$ and $f_v$, and passed through the multiple stages of the cross-attention. At the $t$th stage, the attended audio-visual embeddings, $X_{att,u}^{(t)}$ and $X_{att,v}^{(t)}$, are calculated using the results from the previous stages through dense skip connections, and activated by a non-linear function. Here, c and + denote the concatenation and summation operations. This figure shows the 2-stage case; the dense skip connections of the two stages are depicted as green and yellow arrows, respectively. At the last stage, the two attended features are concatenated. (b) The open-max classifier takes the concatenated audio-visual features as input and generates classification scores for the action classes and the background. A more detailed description is given in Appendix C.

To resolve this problem, we address the background as an open set (Dietterich, 2017; Bendale & Boult, 2016). As illustrated in Fig. 1, we construct an open-max classifier on top of the multi-stage cross-attentional feature fusion. Specifically, the open-max classifier consists of two parallel fully-connected branches, as detailed in Sec. 3.2.

Figure 2: Visualization of the class activation sequences for the target actions in two example videos: The ground-truth segments are shown in (a). The class activation sequences obtained without $L_{cont}$ and $L_{pseu}$ are shown in (b); they improve and become better aligned to the ground-truth segments when these continuity losses are used, as shown in (c). The activation is depicted in gray-scale, where lower intensity indicates stronger activation.

Ablations for the consistency losses and open-max classifier. Consistency losses: The lower part of the table shows the impact of each of the two consistency losses, when used with the open-max classifier. Open-max vs soft-max: The results for the soft-max are also shown, which demonstrates the advantage of foreground/background modelling by the open-max classification on both the datasets. The model with 2-stage cross-attention is used.

Figure 3: Visualization of the action localization result for an example video from ActivityNet1.2. The ground truth is shown in (a), highlighted in green. The localization and the class activation sequence of the visual-only model are shown in (b) and (c), respectively. Finally, the localization and the class activation sequence for the proposed audio-visual method are shown in (d) and (e).

FEATURE EXTRACTION AND IMPLEMENTATION DETAILS

Feature extraction: We use the I3D network (Carreira & Zisserman, 2017) and the ResNet-152 architecture (He et al., 2016) to extract the visual features for ActivityNet1.2 and AVE, respectively. The I3D network is pre-trained on Kinetics-400 (Kay et al., 2017), and its features consist of two components: RGB and optical flow. The ResNet-152 is pre-trained on ImageNet (Russakovsky et al., 2015), and the features are extracted from the last global pooling layer. To extract the audio features, we use the VGG-like network (Hershey et al., 2017), pre-trained on AudioSet (Gemmeke et al., 2017), for both the AVE and ActivityNet1.2 datasets. Implementation details: We set $d_x$ to 1,024, and the LeakyReLU and hyperbolic tangent functions are used for the activations of the modality-specific layers and the cross-attention modules, respectively. In training, the parameters are initialized with the Xavier method (Glorot & Bengio, 2010) and updated by the Adam optimizer (Kingma & Ba, 2015) with a learning rate of $10^{-4}$ and a batch size of 30. Dropout with a ratio of 0.7 is applied to the final attended audio-visual features. In the loss, the hyper-parameters are set as $B = 4$, $\alpha = 0.8$, $\beta = 0.8$ and $\gamma = 1$. Localization at test time: For event localization at test time, i.e. snippet classification, each snippet $l$ is classified into one of the event classes (including background) by $\arg\max_c p^l(c)$, where $p^l$ is the open-max probability of snippet $l$.
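For reference, the stated hyper-parameters can be collected in a single configuration (the dictionary layout and key names are our own convention, not from the paper):

```python
# Hypothetical training configuration collecting the hyper-parameters
# reported in the implementation details above.
config = {
    "d_x": 1024,                 # embedding dimension
    "stages": 2,                 # t_e, number of cross-attention stages
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "batch_size": 30,
    "dropout": 0.7,              # on the final attended audio-visual features
    "gaussian_window_B": 4,      # B in the foreground continuity loss
    "loss_weights": {"alpha": 0.8, "beta": 0.8, "gamma": 1.0},
}
```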

Figure 4: Qualitative results for action localization. Ground-truth (green), prediction by the visual-only method (orange), and prediction by the proposed method (blue) are shown. Class activation sequences are visualized below each prediction, where a darker shade means higher activation.

Ablation for multi-stage cross-attention. The results for different numbers of stages of the cross-attention are reported for the AVE and ActivityNet1.2 datasets. The comparison with the uni-modal approach shows the impact of leveraging the multi-modality and the cross-attention.

Impact of dense skip connections.

Comparison of the number of FLOPs and the average mAP@[0.5:0.05:0.95] on the ActivityNet1.2 dataset for the visual-only, 1-stage, and 2-stage models. d_x × d_x are the dimensions of the cross-correlation matrix W.

Comparison of the proposed method with the state-of-the-art fully and weakly-supervised methods (separated by '/') on the AVE dataset. Snippet-level accuracy (%) is reported.

Comparison of our method with the state-of-the-art action localization methods on the ActivityNet1.2 dataset. The mAPs (%) at different IoU thresholds and the average mAP across the IoU thresholds are reported. † indicates audio-visual models. The marked experiment was done using the authors' code.

We proposed to use the open-max classifier to model the action foreground and background in the absence of temporal annotations. Our model learns to classify video snippets via two consistency losses that enforce temporal continuity of the foreground reliability and of the open-max probabilities for the action classes and the background. We conducted extensive experiments to analyze each of the proposed components and demonstrate their importance. Our method outperforms the state of the art on both the AVE and ActivityNet1.2 datasets.

…setting (d_{x,u} = 1024, d_{x,v} = 1024) of the 2-stage model, even with more parameters. Instead of increasing the parameters in the 1-stage model itself, better performance is achieved when an additional stage (i.e., a weight matrix learned with a non-linear activation function) is added. Indeed, it is often not trivial to replace a sequence of non-linear functions with a single non-linear function, as we experimentally observe here. The intention behind the multi-stage design is also to delve extensively into the cross-modal information, progressively learning the embeddings for each modality.

Ablation analysis of the consistency losses for the open-max classifier (O-0, O-I, O-II, and O-III) with 0- to 3-stage models on the AVE and ActivityNet1.2 datasets. O-III with the 2-stage model is the proposed method.

ANALYSIS OF LOSSES AND OPEN-MAX CLASSIFICATION ON THE 2-STAGE MODEL

In Table 10, we conduct a more extensive analysis of the consistency losses and the open-max classifier. Specifically, we replace the open-max classification approach with a soft-max one. Then, for both classifiers with the 2-stage cross-attention, we ablate the foreground-continuity and pseudo-localization losses.
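For concreteness, a minimal sketch of what a temporal-continuity penalty can look like. The paper's exact loss definitions are not reproduced in this section, so this is only one plausible instantiation (a first-difference smoothness term) with an illustrative name.

```python
import numpy as np

def continuity_loss(p):
    """Temporal-continuity penalty: mean absolute difference between
    consecutive snippets' predictions.

    p: (L, C) class-probability sequence, or (L,) foreground-reliability
    scores. Temporally smooth predictions incur a low penalty, which
    encourages contiguous action segments rather than flickering labels.
    """
    return np.abs(np.diff(p, axis=0)).mean()
```

A constant sequence incurs zero loss, while a sequence that flips between classes at every snippet is penalized heavily.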

A ANALYSIS ON MULTI-STAGE CROSS-ATTENTION

In this section, we conduct an extensive analysis of the impact of the multiple stages and the dense skip connection of the proposed cross-attention mechanism. Tables 7 and 8 show the experimental results.

Training multiple stages of cross-attention: As shown in Table 1, the 3-stage model suffers a performance drop. To analyze this, in Table 7 we compare the 2- and 3-stage models under each of 'w/o skip connection', 'w/ skip connection', and 'w/ dense skip connection'. Without the skip connection, the 3-stage model improves over the 2-stage model, which is intuitively expected. With the skip connection, the avg. mAP of the 3-stage model drops compared to the 2-stage model, from 24.9% to 23.2%. However, when the third stage is appended to the trained (and now frozen) stages of the 2-stage model, the avg. mAP is maintained at 24.9%. Similarly, with the dense skip connection, training the entire 3-stage model end-to-end leads to degraded performance, but when training with the model frozen up to the second stage, the drop is much smaller. The fact that, for the 3-stage model, better performance is obtained when training with the first two stages frozen than when training end-to-end shows that the optimization becomes harder in the latter case. Therefore, we conclude that although the third stage helps without the skip connections, due to the harder optimization with more stages and (dense) skip connections, the 2-stage model is the optimal choice.
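Under the description above (a d_x × d_x cross-correlation matrix W per stage, tanh activation, and fusion of the two streams only after the last stage), a multi-stage cross-attention forward pass might look as follows in NumPy. This is a sketch under our own assumptions: the function names are ours, and the softmax-based temporal re-weighting is one plausible reading, not the paper's exact formulation.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_stage(xu, xv, w):
    """One cross-attention stage.

    xu, xv: (d, L) audio / visual feature sequences; w: (d, d) learnable
    cross-correlation matrix. Each modality's frames are re-weighted
    along time by their tanh-activated correlation with the other
    modality, so inter-modal interaction happens only through attention.
    """
    corr = np.tanh(xu.T @ w @ xv)            # (L, L) cross-correlation
    att_u = xu @ softmax(corr, axis=0)       # visually-aware audio features
    att_v = xv @ softmax(corr.T, axis=0)     # audio-aware visual features
    return att_u, att_v

def multi_stage(xu, xv, ws):
    """Stack stages; the two streams are concatenated only at the end."""
    for w in ws:
        xu, xv = cross_attention_stage(xu, xv, w)
    return np.concatenate([xu, xv], axis=0)  # (2d, L) audio-visual feature
```

Skip or dense-skip connections, analyzed above, would simply add each stage's input (or all earlier stages' outputs) to `att_u` and `att_v` before the next stage.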

D MULTIPLE INSTANCE LOSS AND CO-ACTIVITY SIMILARITY LOSS

We apply the multiple-instance learning loss for classification. The prediction score corresponding to a class is computed as the average of its top-k activations over the temporal dimension. The co-activity similarity loss (CASL) (Paul et al., 2018) is computed over two snippet sequences from a pair of videos, encouraging higher similarity when the videos share a common class.
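The top-k pooling step described above can be sketched as follows (`video_score` is an illustrative name; the exact value of k and the subsequent soft-max plus cross-entropy against the video-level labels are applied on top of these scores).

```python
import numpy as np

def video_score(activations, k):
    """Video-level class scores from snippet-level class activations.

    activations: (L, C) temporal class-activation sequence. For each
    class, the video-level score is the mean of its top-k activations
    over the temporal dimension, as in multiple-instance learning.
    """
    top = np.sort(activations, axis=0)[-k:]   # k highest values per class
    return top.mean(axis=0)                   # (C,) video-level scores
```

For instance, with activations [[1, 0], [3, 0], [2, 5]] and k = 2, both classes receive a score of 2.5 (mean of {3, 2} and of {5, 0}).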

E QUALITATIVE EVALUATION

We provide additional qualitative results for action localization on the ActivityNet1.2 dataset. Fig. 4 compares the proposed method with the model trained on the visual modality only ('Visual-only'). The class activation is a bit off in the last part; however, thanks to the visual modality, the prediction is still reasonable.

