CONTRASTIVE AUDIO-VISUAL MASKED AUTOENCODER

Abstract

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on AudioSet in the audio-visual event classification task. Code and pretrained models are at https://github.com/yuangongnd/cav-mae.

1. INTRODUCTION

Acoustic and visual modalities have different properties, yet humans are able to seamlessly connect and integrate them to perceive the world. Developing learning algorithms to replicate these abilities, especially for multi-modal audio-visual fusion and retrieval, is of great interest. Since manually annotating audio and video is expensive and difficult to scale, how to utilize web-scale unlabeled video data in a self-supervised manner has become a core research question.

One major line of audio-visual self-supervised learning research leverages the natural audio-visual correspondences found in videos. Among numerous ways to use such correspondences, Contrastive Audio-Visual Learning has been shown to be a simple yet effective approach (Arandjelovic & Zisserman, 2018; Morgado et al., 2021b; Rouditchenko et al., 2021). It learns coordinated representations that are closer for paired audio and visual samples than for mismatched samples. Such coordinated representations are particularly useful for tasks such as cross-modal retrieval.

Another widely used self-supervised learning framework is Masked Data Modeling (MDM), which learns a meaningful representation with the pretext task of recovering the original inputs or features from the corrupted ones (Devlin et al., 2019). In particular, based on the Audio Spectrogram Transformer (Gong et al., 2021a) and Vision Transformer (Dosovitskiy et al., 2020) backbones, the single-modal Masked Auto-Encoder (MAE) (He et al., 2022) achieved state-of-the-art (SOTA) performance on image and audio tasks (Huang et al., 2022a) individually. Inspired by these advances, we propose to extend the single-modal MAE to the Audio-Visual Masked Auto-Encoder (AV-MAE), aiming to learn a joint representation that fuses the unimodal signals.

Although these two major self-supervised frameworks have been widely used individually, to the best of our knowledge, they have never been combined in audio-visual learning.
In fact, we find they are complementary: contrastive audio-visual learning explicitly leverages the very useful audio-visual pair information, but it could discard modality-unique information that is useful in downstream tasks; the reconstruction task of AV-MAE forces its representation to encode the majority of the input information in the fusion, but it lacks an explicit audio-visual correspondence objective.

This motivates us to design the Contrastive Audio-Visual Masked Autoencoder (CAV-MAE), which integrates contrastive learning and masked data modeling and learns a joint and coordinated audio-visual representation with a single model. Our experiments support our design: on audio-visual event classification, CAV-MAE significantly outperforms baseline models trained with only contrastive or masked data modeling objectives, demonstrating that the two objectives are complementary in learning a strong joint audio-visual representation. As a result, CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on AudioSet. Moreover, on audio-visual retrieval, CAV-MAE performs as well as or better than models trained with only the contrastive objective, which demonstrates that CAV-MAE can learn both a joint and a coordinated representation well. Finally, CAV-MAE multi-modal pretraining improves single-modal performance; consequently, CAV-MAE achieves a new SOTA for audio-based event classification on AudioSet-20K and VGGSound.

In summary, our contributions are: (1) We extend the single-modal MAE to multi-modal AV-MAE, which fuses audio-visual inputs for self-supervised learning through cross-modal masked data modeling; (2) More importantly, we investigate how to best combine contrastive audio-visual learning with masked data modeling and propose CAV-MAE; (3) We demonstrate that contrastive and masked data modeling objectives are complementary. As a result, CAV-MAE matches or outperforms SOTA models on audio-visual classification.

2. CONTRASTIVE AUDIO-VISUAL MASKED AUTOENCODER

2.1 PRELIMINARIES

2.1.1. AUDIO AND IMAGE PRE-PROCESSING AND TOKENIZATION

As depicted in Figure 1 (A), we follow pre-processing and tokenization in AST (Gong et al., 2021a) and ViT (Dosovitskiy et al., 2020) for audio and image inputs, respectively. Specifically, we use 10-second videos (with parallel audio) in AudioSet (Gemmeke et al., 2017) and VGGSound (Chen et al., 2020) to pretrain and fine-tune the model. For audio, each 10-second audio waveform is first converted to a sequence of 128-dimensional log Mel filterbank (fbank) features computed with a 25ms Hanning window every 10ms. This results in a 1024 (time) × 128 (frequency) spectrogram. We then split the spectrogram into 512 16 × 16 square patches $a = [a_1, ..., a_{512}]$ as the input of the model.

Processing video with Transformer models is expensive and typically requires industrial-level computation resources. To lower the computational overhead and fit our resources, we use a frame aggregation strategy. Specifically, we uniformly sample 10 RGB frames from each 10-second video (i.e., 1 FPS). During training, we randomly select one RGB frame as the input; during inference, we average the model predictions of all RGB frames as the video prediction. Compared with concatenating multiple RGB frames as the input of the Transformer, which has a quadratic complexity (e.g., in Nagrani et al. (2021)), frame aggregation is much more efficient, with a linear complexity in time, at the cost of not considering inter-frame correlation. For each RGB frame, we resize and center crop it to 224 × 224, and then split it into 196 16 × 16 square patches $v = [v_1, ..., v_{196}]$.
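The token counts and the frame aggregation step above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the helper names (`num_patches`, `attention_pairs`, `aggregate_frames`) and the toy scores are hypothetical, and attention cost is approximated simply as the number of token pairs.

```python
def num_patches(height, width, patch=16):
    """Number of non-overlapping patch x patch tiles in a height x width grid."""
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)


def attention_pairs(num_tokens):
    """Rough proxy for self-attention cost: every token attends to every token."""
    return num_tokens * num_tokens


def aggregate_frames(frame_predictions):
    """Average per-frame class scores into a single video-level prediction."""
    n = len(frame_predictions)
    num_classes = len(frame_predictions[0])
    return [sum(p[c] for p in frame_predictions) / n for c in range(num_classes)]


audio_tokens = num_patches(1024, 128)  # 1024 time steps x 128 Mel bins
image_tokens = num_patches(224, 224)   # one resized, center-cropped RGB frame
print(audio_tokens, image_tokens)      # 512 196

# Cost proxy for 10 frames: concatenated in one sequence vs. processed one by one.
frames = 10
concatenated = attention_pairs(frames * image_tokens)
aggregated = frames * attention_pairs(image_tokens)
print(concatenated // aggregated)      # 10

# Toy per-frame scores for 2 frames and 3 classes (made-up numbers).
print(aggregate_frames([[0.2, 0.6, 0.2], [0.4, 0.2, 0.4]]))
```

Under this proxy, processing frames independently grows linearly with the number of frames, while concatenation grows quadratically, which is the trade-off the text describes.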

2.1.2. THE TRANSFORMER ARCHITECTURE

Throughout this paper, we use the standard Transformer (Vaswani et al., 2017) as our main model component. Each Transformer layer consists of multi-headed self-attention (MSA), layer normalization (LN), and multilayer perceptron (MLP) blocks with residual connections. Specifically, we denote a Transformer layer $y = \mathrm{Transformer}(x; \mathrm{MSA}, \mathrm{LN1}, \mathrm{LN2}, \mathrm{MLP})$ as:

$$x' = \mathrm{MSA}(\mathrm{LN}_1(x)) + x; \quad y = \mathrm{MLP}(\mathrm{LN}_2(x')) + x' \tag{1}$$

where MSA computes dot-product attention of each element of $x$ and thus has a quadratic complexity w.r.t. the size of $x$. Please refer to Vaswani et al. (2017) for further details on Transformers.

2.1.3. CONTRASTIVE AUDIO-VISUAL LEARNING (CAV)

The natural pairing of audio and visual information in videos is a useful signal for learning audio-visual representations through self-supervision. A conventional CAV model is shown in Figure 1.B (top). For a mini-batch of $N$ audio-visual pair samples, we first pre-process and tokenize the audio and images and get a sequence of audio and visual tokens $\{a_i, v_i\}$ for each sample $i$. We then input $a_i$ and $v_i$ to independent audio and visual Transformer encoders $E_a(\cdot)$ and $E_v(\cdot)$, respectively, and get the mean pooled audio and visual representations $c^a_i$ and $c^v_i$, i.e., $c^a_i = \mathrm{MeanPool}(E_a(\mathrm{Proj}_a(a_i)))$ and $c^v_i = \mathrm{MeanPool}(E_v(\mathrm{Proj}_v(v_i)))$, where $\mathrm{Proj}_a$ and $\mathrm{Proj}_v$ are linear projections that map each audio and visual token to $\mathbb{R}^{768}$. We then apply a contrastive loss (Equation 7) on $c^a_i$ and $c^v_i$.

Figure 1: An illustration of our method. A) We tokenize audio spectrograms and RGB images into 16×16 square patches and use them as the input to all models. B) Conventional contrastive audio-visual learning model (top) and vanilla audio-visual masked auto-encoder (bottom, also novel and first introduced in this paper). C) Our proposed contrastive audio-visual masked auto-encoder (CAV-MAE) model. CAV-MAE integrates two major self-supervised frameworks: contrastive audio-visual learning and cross-modal masked data modeling. It learns a joint and coordinated representation and performs well on both multi-modal joint classification tasks and cross-modal retrieval tasks.

2.1.4. SINGLE MODALITY MASKED AUTOENCODER (MAE)

Another line of major self-supervised frameworks is masked data modeling (MDM). Among numerous variants of MDM (e.g., Bao et al. (2021); Wei et al. (2022)), the masked auto-encoder (MAE) is a simple yet effective approach. For an input sample $x$ that can be tokenized as $x = [x_1, x_2,$ .. $v' = \mathrm{Proj}_v(v) + E_v + E^p_v$. We concatenate $a'$ and $v'$ and construct a joint embedding $x = [a', v']$. We then mask a portion (75%) of $x$ and only input unmasked tokens $x_{\mathrm{unmask}} = x \setminus x_{\mathrm{mask}}$ to an audio-visual joint encoder $E_j(\cdot)$ and get the output $x'_{\mathrm{unmask}}$. After that, we pad $x'_{\mathrm{unmask}}$ with trainable masked tokens at their original positions as $x'$. Again, we also add modality type embeddings $E'_a$ and $E'_v$ and modality-specific 2-D sinusoidal positional embeddings $E^{p\prime}_a$ and $E^{p\prime}_v$ before feeding $x'$ to a joint audio-visual decoder $D_j(\cdot)$ to reconstruct the input, i.e.,

$$\hat{a}, \hat{v} = D_j(x' + [E'_a, E'_v] + [E^{p\prime}_a, E^{p\prime}_v])$$

Finally, we minimize the mean square error (MSE) between $\hat{a}, \hat{v}$ and the normalized $a, v$.

Compared with single-modal MAEs, the AV-MAE features a cross-modal masked data modeling objective that allows the model to reconstruct one modality based on the information of another modality, which may help the model learn audio-visual correlation. However, without an explicit objective encouraging paired audio-visual correspondence, vanilla AV-MAE does not effectively leverage the audio-visual pairing information (discussed in Appendix J). Also, using a joint encoder for two modalities allows cross-modal attention, but it also means the two very different modalities are processed with the same weights, which could lead to a sub-optimal solution.
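To make the reconstruction target concrete, here is a minimal sketch of an MSE computed against normalized patches. It is an illustration under stated assumptions, not the authors' implementation: the text only says the targets are "normalized", so the per-patch standardization below (zero mean, unit variance within each patch) is an assumption borrowed from common MAE practice, the restriction to masked patches follows the later description of the CAV-MAE loss, and the helper names are hypothetical.

```python
import math


def normalize_patch(patch, eps=1e-6):
    """Standardize one flattened patch to zero mean and unit variance (assumed norm)."""
    mean = sum(patch) / len(patch)
    var = sum((p - mean) ** 2 for p in patch) / len(patch)
    return [(p - mean) / math.sqrt(var + eps) for p in patch]


def masked_mse(predicted, original, masked_idx):
    """Mean squared error over masked patches only, against normalized targets.

    The error is averaged over the elements of each patch and then over the
    masked patches; the within-patch averaging is an assumption of this sketch.
    """
    total = 0.0
    for i in masked_idx:
        target = normalize_patch(original[i])
        total += sum((p - t) ** 2 for p, t in zip(predicted[i], target)) / len(target)
    return total / len(masked_idx)


# Toy example: 3 flattened patches of 4 values each; patches 0 and 2 are masked.
original = [[1.0, 2.0, 3.0, 4.0], [5.0, 5.0, 6.0, 6.0], [0.0, 0.0, 1.0, 1.0]]
perfect = [normalize_patch(p) for p in original]
zeros = [[0.0] * 4 for _ in original]
print(masked_mse(perfect, original, [0, 2]))  # a perfect reconstruction gives 0.0
print(masked_mse(zeros, original, [0, 2]))    # predicting all zeros gives a positive loss
```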

2.3. CONTRASTIVE AUDIO-VISUAL MASKED AUTOENCODER (CAV-MAE)

As discussed in Sections 2.1.3 and 2.2, contrastive audio-visual learning and AV-MAE each have their advantages and disadvantages. Can we integrate the complementary advantages of CAV and AV-MAE? With this goal, we design the Contrastive Audio-Visual Masked Autoencoder (CAV-MAE) (shown in Figure 1.C). For a mini-batch of $N$ audio-visual pair samples, we first pre-process and tokenize the audio and images and get a sequence of audio and visual tokens $\{a_i, v_i\}$ for each sample $i$ and project them to $\mathbb{R}^{768}$ with two modal-specific linear projection layers. We also add modality type embeddings $E_a$ and $E_v$ and modality-specific 2-D sinusoidal positional embeddings $E^p_a$ and $E^p_v$. After that, we uniformly mask 75% of the tokens of each modality, i.e.,

$$a^{\mathrm{unmask}}_i = \mathrm{Mask}_{0.75}(\mathrm{Proj}_a(a_i) + E_a + E^p_a)$$
$$v^{\mathrm{unmask}}_i = \mathrm{Mask}_{0.75}(\mathrm{Proj}_v(v_i) + E_v + E^p_v)$$

We then input $a^{\mathrm{unmask}}_i$ and $v^{\mathrm{unmask}}_i$ to independent audio and visual Transformer encoders $E_a(\cdot)$ and $E_v(\cdot)$ and get $a'_i$ and $v'_i$, respectively. After that, we apply multi-stream forward passes to input $a'_i$, $v'_i$ to a joint audio-visual encoder $E_j(\cdot; \mathrm{MSA}, \mathrm{LN1}, \mathrm{LN2}, \mathrm{MLP})$. Specifically, we input audio tokens $a'_i$, video tokens $v'_i$, and concatenated audio-visual tokens $[a'_i, v'_i]$ in three independent forward passes to $E_j$. For each stream, we use different layer normalization layers $\mathrm{LN1}_{\{a,v,av\}}$ and $\mathrm{LN2}_{\{a,v,av\}}$; all other weights (i.e., weights of the MSA and MLP) of $E_j$ are shared across the three streams. Formally,

$$c^a_i = \mathrm{MeanPool}(E_j(E_a(a^{\mathrm{unmask}}_i); \mathrm{LN1}_a, \mathrm{LN2}_a))$$
$$c^v_i = \mathrm{MeanPool}(E_j(E_v(v^{\mathrm{unmask}}_i); \mathrm{LN1}_v, \mathrm{LN2}_v))$$
$$x_i = E_j([E_a(a^{\mathrm{unmask}}_i), E_v(v^{\mathrm{unmask}}_i)]; \mathrm{LN1}_{av}, \mathrm{LN2}_{av})$$

We use the outputs of the audio and visual single-modality streams $c^a_i$ and $c^v_i$ for contrastive learning and the output of the audio-visual multi-modal stream $x_i$ for the reconstruction task.
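As a rough illustration of the masking step and of the sequence lengths seen by each of the three streams, consider the following sketch. It only tracks token indices, not embeddings; the helper name `random_mask` and the fixed seed are hypothetical choices for this example, and the token counts (512 audio, 196 visual) come from Section 2.1.1.

```python
import random


def random_mask(num_tokens, ratio=0.75, rng=random):
    """Uniformly mask `ratio` of the tokens; return (kept, masked) index lists."""
    num_masked = int(num_tokens * ratio)
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    kept = sorted(order[num_masked:])
    return kept, masked


rng = random.Random(0)  # fixed seed only to make the example repeatable
audio_kept, audio_masked = random_mask(512, rng=rng)
visual_kept, visual_masked = random_mask(196, rng=rng)

# Sequence lengths seen by the three joint-encoder streams.
stream_lengths = {
    "audio": len(audio_kept),                      # audio-only stream
    "visual": len(visual_kept),                    # visual-only stream
    "audio-visual": len(audio_kept) + len(visual_kept),  # concatenated stream
}
print(stream_lengths)  # {'audio': 128, 'visual': 49, 'audio-visual': 177}
```

With a 75% masking ratio, each encoder only processes a quarter of its tokens, so the concatenated stream carries 177 tokens rather than 708.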
For contrastive audio-visual learning, we use the contrastive loss $\mathcal{L}_c$:

$$\mathcal{L}_c = -\frac{1}{N}\sum_{i=1}^{N}\log\left[\frac{\exp(s_{i,i}/\tau)}{\sum_{k\neq i}\exp(s_{i,k}/\tau) + \exp(s_{i,i}/\tau)}\right] \tag{7}$$

where $s_{i,j} = \|c^v_i\|^T \|c^a_j\|$ and $\tau$ is the temperature.

For the reconstruction task, we pad $x_i$ with trainable masked tokens at their original positions as $x'_i$. We also add modality type embeddings $E'_a$ and $E'_v$ and modality-specific 2-D sinusoidal positional embeddings $E^{p\prime}_a$ and $E^{p\prime}_v$ before feeding $x'_i$ to a joint audio-visual decoder $D_j(\cdot)$ to reconstruct the input audio and image. $D_j(\cdot)$ processes audio and visual tokens with the same set of weights except for the last modal-specific projection layer, and outputs $\hat{a}_i$ and $\hat{v}_i$. We then apply a mean square error reconstruction loss $\mathcal{L}_r$:

$$\hat{a}_i, \hat{v}_i = D_j(x'_i + [E'_a, E'_v] + [E^{p\prime}_a, E^{p\prime}_v]) \tag{8}$$

$$\mathcal{L}_r = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{\sum(\hat{a}^{\mathrm{mask}}_i - \mathrm{norm}(a^{\mathrm{mask}}_i))^2}{|a^{\mathrm{mask}}_i|} + \frac{\sum(\hat{v}^{\mathrm{mask}}_i - \mathrm{norm}(v^{\mathrm{mask}}_i))^2}{|v^{\mathrm{mask}}_i|}\right] \tag{9}$$

where $N$ is the mini-batch size; $a^{\mathrm{mask}}$, $v^{\mathrm{mask}}$, $\hat{a}^{\mathrm{mask}}$, $\hat{v}^{\mathrm{mask}}$ denote the original and predicted masked patches (we only calculate the loss based on the masked portion of the input); $|a^{\mathrm{mask}}|$ and $|v^{\mathrm{mask}}|$ denote the number of masked audio and visual patches, respectively.

Finally, we sum up the contrastive loss $\mathcal{L}_c$ (multiplied by a weight $\lambda_c$) and the reconstruction loss $\mathcal{L}_r$ as the loss for CAV-MAE, i.e., $\mathcal{L}_{\mathrm{CAV\text{-}MAE}} = \mathcal{L}_r + \lambda_c \cdot \mathcal{L}_c$.

After pretraining, we abandon the decoder and only keep the encoders of the model for downstream tasks. We can use the sum of the single-modality stream output and the multi-modal stream output, or just the multi-modal stream output, for fine-tuning. They perform similarly in our experiments.

Discussion: we next discuss the motivation of some key designs of CAV-MAE:

1. Multi-stream forward passes of the joint encoder. We find it important to restrict the representations used for contrastive audio-visual learning so that $c^a$ only comes from the audio input and $c^v$ only comes from the visual input; otherwise the contrastive objective will collapse. At the same time, we want the encoder to fuse the audio and visual information for the reconstruction task and downstream tasks. Therefore, we design the multi-stream forward pass strategy for CAV-MAE.

2. Modality-specific encoders and LN layers. While there are a few recent attempts (Akbari et al., 2021; Dai et al., 2022) to process audio and visual modalities with a unified network, due to the very different nature of audio and visual modalities, the general conclusion is that modality-specific networks are still optimal in terms of performance. Therefore, we choose to encode audio and visual inputs with modality-specific encoders before the joint encoder. For the same reason, we also use different normalization statistics for each stream of the joint encoder. Efficiency-wise, having two modality-specific encoders increases the model size, but lowers the computation, as the Transformer has a quadratic complexity w.r.t. the input sequence length.

3. Masked contrastive audio-visual learning. Unlike single-modality contrastive learning, conventional contrastive audio-visual learning does not typically apply augmentation or masking. In this work, we propose to use masked contrastive audio-visual learning, i.e., we randomly mask a portion of the input before conducting contrastive learning. This design not only allows us to combine CAV with AV-MAE, but also helps to avoid overfitting. In practice, when the masking ratio is 75% and the effective contrastive batch size is 27 (108 on 4 GPUs), the audio-visual matching accuracy during pretraining on the evaluation set is about 72%, which shows the task is neither trivial nor impossible. We discuss the impact of masking on contrastive learning in detail in Appendix F.
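The contrastive loss and the weighted sum of the two objectives can be sketched in plain Python as follows. This is a minimal illustration, not the released implementation: it assumes that the norm notation in $s_{i,j}$ means L2-normalizing each representation before taking the dot product, it covers only the visual-to-audio direction written in the loss, and the function names are hypothetical. The defaults $\tau = 0.05$ and $\lambda_c = 0.01$ are the values reported in Section 2.3.1.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (assumes the vector is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(c_v, c_a, tau=0.05):
    """Contrastive loss over N paired (visual, audio) representations.

    Row i of c_v is the positive match of row i of c_a; every other audio
    vector in the batch serves as a negative.
    """
    v = [l2_normalize(x) for x in c_v]
    a = [l2_normalize(x) for x in c_a]
    n = len(v)
    loss = 0.0
    for i in range(n):
        logits = [sum(p * q for p, q in zip(v[i], a[k])) / tau for k in range(n)]
        m = max(logits)  # subtract the max before exponentiating for stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss -= logits[i] - log_denominator
    return loss / n


def cav_mae_loss(l_r, l_c, lambda_c=0.01):
    """Total objective: reconstruction loss plus weighted contrastive loss."""
    return l_r + lambda_c * l_c


# Toy batch of 2 pairs with 2-dimensional representations (made-up numbers).
visual = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[1.0, 0.0], [0.0, 1.0]]
audio_swapped = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(visual, audio_aligned) < contrastive_loss(visual, audio_swapped))  # True
```

Correctly paired representations give a loss near zero, while swapped pairs are heavily penalized because of the small temperature.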

2.3.1. IMPLEMENTATION DETAILS

By default, all encoder Transformer layers are 768-dimensional and have 12 attention heads. The joint encoder of the Vanilla AV-MAE is a 12-layer Transformer. The audio and visual encoders of CAV-MAE are 11-layer Transformers (each is 768-dimensional) and the joint encoder is a single-layer Transformer. That is, we keep the total number of encoder layers of all models at 12, but CAV and CAV-MAE are larger models due to the modality-specific encoders. The decoders of AV-MAE and CAV-MAE are 8-layer Transformers with an embedding dimension of 512 and 16 attention heads. These settings are identical to the original vision MAE (He et al., 2022). We fix the contrastive loss temperature $\tau = 0.05$. For CAV-MAE, we use $\lambda_c = 0.01$. Note that the relatively small $\lambda_c$ is due to the scale of the gradient of $\mathcal{L}_c$ being larger than that of $\mathcal{L}_r$; it does not mean the contrastive objective is unimportant. The encoder and decoder of the default CAV-MAE model have about 164M and 27M parameters, respectively.

Following the common practice of audio-visual learning, we initialize the weights of all models with ImageNet pretrained weights. Specifically, we use the weights of the original vision MAE (He et al., 2022). Nevertheless, unlike previous work that uses supervised pretrained weights (e.g., Fayek & Kumar (2021) and Nagrani et al. (2021)), we only use the self-supervised pretrained weights (i.e., without fine-tuning), which does not lead to the best performance but makes our whole training pipeline self-supervised. The impact of the initialization strategy is discussed in detail in Appendix E.

3. SELF-SUPERVISED MODEL PRETRAINING

We pretrain and compare the performance of the following models:

1. Audio-MAE/Visual-MAE: Single-modal masked auto-encoder models. The model architecture is the same as Vanilla AV-MAE, but they are only pretrained with data of a single modality.

2. CAV: The contrastive audio-visual learning model that has no reconstruction objective. For a fair comparison, we implement CAV using the same encoder architecture (modal-specific encoders + joint encoder) as CAV-MAE but remove the reconstruction objective $\mathcal{L}_r$.

3. Vanilla AV-MAE: The vanilla audio-visual masked auto-encoder with a joint encoder and no contrastive objective, as described in Section 2.2.

4. AV-MAE: The audio-visual masked auto-encoder with two modal-specific encoders and a joint encoder. It has the same architecture as CAV-MAE, but $\lambda_c$ is set to 0 (no contrastive loss). We use this model to disentangle the impact of modal-specific encoders (when compared with Vanilla AV-MAE) and the contrastive objective (when compared with CAV-MAE).

5. CAV-MAE: Our proposed contrastive masked auto-encoder, as described in Section 2.3.

6. CAV-MAE scale+: The same model as CAV-MAE, but trained with a larger batch size of 108 (effective contrastive batch size of 27) and for more epochs (25). We train this model on our best GPUs.

For a fair comparison, all models (except CAV-MAE scale+) are pretrained with the same pipeline with a batch size of 48 for 12 epochs on the full AudioSet-2M. During pretraining, we intentionally do not use class balanced sampling as that implicitly leverages the label information. Our pretraining process (including the ImageNet pretrained weight initialization) is fully self-supervised. Please refer to Appendix B for all pretraining details.

4. AUDIO-VISUAL EVENT CLASSIFICATION

We evaluate the representation quality on the audio-visual event classification task, a major audio-visual learning benchmark. Specifically, we fine-tune the pretrained models on three datasets: 1) AudioSet-20K (20K samples, same domain as the pretraining data); 2) AudioSet-2M (2 million samples, same as the pretraining data); and 3) VGGSound (200K samples, different domain than the pretraining data), covering various downstream data volume and domain situations. In the fine-tuning stage, we only keep the encoder of the pretrained models and connect it to a randomly initialized linear classification head. To avoid overriding too much of the knowledge learned in pretraining, we use a smaller learning rate for the pretrained weights and a 10×-100× larger learning rate for the new classification head. We use the standard training pipeline used in prior audio-based and audio-visual event classification work (Gong et al., 2021a;b; Nagrani et al., 2021) with mixup (Zhang et al., 2018), balanced sampling, label smoothing, label enhancement (only for AudioSet-20K) and random time shifts. We fine-tune the model using audio-only data (A), video-only data (V), and audio-visual data (AV) to evaluate the single-modal and multi-modal representation quality. We show the results in Table 1. Key findings are as follows:

1. Contrastive learning and masked data modeling are complementary. While both AV-MAE (only with the masked data modeling objective) and CAV (only with the contrastive objective) perform better than ensembling two single-modal MAEs, the proposed CAV-MAE that combines the two objectives significantly boosts the performance (e.g., 2.0 and 3.1 mAP boost from CAV and AV-MAE on AudioSet-20K, respectively). Note that CAV-MAE, AV-MAE, and CAV have the same architecture during fine-tuning; the only difference is the objective in the pretraining stage. This demonstrates that the two major self-supervised learning frameworks are complementary in the context of audio-visual learning and that CAV-MAE is an effective way to combine their advantages.

2. CAV-MAE multi-modal pretraining improves single-modal performance. We find the CAV-MAE model pretrained with paired audio-visual data, when fine-tuned with only a single modality, performs noticeably better than Audio-MAE and Visual-MAE on single-modal classification tasks (e.g., 34.2→37.7 mAP for audio, 15.7→19.8 mAP for visual on AudioSet-20K). Note that for single-modal fine-tuning, CAV-MAE only keeps one branch and has the same architecture as Audio-MAE and Visual-MAE, so the performance improvement can only come from the use of multi-modal data during pretraining. We hypothesize this is due to the two modalities serving as soft labels for each other, providing richer information than the binary human-annotated labels. As a result, CAV-MAE achieves a new SOTA performance on audio-based event classification on AudioSet-20K (37.7 mAP) and VGGSound (59.5% accuracy), without supervised pretraining or industry-level computational resources.

3. Fully SSL pretrained CAV-MAE matches or outperforms SOTA models with significantly fewer computational resources. There are two major setting differences between this work and previous SOTA works. First, our pretraining is completely self-supervised so that our model can leverage web-scale unlabeled videos, while supervised ImageNet pretraining is commonly used in previous audio-visual works, e.g., in MBT (Nagrani et al., 2021). ImageNet labels are strong supervision signals that can directly impact the visual branch performance (see Table 11). As a result, our visual branch is worse than the SOTA models. Second, we pretrain and fine-tune the model with 4 GPUs (which also makes our work easy to reproduce), while most SOTA models are trained with industry-level resources (e.g., 32 TPUs for Perceiver (Jaegle et al., 2021), 64 GPUs for Audio-MAE (Huang et al., 2022a) and MBT), which brings many benefits such as large batch size (particularly useful for contrastive learning), multiple frames input (MBT uses 8 frames as input), and more training epochs (Audio-MAE pretrains the model for 32 epochs). Even with such setting differences, on the audio-visual event classification task, our CAV-MAE performs better than the best existing audio-visual model MBT on VGGSound (even when CAV-MAE is only pretrained on VGGSound, see

Ablation Studies: We conduct a series of ablation studies to show the impact of each design factor. For each study, we use CAV-MAE scale+ or CAV-MAE as the base model, change one factor at a time, and report the downstream classification performance of the model on AudioSet-20K or VGGSound.
Our findings are as follows: the weight of the contrastive loss $\lambda_c$ has a large impact on the performance, and a $\lambda_c$ that is too large or too small leads to a noticeable performance drop (Table 2a); scaling up the pretraining epochs and batch size consistently leads to a performance improvement (Table 2b and 2c); normalizing the prediction target only leads to a marginal performance improvement (Table 2d); when fine-tuning on VGGSound, pretraining with the larger out-of-domain AudioSet-2M is better than pretraining with the smaller in-domain VGGSound itself, but pretraining first on AudioSet-2M and then on VGGSound leads to the best result (Table 2e); during fine-tuning, using the output of the multi-modal stream of the encoder leads to better performance than using the concatenated single-modal stream outputs, and summing the outputs of the two streams generally leads to a similar result (Table 2f); when only one modality is of interest, it is better to fine-tune the model with single-modal data than to fine-tune the model with audio-visual data and do single-modality inference, although the performance gap is small for audio (Table 2g); the frame aggregation strategy boosts the performance without the need to input multiple frames simultaneously to the model (Table 2h); in the linear probe setting, CAV-MAE also noticeably outperforms the baselines (Table 2i). We also study the impact of model initialization, masking strategy, and frame rate in Appendix E, F, and G, respectively.

5. AUDIO-VISUAL RETRIEVAL

In the previous section, we show that CAV-MAE learns a good audio-visual joint representation that effectively fuses the unimodal signals for the audio-visual event classification task. Next, we study whether CAV-MAE also learns a good coordinated representation that captures audio-visual correspondences for audio-visual retrieval. Specifically, we uniformly sample a subset of 1,725 and 1,545 audio-visual samples from the AudioSet and VGGSound evaluation sets, respectively (about 10%), to keep the similarity matrix at a reasonable size. We input audio and image to each model in two independent forward passes and take the mean-pooled encoder outputs as the audio and visual representations, respectively. We then calculate the retrieval recall at rank 1, 5, and 10 (R@1, R@5, R@10) based on the cosine similarity of the audio and visual representations. All models are self-supervised pretrained but not fine-tuned. We show the quantitative results and samples of visual→audio retrieval in Table 3 and Figure 2, respectively. The results of audio→visual retrieval, more samples, and additional retrieval experiments on MSR-VTT (Xu et al., 2016) can be found in Appendix D.

We find a contrastive objective is necessary for the audio-visual retrieval task, as the performance of both Vanilla AV-MAE and AV-MAE is close to random guessing. Nevertheless, the cross-modal masked data modeling objective does not hurt, and in many cases improves, the retrieval performance, e.g., when $\lambda_c = 0.1$, CAV-MAE generally performs better than CAV. Scaling up the batch size and the number of training epochs also leads to better retrieval performance. When tested on a dataset different from the pretraining dataset (VGGSound), the retrieval performance is still competitive, indicating that the audio-visual correspondence transfers well in addition to the audio and visual representations. These results demonstrate that the contrastive and masked data modeling objectives do not conflict: a single pretrained CAV-MAE can be applied to both audio-visual fusion and correspondence tasks.
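The retrieval metric described above (recall at rank k based on cosine similarity) can be sketched as follows. This is a minimal illustration with made-up two-dimensional embeddings; the function names are hypothetical, and it assumes that query i and candidate i come from the same video.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose paired candidate (same index) ranks in the top k."""
    hits = 0
    for i, q in enumerate(queries):
        sims = [cosine(q, c) for c in candidates]
        ranked = sorted(range(len(candidates)), key=lambda j: sims[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(queries)


# Toy visual -> audio retrieval with 2 pairs (made-up embeddings).
visual = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.1], [0.1, 1.0]]
print(recall_at_k(visual, audio, k=1))  # 1.0, each query retrieves its own pair first
```

For audio→visual retrieval, the same function can be called with the two arguments swapped.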

6. RELATED WORK

Contrastive Audio-Visual Learning. The natural pairing of audio and visual information in videos has been a useful signal for learning audio-visual representations through self-supervision. Existing methods include knowledge distillation (Aytar et al., 2016; Owens et al., 2016), paired sample discrimination (Arandjelovic & Zisserman, 2017; Korbar et al., 2018; Owens & Efros, 2018), and contrastive learning (Morgado et al., 2021b). To improve contrastive learning, some recent methods sought to mine better negative samples (Ma et al., 2020; Morgado et al., 2021a), while others proposed additional data augmentation (Patrick et al., 2021; Wang et al., 2021) or using global and local video views (Zeng et al., 2021; Recasens et al., 2021). Our approach instead combines the contrastive loss with masked data modeling, which not only leads to an improvement in classification performance but also maintains the compelling ability of audio-visual retrieval (Arandjelovic & Zisserman, 2018; Rouditchenko et al., 2021).

Masked Auto-Encoder. Masked data modeling has a long history (Vincent et al., 2008) and has been applied in the visual and audio domains (Baevski et al., 2020; Hsu et al., 2021; Srivastava et al., 2022). Given the success of MAE in the vision domain (He et al., 2022; Bachmann et al., 2022; Girdhar et al., 2022; Tong et al., 2022; Feichtenhofer et al., 2022), several efforts adapt MAE for audio with relatively minor changes to the overall pipeline (Baade et al., 2022; Niizumi et al., 2022; Chong et al., 2022; Huang et al., 2022a). There are a few recent works investigating multi-modal MAE for vision & language multi-modal scenarios (Geng et al., 2022; Kwon et al., 2022), which inspired us to design an audio-visual MAE. To the best of our knowledge, our AV-MAE and CAV-MAE are the first audio-visual masked autoencoders. One closely related concurrent work is CMAE (Huang et al., 2022b), which also combines MAE and contrastive loss, but only for single-modal images. Our motivation and implementation are very different from CMAE, as we aim to leverage the unique audio-visual pair information and CAV-MAE features a multi-stream joint encoder design. Finally, while we take a modern approach with Transformers, multi-modal autoencoders were studied more than a decade ago with much simpler models and datasets (Ngiam et al., 2011).

7. CONCLUSION

In this paper, we introduce CAV-MAE, a novel audio-visual learning model. The main idea of this paper is simple: masked data modeling and contrastive learning are a pair of complementary frameworks that should be used together for audio-visual self-supervised learning. Effectively combining the two frameworks and avoiding representation collapse requires some careful design, such as the multi-stream forward pass strategy, joint-specific encoder architecture, and masked contrastive learning. From the perspective of representation learning, CAV-MAE learns a joint and coordinated representation and can be used for both the audio-visual joint event classification task and the audio-visual retrieval task. As a result, on the audio-visual event classification task, CAV-MAE matches or outperforms SOTA models with fully self-supervised pretraining and noticeably fewer computational resources; on the retrieval task, CAV-MAE is comparable to models trained with only the contrastive objective. Finally, CAV-MAE multi-modal pretraining also learns strong single-modal representations, which leads to a new SOTA performance on audio-based event classification.

Acknowledgments: This research is supported by the MIT-IBM Watson AI Lab.

A DATASET DETAILS

B TRAINING DETAILS

Our training hyper-parameters are listed in Table 4. Most of our experiments are run on 4×NVIDIA GTX Titan X Pascal GPUs with 12GB memory; only the scaled-up CAV-MAE scale+ is pretrained on 4×NVIDIA RTX A5000 GPUs with 24GB memory. This makes our results easier to reproduce with reasonable resources. Pretraining CAV-MAE takes about one week with 4 GPUs. Our model has a similar size to "base" MAE models, i.e., the full encoder and decoder model has ∼190M parameters (due to two modal-specific branches); the encoder used for audio-visual downstream tasks has ∼160M parameters; the encoder used for single-modal downstream tasks has ∼85M parameters.

C AUDIO-VISUAL ACTION RECOGNITION EXPERIMENTS

In addition to the audio-visual event classification task on AudioSet and VGGSound, we also test our models on the audio-visual action recognition task. One problem with existing audio-visual action recognition datasets is that they are usually visual-heavy and dominated by the performance of the visual branch. Therefore, to test our audio-visual model, we conduct experiments on Kinetics-Sounds (Arandjelovic & Zisserman, 2017), a subset of the Kinetics-400 dataset (Kay et al., 2017) with 32 human action classes that have been chosen to be potentially manifested both visually and aurally.

We conduct two experiments on Kinetics-Sounds. First, we pretrain and fine-tune CAV, AV-MAE, and CAV-MAE using the Kinetics-Sounds training set and report the top-1 accuracy on the Kinetics-Sounds validation set (i.e., no AudioSet pretraining). This checks whether CAV-MAE still outperforms its counterparts on the audio-visual action recognition task. As shown in Table 5, the conclusion on Kinetics-Sounds is consistent with that on AudioSet and VGGSound, i.e., CAV-MAE performs better than both CAV and AV-MAE.

Second, we compare CAV-MAE models with the SOTA MBT model (Nagrani et al., 2021) following the protocol of MBT. Specifically, we train the model on the Kinetics-400 (K400) dataset and report the top-1 accuracy on Kinetics-Sounds. We find that the label set used impacts the accuracy, and this setting is not clear in the MBT paper. Therefore, we report the results on both the Kinetics-400 label set (i.e., predictions are not restricted to the 32 Kinetics-Sounds classes) and the Kinetics-Sounds label set (i.e., predictions are restricted to the 32 Kinetics-Sounds classes). As shown in Table 6, CAV-MAE matches or outperforms MBT on Kinetics-Sounds in a fully self-supervised learning (SSL) setting.

We show audio to visual retrieval results on AudioSet and VGGSound (zero-shot) in Table 7.

Table 7: Audio to visual retrieval results on AudioSet and VGGSound.
Table 7 column headers: Audio→Visual Retrieval; AudioSet Eval Subset (R@1, R@5, R@10); VGGSound Eval Subset (R@1, R@5, R@10).

We show bi-directional zero-shot VGGSound retrieval samples in Figure 7 and Figure 8.

Table 8: Audio-visual bi-directional retrieval results on the MSR-VTT dataset. All models, including the baseline models, are initialized with ImageNet weights and trained with only MSR-VTT data. Our CAV and CAV-MAE models outperform existing methods in both directions. In addition, comparing CAV and CAV-MAE, we again find that the MAE training objective does not hurt, and can even improve, the retrieval performance.

           Audio→Visual          Visual→Audio
           R@1   R@5   R@10      R@1   R@5   R@10
Random     0.1   0.5   1         0.1   0.5   1

We also conduct audio-visual retrieval experiments on MSR-VTT (Xu et al., 2016) and compare our models with existing works. Specifically, we conduct two sets of experiments.

First, we train CAV and CAV-MAE models on the MSR-VTT training set and evaluate them on the MSR-VTT test set. Note that the models are not pretrained on AudioSet. We then compare the retrieval performance with existing works in the same training setting. As shown in Table 8, our CAV and CAV-MAE models outperform existing methods in both directions. In addition, comparing CAV and CAV-MAE, we again find that the MAE training objective does not hurt, and can even improve, the retrieval performance.

Second, we conduct a zero-shot retrieval experiment on MSR-VTT. Specifically, we take the AudioSet pretrained models and directly evaluate them on the MSR-VTT test set; the MSR-VTT training set is not used. We then compare our models with existing models. As shown in Table 9, our CAV-MAE model achieves results similar to existing methods for visual-audio retrieval, but worse results for the audio-visual direction. However, existing methods are trained with the 100M HowTo100M dataset, while our models are only trained with the 2M AudioSet dataset, i.e., with less than 2% of the training data. Again, CAV-MAE models still have similar or better results compared with CAV models when λ c is the same, demonstrating that the MAE and contrastive objectives do not conflict.
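The R@K numbers reported for retrieval can be illustrated with a minimal sketch. It assumes that each query (e.g., an audio clip) has exactly one correct key (its paired video) at the same index, and that candidates are ranked by cosine similarity; the function names and the toy vectors are hypothetical and are not taken from the released code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def recall_at_k(queries, keys, k):
    """Fraction of queries whose paired key (same index) ranks in the top k."""
    hits = 0
    for i, q in enumerate(queries):
        sims = [cosine(q, key) for key in keys]
        ranked = sorted(range(len(keys)), key=lambda j: sims[j], reverse=True)
        hits += int(i in ranked[:k])
    return hits / len(queries)

# Toy clip-level representations (hypothetical values, 2-D for readability).
audio = [[0.9, 0.1], [0.6, 0.8], [0.7, 0.7]]
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

r_at_1 = recall_at_k(audio, video, k=1)  # audio -> visual direction
r_at_2 = recall_at_k(audio, video, k=2)
```

Swapping the two arguments gives the visual→audio direction. In this toy example the second audio clip ranks its paired video second, so it counts as a hit for K=2 but not for K=1.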

E IMPACT OF MODEL INITIALIZATION

Existing audio-visual models typically use (supervised) ImageNet pretrained weights to initialize the model. Throughout the paper, we always initialize our models (including CAV, AV-MAE, and CAV-MAE) with self-supervised ImageNet pretrained weights. Specifically, we use the weights from the original vision MAE model (He et al., 2022) (weights from https://github.com/facebookresearch/mae), obtained with only self-supervised learning (SSL) pretraining, for the audio, visual, and joint encoders and the decoder. This is implemented by duplicating the weights of MAE encoder layers 1-11 for the audio and visual encoders, respectively, and the weights of MAE encoder layer 12 for the joint encoder.

How important is this initialization? We conduct experiments with various model initialization and pretraining settings. As shown in Table 10, we find that ImageNet initialization always leads to a performance improvement, in both the fine-tuning and linear probing tests, and that this improvement decreases with a larger in-domain pretraining dataset; e.g., without ImageNet initialization, CAV-MAE performs just 1.0% mAP lower on AudioSet-2M. Therefore, ImageNet initialization is not an indispensable component of the proposed CAV-MAE pretraining framework.

Finally, we quantify the difference between initializing the model with ImageNet SSL pretrained weights and ImageNet SL pretrained weights on the downstream task. As shown in Table 11, on AudioSet-20K, using SL weights leads to a 3.7% improvement over using SSL weights in the fine-tuning setting (but interestingly, in the linear probing setting, SL weights lead to worse results). Therefore, directly comparing our fully self-supervised model with existing models with a supervised pretraining component is not exactly fair.

Table 11: Most existing audio-visual models initialize their weights with ImageNet supervised pretrained weights (e.g., Nagrani et al. (2021); Rouditchenko et al. (2021)), while we intend to build a fully self-supervised model to avoid using any labels. We compare the AudioSet-20K performance of models initialized with ImageNet supervised pretrained (SL) weights and ImageNet self-supervised pretrained (SSL) weights. The SL weights and SSL weights are from the original MAE models with and without supervised ImageNet fine-tuning (He et al., 2022), respectively. Since the SL weights only contain the weights of the MAE encoder and cannot be used for further SSL pretraining, we directly fine-tune/linear probe the two models on AudioSet-20K (i.e., no in-domain pretraining) and report the results to make a fair comparison. We observe that initializing the model with SL weights leads to a noticeable advantage for fine-tuning, showing that the ImageNet labels are still very valuable supervision signals. This also indicates that directly comparing our fully self-supervised model with existing models with a supervised pretraining component is not exactly fair.

Niizumi et al. (2022). However, it is unclear if such a high masking ratio is also appropriate for the contrastive objective. In particular, aggressive augmentation is not commonly used in audio-visual contrastive learning. Therefore, we conduct experiments to check the impact of the training masking ratio on the audio-visual joint event classification task and the audio-visual retrieval task.

For the audio-visual joint event classification task, as shown in Table 12, we find that the CAV model does perform slightly better with a smaller masking ratio (50%), but the difference is minor. When the masking ratio is 75%, CAV still performs well. This shows that the audio-visual joint classification task is not sensitive to the masking ratio. For the audio-visual retrieval task, as shown in Table 13, we find that the audio-visual retrieval performance decreases with a higher masking ratio, particularly when the masking ratio is very high.
If audio-visual retrieval is the main task of interest, a lower masking ratio should be used in training, which does not hurt the audio-visual joint event classification task but requires more computation. In Section 5 and Appendix D, we show that CAV-MAE is already a strong audio-visual retrieval model when the masking ratio is 75%; the performance can be further improved by lowering the masking ratio. Note that this result does not conflict with the fact that the reconstruction objective does not hurt, and in many cases improves, the retrieval performance.

Table 12: Audio-visual joint event classification performance of CAV, AV-MAE, and CAV-MAE as a function of masking ratio on AudioSet-20K and VGGSound. All models are pretrained with uniform unstructured masking. We find that the contrastive learning model CAV performs slightly better with a lower masking ratio, while the AV-MAE model performs best with a ∼75% masking ratio. These results show that a 65%∼75% masking ratio works well for both the contrastive learning and masked data modeling frameworks for the downstream audio-visual joint event classification task.

Table 13: Zero-shot audio-visual retrieval performance of CAV-MAE (λ c = 0.01) as a function of masking ratio on the VGGSound evaluation subset. All models are pretrained with uniform unstructured masking. The audio-visual retrieval performance decreases with a higher masking ratio.

Table 13 column headers: Masking Ratio; Audio→Visual (R@1, R@5, R@10); Visual→Audio (R@1, R@5, R@10).

Another key design choice is the masking strategy. Throughout the paper, we use a uniform, unstructured masking strategy for both audio and visual input. However, unlike the visual modality, the two dimensions of audio spectrograms are heterogeneous. In this section, we explore the impact of masking strategies for audio input. Specifically, we apply time, frequency, and time-frequency masking strategies (depicted in Figure 3) and compare them with the uniform unstructured masking strategy (i.e., uniform masking).
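The four audio masking strategies can be sketched on a grid of spectrogram patches. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes a 64×8 (time × frequency) patch grid, that time and frequency masking remove whole columns and rows of patches, and that time-frequency masking is the union of the two with an equal per-axis ratio p chosen so that 1 - (1 - p)^2 equals the target ratio. The function names are hypothetical.

```python
import math
import random

def make_mask(n_time, n_freq, ratio, strategy, rng):
    """Return an n_time x n_freq grid of booleans; True marks a masked patch."""
    if strategy not in ("uniform", "time", "frequency", "time-frequency"):
        raise ValueError(f"unknown strategy: {strategy}")
    mask = [[False] * n_freq for _ in range(n_time)]
    if strategy == "uniform":
        # Unstructured: mask individual patches anywhere on the grid.
        cells = [(t, f) for t in range(n_time) for f in range(n_freq)]
        for t, f in rng.sample(cells, round(ratio * len(cells))):
            mask[t][f] = True
        return mask
    # Assumption: for the combined strategy, use the same per-axis ratio p
    # on both axes, with 1 - (1 - p)^2 = ratio.
    p = 1 - math.sqrt(1 - ratio) if strategy == "time-frequency" else ratio
    if strategy in ("time", "time-frequency"):
        # Mask entire time steps (all frequency patches at that step).
        for t in rng.sample(range(n_time), round(p * n_time)):
            mask[t] = [True] * n_freq
    if strategy in ("frequency", "time-frequency"):
        # Mask entire frequency bands (all time steps in that band).
        for f in rng.sample(range(n_freq), round(p * n_freq)):
            for t in range(n_time):
                mask[t][f] = True
    return mask

def masked_fraction(mask):
    """Fraction of patches that are masked."""
    return sum(map(sum, mask)) / (len(mask) * len(mask[0]))

rng = random.Random(0)
fractions = {
    s: masked_fraction(make_mask(64, 8, 0.75, s, rng))
    for s in ("uniform", "time", "frequency", "time-frequency")
}
```

With this grid size and a 75% target, all four strategies mask the same number of patches, so any difference between them comes from the structure of the mask rather than from its size.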
For the audio-visual joint event classification task, as shown in Table 14, we find that all four training masking strategies lead to similar performance when the training masking ratio is 75%. However, as we show in Figure 5, structured masking strategies make reconstruction more challenging. Therefore, we also pretrain a CAV-MAE model with time-frequency masking at a lower masking ratio of 50%, which shows slightly better performance on both AudioSet-20K and VGGSound. In general, the audio-visual joint classification task is not sensitive to the masking strategy.

For the audio-visual retrieval task, as shown in Table 15, with the same 75% masking ratio, different masking strategies lead to noticeably different retrieval performance. Frequency and time-frequency masking lead to the best retrieval performance, while unstructured uniform masking actually leads to the worst retrieval performance. In Section 5 and Appendix D, we show that CAV-MAE is already a strong audio-visual retrieval model when uniform masking is used; the performance can be further improved by using a structured masking strategy, which also does not hurt the audio-visual joint event classification.

To summarize, we find that both the masking ratio and the masking strategy have a minor impact on the downstream audio-visual joint event classification task, but a noticeable impact on the audio-visual retrieval task. Specifically, there exist masking strategies that lead to better retrieval performance than the default 75% uniform masking strategy. Finally, we also notice that the training masking strategy impacts the model's reconstruction ability, which is discussed in Section H.2.

Table 14: Audio-visual joint event classification performance of CAV-MAE as a function of training masking strategy and ratio on AudioSet-20K and VGGSound. We find that all four training masking strategies lead to similar performance when the training masking ratio is 75%. However, as we show in Figure 5, structured masking strategies make reconstruction more challenging. Therefore, we also pretrain a CAV-MAE model with time-frequency masking at a lower masking ratio of 50%, which shows slightly better performance on both AudioSet-20K and VGGSound.

G IMPACT OF THE NUMBER OF FRAMES USED

In the paper, we sample 10 frames for each 10-second video clip (1 FPS). How does the frame rate impact the performance? As shown in Figure 4, on Kinetics-Sounds, AudioSet-20K, and VGGSound, a higher FPS consistently improves the downstream classification performance; however, the improvement saturates as the number of frames increases.

H.1 AUDIO-VISUAL RECONSTRUCTION SAMPLES

We show the CAV-MAE reconstruction samples in Figures 9, 10, and 11. All samples are from VGGSound, a different dataset from the pretraining set. The CAV-MAE model is trained with a 75% masking ratio without target normalization. As shown in Table 2d, it has a similar performance to the default model with target normalization. CAV-MAE has strong reconstruction ability even when the masking ratio goes to 90%, which means it could potentially be used for in-painting and enhancement tasks. All inference masks are sampled uniformly (i.e., unstructured masking).

H.2 AUDIO SPECTROGRAM RECONSTRUCTION UNDER VARIOUS INFERENCE MASKING SETTINGS

Besides the uniform masking samples shown in the previous section, we also show audio spectrogram reconstruction samples under various structured inference masking settings in Figure 12 (75% masking ratio) and Figure 13 (90% masking ratio). We find that structured masking is more challenging for reconstruction, as the mean squared errors are generally higher. On average, time masking is the most difficult, followed by frequency masking, time-frequency masking, and uniform unstructured masking. This also indicates that the model leverages information from local neighboring unmasked parts to infer the masked part. When an entire time or frequency span is masked, reconstruction is harder for the model (this is quantified in Figure 5). Finally, in Figure 12 and Figure 13, we also compare the reconstruction ability of a CAV-MAE model trained with the uniform, unstructured masking strategy and a CAV-MAE model trained with the time-frequency masking strategy (both with a 75% masking ratio). We quantify the difference in Figure 5.

Figure 5: Audio spectrogram reconstruction mean squared error (MSE) as a function of masking ratio under various inference masking settings (from left to right: time masking, frequency masking, time-frequency masking, and uniform unstructured masking). We compare a CAV-MAE model trained with uniform masking (blue) and a CAV-MAE model trained with time-frequency masking (red). Both models are trained with a 75% masking ratio. Key findings are as follows. First, even for the same masking ratio, the reconstruction difficulty is different for each masking strategy. On average, time masking is the most difficult, followed by frequency masking, time-frequency masking, and uniform unstructured masking. This indicates that CAV-MAE models require local information for the reconstruction task. However, for each specific spectrogram, the order of difficulty varies (see Figures 12 and 13). Second, the CAV-MAE model trained with time-frequency masking generally performs better than its counterpart trained with uniform masking in audio spectrogram reconstruction, particularly for the time masking and frequency masking settings, showing that it is stronger in leveraging global information. This indicates that different training masking strategies do impact the properties of the model.

I CAV-MAE VISUAL SOUND SOURCE LOCALIZATION RESULTS

We evaluate the capability of CAV-MAE (uniform masking, masking ratio = 75%, λ c = 0.01) on the visual sound source localization task with a basic similarity-based method. Specifically, for each audio-image pair, we mean pool the representations of all audio tokens as the clip-level audio representation, and then calculate the cosine similarity between the clip-level audio representation and all patch-level image representations as the visual sound source localization heat map. In general, we find that the CAV-MAE model is not a strong visual sound source localization model, even though its audio-visual retrieval performance is good. In Figure 6, we show a successful sample (left) and a failed sample (right). In some cases, CAV-MAE localizes the sound to the background instead of the main sound source object. We hypothesize that this is due to the masked contrastive learning objective. During training, the model needs to match positive audio-visual pairs even when both modalities are heavily masked. In some situations, the main sound source could be completely masked, so the model learns to leverage context information for the matching, which may hurt its performance on the visual sound source localization task.
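The similarity-based localization method described above can be sketched in a few lines. This is a minimal illustration in plain Python: the token vectors, the 2×2 patch grid, and the function names are hypothetical, and real token representations would come from the audio and visual encoders.

```python
import math

def mean_pool(tokens):
    """Average a list of token vectors into one clip-level vector."""
    n = len(tokens)
    return [sum(tok[d] for tok in tokens) / n for d in range(len(tokens[0]))]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def localization_heatmap(audio_tokens, image_patch_tokens, grid_h, grid_w):
    """Cosine similarity between the pooled audio vector and each image patch,
    arranged as a grid_h x grid_w heat map (patches in row-major order)."""
    clip_audio = mean_pool(audio_tokens)
    sims = [cosine(clip_audio, patch) for patch in image_patch_tokens]
    return [sims[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]

# Toy example: 2 audio tokens and a 2x2 grid of image patch tokens.
audio_tokens = [[1.0, 0.0], [1.0, 0.2]]
patches = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
heatmap = localization_heatmap(audio_tokens, patches, grid_h=2, grid_w=2)
```

The patch whose representation points in the same direction as the pooled audio vector receives the highest score; in practice the heat map would be upsampled to the image resolution for visualization.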



Multi-modal representations can be divided into two categories: joint representations that combine the unimodal signals into the same representation space, and coordinated representations that process unimodal signals separately, but enforce certain similarity constraints on them (Baltrušaitis et al., 2018).

The original Kinetics-Sounds dataset consists of classes from an early version of the Kinetics-400 label set. We contacted the authors and use the 32-class label set defined in (Xiao et al., 2020) for our experiments.



Figure 2: Sample retrieval results.

Figure 3: Illustration of various masking strategies. We use uniform unstructured masking throughout the paper except in Section F.

Interestingly, we find that the CAV-MAE model trained with time-frequency masking generally performs better than its counterpart trained with uniform masking in audio spectrogram reconstruction, particularly for the time masking and frequency masking settings, showing that it is stronger in leveraging global information. This indicates that different training masking strategies do impact the properties of the model. While the training masking strategy has only a minor impact on the downstream classification task, it has a relatively large impact on reconstruction.

Figure 6: A successful sample (left) and a failed sample (right) of CAV-MAE on the visual sound source localization task. In some cases, CAV-MAE localizes the sound to the background instead of the main sound source object.

Figure 7: Zero-shot audio to image retrieval results on VGGSound. Since the spectrograms are hard to read, we show their paired images in the dashed boxes for visualization purposes; only audios are used as queries.

Figure 8: Zero-shot image to audio retrieval results on VGGSound. Since the spectrograms are hard to read, we show their paired images in the dashed boxes for visualization purposes; only audios are used as keys.

Figure 9: CAV-MAE reconstruction samples when 50% of the input is masked. Samples are from VGGSound, a different dataset from the pretraining dataset. The model is pretrained on AudioSet with a 75% masking ratio without target normalization.

Figure 10: CAV-MAE reconstruction samples when 75% of the input is masked. Samples are from VGGSound, a different dataset from the pretraining dataset. The model is pretrained on AudioSet with a 75% masking ratio without target normalization.

Figure 11: CAV-MAE reconstruction samples when 90% of the input is masked. Samples are from VGGSound, a different dataset from the pretraining dataset. The model is pretrained on AudioSet with a 75% masking ratio without target normalization.

..., x_n], MAE masks a portion of the input, x_mask, and only inputs the unmasked tokens x \ x_mask to a Transformer-based encoder-decoder model. The model is asked to reconstruct the masked tokens with the goal of minimizing the mean squared error (MSE) loss. During this process, the model learns a meaningful representation of the input data. The advantages of MAE are manifold. First, MAE directly uses the original input as the prediction target, which greatly simplifies the training pipeline. Second, MAE only inputs unmasked tokens to the encoder; combined with a high masking ratio, this noticeably lowers the computational overhead. Third, MAE has demonstrated strong performance in single-modal tasks for both audio and visual modalities. Due to the space limitation, please refer to He et al. (2022); Huang et al. (2022a) for single-modal MAEs.
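The masking and loss computation described above can be sketched as follows. This is a minimal illustration in plain Python, not the actual model: the encoder and decoder are replaced by a stand-in prediction, the helper names are hypothetical, and the loss is assumed to be averaged over masked tokens only.

```python
import random

def random_masking(tokens, mask_ratio, rng):
    """Split token indices into kept and masked sets; return the visible tokens."""
    n = len(tokens)
    n_keep = n - round(mask_ratio * n)
    order = list(range(n))
    rng.shuffle(order)
    keep_idx = sorted(order[:n_keep])
    masked_idx = sorted(order[n_keep:])
    visible = [tokens[i] for i in keep_idx]  # only these would enter the encoder
    return visible, keep_idx, masked_idx

def masked_mse(pred, target, masked_idx):
    """Mean squared error computed over the masked tokens only."""
    errs = [(p - t) ** 2
            for i in masked_idx
            for p, t in zip(pred[i], target[i])]
    return sum(errs) / len(errs)

rng = random.Random(0)
tokens = [[float(i), float(i) + 0.5] for i in range(8)]  # 8 toy tokens, dim 2
visible, keep_idx, masked_idx = random_masking(tokens, 0.75, rng)

# Stand-in for the decoder output: a full-length sequence of predictions.
pred = [[0.0, 0.0] for _ in tokens]
loss = masked_mse(pred, tokens, masked_idx)
```

With a 75% masking ratio, the encoder in this toy example sees only 2 of the 8 tokens, which is where the computational saving comes from.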

Comparing audio-visual classification performance on AudioSet and VGGSound. IN SL=ImageNet supervised learning; SSL=self-supervised learning; † Industry-level computation. Nonstandard data split; ens Ensemble of single-modal models. We bold the best methods without supervised pretraining, and underline the overall best methods.



Ablation studies on audio-visual classification. MM=multi-modal, SM=single-modal.

Retrieval results on AudioSet and VGGSound.

Our pre-training and fine-tuning hyperparameters.

SL) model and self-supervised learning (SSL) model initialization, please see Table 11.

Comparison of CAV, AV-MAE, and CAV-MAE models on Kinetics-Sounds. For each model, we pretrain and fine-tune it using the Kinetics-Sounds training set and report the Top-1 accuracy on the Kinetics-Sounds validation set. The conclusion is consistent with our AudioSet and VGGSound experiments that CAV-MAE outperforms both CAV and AV-MAE.

Comparison of CAV-MAE models with the SOTA MBT model (Nagrani et al., 2021) on Kinetics-Sounds. Following the protocol of MBT, we train the model on the Kinetics-400 (K400) dataset and report the top-1 accuracy on Kinetics-Sounds. We report the results on both the Kinetics-400 label set (i.e., predictions are not restricted to the 32 Kinetics-Sounds classes) and the Kinetics-Sounds label set (i.e., predictions are restricted to the 32 Kinetics-Sounds classes). Our CAV-MAE matches or outperforms MBT on Kinetics-Sounds with a fully self-supervised learning (SSL) setting.
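The effect of the label set on top-1 accuracy can be illustrated with a small sketch. It assumes that "restricting predictions" means taking the argmax only over the logits of the subset classes; the function name and the toy logits are hypothetical.

```python
def top1_accuracy(logits, labels, allowed=None):
    """Top-1 accuracy; if `allowed` is given, the argmax is taken only over
    those class ids (the restricted label set)."""
    candidates = list(allowed) if allowed is not None else None
    correct = 0
    for row, label in zip(logits, labels):
        ids = candidates if candidates is not None else range(len(row))
        pred = max(ids, key=lambda c: row[c])
        correct += int(pred == label)
    return correct / len(labels)

# Toy example: a 5-class "full" label set in which classes {0, 2} form the subset.
logits = [
    [0.1, 0.9, 0.3, 0.0, 0.0],  # true class 2, but out-of-subset class 1 scores highest
    [0.8, 0.1, 0.2, 0.0, 0.0],  # true class 0
]
labels = [2, 0]

full_acc = top1_accuracy(logits, labels)                        # full label set
restricted_acc = top1_accuracy(logits, labels, allowed=[0, 2])  # subset only
```

In this toy case the first sample is misclassified under the full label set because a class outside the subset has the highest score, but is classified correctly once predictions are restricted, which is why the two settings can report different accuracies for the same model.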

Zero-shot audio-visual bi-directional retrieval results on MSR-VTT dataset. Existing methods are trained with the 100M HowTo100M dataset, while our models are only trained with the 2M AudioSet dataset. With less than 2% of pretraining data, our CAV-MAE model achieves similar results for visual-audio retrieval performance with existing methods. Again, CAV-MAE models have similar or better results compared with CAV models when λ c is the same.

CAV-MAE model performance with various model initialization and pretraining settings on AudioSet-20K, VGGSound, and AudioSet-2M. We report both end-to-end fine-tuning and linear probing results. Initializing CAV-MAE with ImageNet pretrained weights consistently improves the model performance, but is not an indispensable component. Without ImageNet initialization, CAV-MAE performs just 1.0% mAP lower on AudioSet-2M.
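The initialization scheme described in Appendix E (MAE encoder layers 1-11 duplicated for the audio and visual encoders, layer 12 used for the joint encoder) can be sketched as a remapping of a weight dictionary. The key pattern `blocks.<index>.<name>` with zero-based indices, the function name, and the toy weights are assumptions made for illustration; the actual checkpoint layout may differ.

```python
import copy

def init_cav_mae_encoders(mae_state, num_layers=12, num_modality_layers=11):
    """Split MAE encoder block weights into audio, visual, and joint encoders.

    Assumes keys of the form 'blocks.<layer>.<param>' with zero-based layer
    indices (a hypothetical layout used only for this sketch).
    """
    audio, visual, joint = {}, {}, {}
    for key, value in mae_state.items():
        parts = key.split(".")
        if parts[0] != "blocks":
            continue  # other weights (e.g., patch embedding) are out of scope here
        layer = int(parts[1])
        suffix = ".".join(parts[2:])
        if layer < num_modality_layers:
            # Layers 1-11 (indices 0-10): independent copies for each modality.
            audio[f"blocks.{layer}.{suffix}"] = copy.deepcopy(value)
            visual[f"blocks.{layer}.{suffix}"] = copy.deepcopy(value)
        elif layer < num_layers:
            # Layer 12 (index 11): becomes the first joint encoder layer.
            joint[f"blocks.{layer - num_modality_layers}.{suffix}"] = copy.deepcopy(value)
    return audio, visual, joint

# Toy stand-in for a 12-layer MAE encoder checkpoint.
mae_state = {f"blocks.{i}.attn.weight": [float(i)] for i in range(12)}
audio, visual, joint = init_cav_mae_encoders(mae_state)
```

Deep copies are used so that the audio and visual branches start from identical values but can diverge during pretraining.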

Most existing audio-visual models initialize their weights with ImageNet supervised pretrained weights (e.g., Nagrani et al. (2021); Rouditchenko et al. (

Zero-shot audio-visual retrieval performance of CAV-MAE (λ c = 0.01) as a function of training masking strategy on VGGSound evaluation subset. All models are trained with a masking ratio of 75% on AudioSet. The masking strategy has a noticeable impact on retrieval performance.

Audio spectrogram reconstruction mean squared error (MSE) as a function of masking ratio under various inference masking settings (from left to right: time masking, frequency masking, time-frequency masking, and uniform unstructured masking). We compare a CAV-MAE model trained with uniform masking (blue) and a CAV-MAE model trained with time-frequency masking

Comparing the audio-visual joint event classification performance of models trained with AudioSet with original audio-visual pairs and AudioSet with randomly shuffled audio-visual pairs.

ETHICS STATEMENT

The data used in this paper are publicly available YouTube videos; we do not use videos that have been removed by the user. The proposed audio-visual model can be applied in a wide range of areas, including security-related applications. However, it can also be used for malicious purposes such as surveillance. We are committed to distributing our code and model carefully.

REPRODUCIBILITY STATEMENT

We document all implementation details in Section 2.3.1 and Appendix B. Code and pretrained models are available at https://github.com/yuangongnd/cav-mae. 

