CONTRASTIVE AUDIO-VISUAL MASKED AUTOENCODER

Abstract

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on AudioSet in the audio-visual event classification task. Code and pretrained models are at https://github.com/yuangongnd/cav-mae.

1. INTRODUCTION

Acoustic and visual modalities have different properties, yet humans are able to seamlessly connect and integrate them to perceive the world. Developing learning algorithms that replicate these abilities, especially for multi-modal audio-visual fusion and retrieval, is of great interest. Since manually annotating audio and video is expensive and difficult to scale, utilizing web-scale unlabeled video data in a self-supervised manner has become a core research question. One major line of audio-visual self-supervised learning research leverages the natural audio-visual correspondences found in videos. Among the numerous ways to use such correspondences, Contrastive Audio-Visual Learning has been shown to be a simple yet effective approach (Arandjelovic & Zisserman, 2018; Morgado et al., 2021b; Rouditchenko et al., 2021). It learns coordinated¹ representations that are closer for paired audio and visual samples than for mismatched samples. Such coordinated representations are particularly useful for tasks such as cross-modal retrieval. Another widely used self-supervised learning framework is Masked Data Modeling (MDM), which learns a meaningful representation through the pretext task of recovering the original inputs or features from corrupted ones (Devlin et al., 2019). In particular, based on the Audio Spectrogram Transformer (Gong et al., 2021a) and Vision Transformer (Dosovitskiy et al., 2020) backbones, the single-modal Masked Auto-Encoder (MAE) (He et al., 2022) achieved state-of-the-art (SOTA) performance on image and audio tasks (Huang et al., 2022a) individually. Inspired by these advances, we propose to extend the single-modal MAE to an Audio-Visual Masked Auto-Encoder (AV-MAE), aiming to learn a joint representation that fuses the unimodal signals. Although these two major self-supervised frameworks have been widely used individually, to the best of our knowledge, they have never been combined in audio-visual learning.
In fact, we find they are complementary: contrastive audio-visual learning explicitly leverages the very useful audio-visual pair information, but it may discard modality-unique information that is useful in downstream tasks; the reconstruction task of AV-MAE forces its representation to encode the majority of the input information in the fused representation, but it lacks an explicit audio-visual correspondence objective. This motivates us to design the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which integrates contrastive learning and masked data modeling to learn a joint and coordinated audio-visual representation with a single model. Our experiments support our design: on audio-visual event classification, CAV-MAE significantly outperforms baseline models trained with only the contrastive or only the masked data modeling objective, demonstrating that the two objectives are complementary in learning a strong joint audio-visual representation. As a result, CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound and is comparable with the previous best supervised pretrained model on AudioSet. Moreover, on audio-visual retrieval, CAV-MAE performs as well as or better than models trained with only the contrastive objective, which demonstrates that CAV-MAE can learn both a joint and a coordinated representation well. Finally, CAV-MAE multi-modal pretraining also improves single-modal performance; as a consequence, CAV-MAE achieves a new SOTA for audio-based event classification on AudioSet-20K and VGGSound.

In summary, our contributions are: (1) we extend the single-modal MAE to the multi-modal AV-MAE, which fuses audio-visual inputs for self-supervised learning through cross-modal masked data modeling; (2) more importantly, we investigate how to best combine contrastive audio-visual learning with masked data modeling and propose CAV-MAE; and (3) we demonstrate that the contrastive and masked data modeling objectives are complementary; as a result, CAV-MAE matches or outperforms SOTA models on audio-visual classification.

2. CONTRASTIVE AUDIO-VISUAL MASKED AUTOENCODER

2.1 PRELIMINARIES

2.1.1. AUDIO AND IMAGE PRE-PROCESSING AND TOKENIZATION

As depicted in Figure 1 (A), we follow the pre-processing and tokenization of AST (Gong et al., 2021a) and ViT (Dosovitskiy et al., 2020) for audio and image inputs, respectively. Specifically, we use 10-second videos (with their parallel audio tracks) from AudioSet (Gemmeke et al., 2017) and VGGSound (Chen et al., 2020) to pretrain and fine-tune the model. For audio, each 10-second waveform is first converted to a sequence of 128-dimensional log Mel filterbank (fbank) features, computed with a 25 ms Hanning window every 10 ms. This results in a 1024 (time) × 128 (frequency) spectrogram. We then split the spectrogram into 512 square patches of size 16 × 16, a = [a_1, ..., a_512], as the input to the model. Processing video with Transformer models is expensive and typically requires industrial-level computational resources. To lower the computational overhead and fit our resources, we use a frame aggregation strategy. Specifically, we uniformly sample 10 RGB frames from each 10-second video (i.e., 1 FPS). During training, we randomly select one RGB frame as the input; during inference, we average the model predictions over the RGB frames to obtain the video prediction. Compared with concatenating multiple RGB frames as the input to a Transformer, which has quadratic complexity (e.g., in Nagrani et al. (2021)), frame aggregation is much more efficient, with complexity linear in time, at the cost of not considering inter-frame correlation. For each RGB frame, we resize and center crop it to 224 × 224, and then split it into 196 square patches of size 16 × 16, v = [v_1, ..., v_196].
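The patch counts above follow directly from the input sizes: a 1024 × 128 spectrogram yields 64 × 8 = 512 patches, and a 224 × 224 frame yields 14 × 14 = 196 patches. The following is a minimal NumPy sketch of this tokenization (an illustrative helper, not the authors' code; the real model additionally projects each patch to an embedding):

```python
import numpy as np

def patchify(x, patch=16):
    """Split a 2-D array into non-overlapping patch x patch squares,
    returned as a sequence of flattened patch tokens."""
    h, w = x.shape
    assert h % patch == 0 and w % patch == 0
    return (x.reshape(h // patch, patch, w // patch, patch)
             .transpose(0, 2, 1, 3)
             .reshape(-1, patch * patch))

# Audio: 1024 (time) x 128 (frequency) spectrogram -> 512 tokens
a = patchify(np.zeros((1024, 128), dtype=np.float32))
print(a.shape)  # (512, 256): a = [a_1, ..., a_512]

# Image: one 224 x 224 frame (single channel for simplicity) -> 196 tokens
v = patchify(np.zeros((224, 224), dtype=np.float32))
print(v.shape)  # (196, 256): v = [v_1, ..., v_196]
```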

2.1.2. THE TRANSFORMER ARCHITECTURE

Throughout this paper, we use the standard Transformer (Vaswani et al., 2017) as our main model component. Each Transformer layer consists of multi-headed self-attention (MSA), layer normalization (LN), and multilayer perceptron (MLP) blocks with residual connections. Specifically, we denote a Transformer layer y = Transformer(x; MSA, LN_1, LN_2, MLP) as:

x' = MSA(LN_1(x)) + x;    y = MLP(LN_2(x')) + x'    (1)

where MSA computes the dot-product attention between each pair of elements of x and thus has quadratic complexity w.r.t. the size of x. Please refer to Vaswani et al. (2017) for further details on Transformers.
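Eq. 1 describes a pre-norm Transformer layer. A minimal NumPy sketch of this computation is below (single-head attention and a ReLU MLP for brevity; the actual model uses multi-head attention and GELU, and all weight shapes here are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token over its feature dimension
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def msa(x, wq, wk, wv, wo):
    # single-head dot-product self-attention: quadratic in sequence length
    q, k, v = x @ wq, x @ wk, x @ wv
    att = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (att @ v) @ wo

def transformer_layer(x, wq, wk, wv, wo, w1, b1, w2, b2):
    x = msa(layer_norm(x), wq, wk, wv, wo) + x       # x' = MSA(LN_1(x)) + x
    h = np.maximum(layer_norm(x) @ w1 + b1, 0.0)     # MLP hidden (ReLU instead of GELU)
    return h @ w2 + b2 + x                           # y = MLP(LN_2(x')) + x'

d, seq = 8, 5
rng = np.random.default_rng(0)
wq, wk, wv, wo = (rng.normal(0, 0.02, (d, d)) for _ in range(4))
w1, b1 = rng.normal(0, 0.02, (d, 4 * d)), np.zeros(4 * d)
w2, b2 = rng.normal(0, 0.02, (4 * d, d)), np.zeros(d)
y = transformer_layer(rng.normal(size=(seq, d)), wq, wk, wv, wo, w1, b1, w2, b2)
print(y.shape)  # (5, 8): one output token per input token
```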

2.1.3. CONTRASTIVE AUDIO-VISUAL LEARNING (CAV)

The natural pairing of audio and visual information in videos is a useful signal for learning audio-visual representations through self-supervision. A conventional CAV model is shown in Figure 1.B (top). For a mini-batch of N audio-visual pair samples, we first pre-process and tokenize the audios



¹ Multi-modal representations can be divided into two categories: joint representations that combine the unimodal signals into the same representation space, and coordinated representations that process unimodal signals separately but enforce certain similarity constraints on them (Baltrušaitis et al., 2018).
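The contrastive objective described in §2.1.3 is commonly instantiated as a symmetric InfoNCE-style loss over the N matched pairs in a mini-batch. The sketch below is a minimal NumPy illustration of that general formulation (the temperature value and both directions of the loss are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def nce_loss(ca, cv, tau=0.05):
    """Symmetric contrastive loss over N audio-visual pairs.
    ca, cv: (N, d) pooled audio / visual embeddings.
    Matched pairs sit on the diagonal of the similarity matrix."""
    ca = ca / np.linalg.norm(ca, axis=-1, keepdims=True)
    cv = cv / np.linalg.norm(cv, axis=-1, keepdims=True)
    sim = ca @ cv.T / tau                              # (N, N) cosine sims / temperature
    diag = np.arange(sim.shape[0])
    loss_a2v = -log_softmax(sim)[diag, diag].mean()    # audio -> visual direction
    loss_v2a = -log_softmax(sim.T)[diag, diag].mean()  # visual -> audio direction
    return 0.5 * (loss_a2v + loss_v2a)

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(4, 16))
loss_matched = nce_loss(emb_a, emb_a + 0.01 * rng.normal(size=(4, 16)))
loss_shuffled = nce_loss(emb_a, np.roll(emb_a, 1, axis=0))
print(loss_matched < loss_shuffled)  # matched pairs give a much lower loss
```

Minimizing this loss pulls each audio embedding toward the visual embedding of the same video and pushes it away from the other N-1 clips in the batch, which is exactly what makes the resulting coordinated representation useful for cross-modal retrieval.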

