CONTRASTIVE AUDIO-VISUAL MASKED AUTOENCODER

Abstract

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on AudioSet in the audio-visual event classification task. Code and pretrained models are at https://github.com/yuangongnd/cav-mae.

1. INTRODUCTION

Acoustic and visual modalities have different properties, yet humans are able to seamlessly connect and integrate them to perceive the world. Developing learning algorithms to replicate these abilities, especially for multi-modal audio-visual fusion and retrieval, is of great interest. Since manually annotating audio and video is expensive and difficult to scale, how to utilize web-scale unlabeled video data in a self-supervised manner has become a core research question. One major line of audio-visual self-supervised learning research leverages the natural audio-visual correspondences found in videos. Among the numerous ways to use such correspondences, Contrastive Audio-Visual Learning has been shown to be a simple yet effective approach (Arandjelovic & Zisserman, 2018; Morgado et al., 2021b; Rouditchenko et al., 2021). It learns coordinated¹ representations that are closer for paired audio and visual samples than for mismatched samples. Such coordinated representations are particularly useful for tasks such as cross-modal retrieval.
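For intuition, a minimal sketch of such a contrastive audio-visual objective (an InfoNCE-style loss over a batch of paired clips) is given below. This is an illustrative placeholder rather than the exact loss formulated in this paper; the function name, embedding shapes, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_av_loss(audio_emb, visual_emb, temperature=0.07):
    """Illustrative audio-visual contrastive (InfoNCE-style) loss.

    audio_emb, visual_emb: (batch, dim) embeddings of paired clips;
    the i-th audio and i-th visual embedding come from the same video.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Paired samples (the diagonal) should score higher than mismatched pairs.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)
```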



¹ Multi-modal representations can be divided into two categories: joint representations that combine the unimodal signals into the same representation space, and coordinated representations that process unimodal signals separately, but enforce certain similarity constraints on them (Baltrušaitis et al., 2018).



Another commonly used self-supervised learning framework is Masked Data Modeling (MDM), which learns a meaningful representation with the pretext task of recovering the original inputs or features from corrupted ones (Devlin et al., 2019). In particular, based on the Audio Spectrogram Transformer (Gong et al., 2021a) and Vision Transformer (Dosovitskiy et al., 2020) backbones, the single-modal Masked Auto-Encoder (MAE) (He et al., 2022) achieved state-of-the-art (SOTA) performance on image and audio tasks (Huang et al., 2022a) individually. Inspired by these advances, we propose to extend the single-modal MAE to an Audio-Visual Masked Auto-Encoder (AV-MAE), aiming to learn a joint representation that fuses the unimodal signals.

Although these two major self-supervised frameworks have been widely used individually, to the best of our knowledge, they have never been combined in audio-visual learning. In fact, we find they are complementary: contrastive audio-visual learning explicitly leverages the very useful audio-visual pair information, but it could discard modality-unique information that is useful in downstream tasks; the reconstruction task of AV-MAE forces its representation to encode the majority of the input information in the fusion, but it lacks an explicit audio-visual correspondence objective.
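As a rough sketch of the masked-reconstruction pretext task described above (not the authors' implementation; the mask ratio and names are illustrative assumptions), MAE-style pretraining randomly hides most input patches, encodes only the visible ones, and trains a decoder to reconstruct the hidden ones:

```python
import torch

def random_mask(tokens, mask_ratio=0.75):
    """Illustrative MAE-style random masking.

    tokens: (batch, num_patches, dim) patch embeddings (audio spectrogram
    patches or image patches). A high mask ratio (e.g. 75%) is typical for MAE.
    Returns the visible tokens fed to the encoder and the shuffle indices
    needed to restore the original order for reconstruction.
    """
    b, n, d = tokens.shape
    num_keep = int(n * (1 - mask_ratio))
    # A random permutation per example decides which patches stay visible.
    noise = torch.rand(b, n, device=tokens.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    # A decoder later receives the encoded visible tokens plus mask tokens and
    # is trained to reconstruct the original (masked) patches, e.g. with MSE.
    return visible, ids_shuffle
```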

