CONTRASTIVE SELF-SUPERVISED LEARNING OF GLOBAL-LOCAL AUDIO-VISUAL REPRESENTATIONS

Abstract

Contrastive self-supervised learning has delivered impressive results in many audio-visual recognition tasks. However, existing approaches optimize for learning either global representations useful for high-level understanding tasks such as classification, or local representations useful for tasks such as audio-visual source localization and separation. While they produce satisfactory results in their intended downstream scenarios, they often fail to generalize to tasks that they were not originally designed for. In this work, we propose a versatile self-supervised approach to learn audio-visual representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require fine-grained spatio-temporal information (e.g., localization). We achieve this by optimizing two cross-modal contrastive objectives that together encourage our model to learn discriminative global-local visual information given audio signals. To show that our approach learns generalizable video representations, we evaluate it on a variety of downstream scenarios, including action/sound classification, lip reading, deepfake detection, and sound source localization.

1. INTRODUCTION

Self-supervised learning aims to learn representations of data that generalize to a large variety of downstream tasks. Recently, contrastive self-supervised learning (CSL) has achieved impressive results on several computer vision tasks (Oord et al., 2018; Hjelm et al., 2018; He et al., 2020; Chen et al., 2020). In CSL, the choice of "views" determines the types of information that the representation captures (Bachman et al., 2019), as the framework learns representations that focus on the information shared between views. It has been demonstrated that the optimal choice of views depends critically on the downstream task (Tian et al., 2020). Therefore, existing works mainly focus on finding different views tailored to the intended downstream tasks. For example, when tailoring views for action classification, Hjelm & Bachman (2020) extend DIM (Hjelm et al., 2018) to the spatio-temporal setting by assuming that global and local information useful for action classification (i.e., global semantics) should be invariant across time and space within a given video. When dealing with multimodal data, several approaches utilize audio-visual correspondence from videos (Morgado et al., 2020). Such CSL approaches rest on the assumption that the information needed for audio/video classification should be shared between the two modalities. Although they achieve impressive results on their intended downstream tasks, existing approaches often fail to generalize to tasks that they were not originally designed for. For example, in lip reading (Chung & Zisserman, 2016), the desired information is the fine-grained spatio-temporal representation around the mouth. However, if we directly apply existing CSL approaches, the information shared across views is merely that there is a face, while the useful information, the lip movements, is suppressed because it varies across views sampled from the same clip.
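The shared-information principle behind CSL can be made concrete with the InfoNCE objective (Oord et al., 2018): each sample's two views form a positive pair, and all other cross-view combinations in the batch serve as negatives. The following is a minimal NumPy sketch of this objective, not the implementation used in this work; the batch size, embedding dimension, and temperature value are illustrative choices of ours.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss: z1[i] and z2[i] are two views of sample i.

    Each row of z1 is contrasted against all rows of z2; the row with
    the same index is the positive, all other rows are negatives.
    """
    # L2-normalize so that dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives lie on the diagonal of the similarity matrix
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
# identical views: positives dominate, loss is small
loss_aligned = info_nce(z, z)
# unrelated views: loss approaches log(N)
loss_random = info_nce(z, rng.normal(size=(8, 16)))
print(loss_aligned < loss_random)  # True
```

Minimizing this loss pulls paired views together and pushes unpaired ones apart, which is exactly why only the information shared between the chosen views survives in the representation.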
Motivated by this, we propose a versatile CSL approach to learn representations that can generalize both to scenarios that require global representations (e.g., classification) and to scenarios that require local representations (e.g., localization) (see Fig. 1). Our approach, which we call global-local cross-modal (GLCM) contrastive learning, has four key properties that we assume to be important for our learning objective: 1) observations from the same time span of a video should reflect the same content regardless of modality; 2) the same observations captured at different time scales can reflect both global and local information; 3) when learning at a local temporal scale, the contrasting views should share only the time-varying information (e.g., only the moving lips) while ignoring globally invariant information; 4) multi-scale (global-local) observations can be trained jointly in a collaborative way so that representations learned at either scale can be reused.

Figure 1: While many self-supervised approaches optimize for high-level or low-level tasks, we present an approach to learn both global and local representations from video.

We formulate our GLCM objective using two cross-modal contrastive losses computed at multiple temporal scales. Specifically, we generate global and local views of a visual sequence at different sampling rates. The audio sequence is used as an anchor to contrast with the global and local visual features, respectively. Consistent with the first property, losses are computed at the same temporal scale, i.e., z_a^g ↔ z_v^g and z_a^l ↔ z_v^l, such that, given the same video sequence, the learned z_v^g and z_v^l reflect both global and local information; the latter satisfies the second property. To implement the third property, the local contrastive loss (z_a^l ↔ z_v^l) treats only the audio-visual features that lie in the same time window as positive pairs; all others are negative pairs.
Finally, we utilize information captured at the global scale (e.g., localizing the source of a sound) to assist efficient learning at the local scale, thus capturing the fourth property. We show that GLCM pretraining learns representations with both global and fine-grained spatio-temporal information from audio-visual signals. The learned representations perform effectively on a variety of downstream tasks. We evaluate our proposed approach on tasks that need local spatio-temporal information (i.e., lip reading, deepfake detection, and sound source localization) as well as on discriminative tasks that need global information (i.e., action classification and audio-event classification).
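The two-scale objective described above can be sketched as follows. This is a minimal NumPy illustration of the loss structure only: random vectors stand in for encoder outputs, the function names (`nce`, `glcm_loss`) are ours, and the unweighted sum of the two terms is our simplifying assumption rather than a detail taken from the method.

```python
import numpy as np

def nce(anchor, candidates, pos_idx, temperature=0.1):
    """Cross-modal InfoNCE for one anchor vector: candidates[pos_idx]
    is the positive, all other candidate rows are negatives."""
    a = anchor / np.linalg.norm(anchor)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = c @ a / temperature
    logits -= logits.max()                         # numerical stability
    return -(logits[pos_idx] - np.log(np.exp(logits).sum()))

def glcm_loss(z_a_g, z_v_g, z_a_l, z_v_l):
    """Global-local cross-modal contrastive loss (illustrative).

    z_a_g, z_v_g: (B, D) clip-level audio/visual embeddings; the audio
        anchor of clip i is matched against the visual embedding of
        clip i, with other clips in the batch as negatives.
    z_a_l, z_v_l: (B, T, D) per-time-window embeddings; within a clip,
        the positive for audio window t is visual window t, and the
        remaining windows of that clip serve as negatives (property 3).
    """
    B, T, _ = z_a_l.shape
    # Global term: contrast clips across the batch (z_a^g <-> z_v^g)
    global_term = np.mean([nce(z_a_g[i], z_v_g, i) for i in range(B)])
    # Local term: contrast time windows within each clip (z_a^l <-> z_v^l)
    local_term = np.mean([nce(z_a_l[i, t], z_v_l[i], t)
                          for i in range(B) for t in range(T)])
    return global_term + local_term

rng = np.random.default_rng(1)
B, T, D = 4, 6, 32
z_a_g = rng.normal(size=(B, D))
z_a_l = rng.normal(size=(B, T, D))
# aligned modalities: visual features are noisy copies of the audio ones
loss_aligned = glcm_loss(z_a_g, z_a_g + 0.1 * rng.normal(size=(B, D)),
                         z_a_l, z_a_l + 0.1 * rng.normal(size=(B, T, D)))
# mismatched modalities: unrelated visual features
loss_mismatched = glcm_loss(z_a_g, rng.normal(size=(B, D)),
                            z_a_l, rng.normal(size=(B, T, D)))
```

The local term only ever contrasts windows of the same clip, so globally invariant content (e.g., the identity of a face) contributes equally to every candidate and cancels out, leaving the time-varying signal to drive the loss.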

2. RELATED WORK

Contrastive self-supervised learning. CSL has contributed to strong performance on many tasks and in some cases produced results comparable to supervised learning (Chen et al., 2020; Caron et al., 2020). Contrastive learning leverages multiple views of the same data (Hjelm & Bachman, 2020; Oord et al., 2018), e.g., multiple perspectives within the same modality (augmentations of the same image, different frames of a video, etc.) (He et al., 2020; Hjelm & Bachman, 2020; Han et al., 2019a) or perspectives from different modalities (e.g., depth and RGB images, visual and textual signals) (Tian et al., 2019; Sun et al., 2019; Miech et al., 2020; Alayrac et al., 2020). Chen et al. (2020) and Hjelm et al. (2018) show that leveraging local information in contrastive learning further improves performance on image classification tasks. DIM (Hjelm et al., 2018) has been extended to multi-scale (Bachman et al., 2019) and video data (Hjelm & Bachman, 2020). However, evaluation is still focused on "discriminative" tasks (image classification and video event classification), and there is little evidence that these models adapt well to tasks requiring local information.

Audio-visual representation learning. Several approaches have been proposed to leverage the natural correspondence between audio and visual signals to perform CSL (Asano et al., 2020; Korbar et al., 2018; Alwassel et al., 2019; Morgado et al., 2020; Patrick et al., 2020; Chung et al., 2019). Most existing approaches aim to capture high-level semantic information from observations. It has been empirically demonstrated that such learned information is very effective for "discriminative" tasks (classification). However, on tasks that need local information, the learned representations may not perform well. Xiao et al. (2020a) design their approach by utilizing different temporal scales of the audio and visual data, which encourages the model to capture fine-grained temporal information and hence improves performance. However, their evaluation was limited to classification tasks. In contrast with previous work, we demonstrate that our approach effectively learns global-local audio-visual representations by evaluating on a variety of downstream tasks.

