CONTRASTIVE SELF-SUPERVISED LEARNING OF GLOBAL-LOCAL AUDIO-VISUAL REPRESENTATIONS

Abstract

Contrastive self-supervised learning has delivered impressive results in many audio-visual recognition tasks. However, existing approaches optimize for learning either global representations useful for high-level understanding tasks such as classification, or local representations useful for tasks such as audio-visual source localization and separation. While they produce satisfactory results in their intended downstream scenarios, they often fail to generalize to tasks that they were not originally designed for. In this work, we propose a versatile self-supervised approach to learn audio-visual representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require fine-grained spatio-temporal information (e.g., localization). We achieve this by optimizing two cross-modal contrastive objectives that together encourage our model to learn discriminative global-local visual information given audio signals. To show that our approach learns generalizable video representations, we evaluate it on various downstream scenarios including action/sound classification, lip reading, deepfake detection, and sound source localization.

1. INTRODUCTION

Self-supervised learning aims to learn representations of data that generalize to a large variety of downstream tasks. Recently, contrastive self-supervised learning (CSL) has achieved impressive results on several computer vision tasks (Oord et al., 2018; Hjelm et al., 2018; He et al., 2020; Chen et al., 2020). In CSL, the choice of "views" determines the types of information that the representation captures (Bachman et al., 2019), as the framework learns representations that focus on the information shared between views. It has been demonstrated that the optimal choice of views depends critically on the downstream task (Tian et al., 2020). Therefore, existing works mainly focus on finding different views tailored for the intended downstream tasks. For example, when tailoring views for action classification, Hjelm & Bachman (2020) extend DIM (Hjelm et al., 2018) to the spatio-temporal setting by assuming that global and local information useful for action classification (i.e., global semantics) should be invariant across time and space within a given video. When dealing with multimodal data, several approaches utilize audio-visual correspondence from videos (Morgado et al., 2020). Such a CSL approach is based on the assumption that the information needed for audio/video classification should be shared between the two modalities. Although they achieve impressive results in their intended downstream tasks, existing approaches often fail to generalize to tasks that they were not originally designed for. For example, in lip reading (Chung & Zisserman, 2016), the desired information is the fine-grained spatio-temporal representation around the mouth. However, if we directly apply existing CSL approaches, the information shared across views is merely that a face is present, while the useful information, the lip movements, is suppressed because it changes across views sampled from the same clip.
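To make the cross-modal contrastive setup concrete, the sketch below shows a minimal InfoNCE-style objective over a batch of paired audio and video embeddings, where each audio clip's positive is the video clip from the same time span and all other videos in the batch serve as negatives. This is an illustrative sketch of the generic cross-modal CSL objective discussed above, not the paper's exact loss; the function name, batch layout, and temperature value are our own assumptions.

```python
import numpy as np

def cross_modal_infonce(audio_emb, video_emb, temperature=0.1):
    """Illustrative cross-modal InfoNCE loss (not the paper's exact objective).

    audio_emb, video_emb: (B, D) arrays; row i of each comes from the
    same time span of the same video, so the positives lie on the
    diagonal of the similarity matrix and the off-diagonal entries
    act as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)

    logits = a @ v.T / temperature  # (B, B) pairwise similarities

    # Row-wise log-softmax (with max-subtraction for numerical stability)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Negative log-likelihood of the matching (diagonal) pairs
    return -np.mean(np.diag(log_prob))
```

Under this objective, perfectly aligned audio-video pairs yield a loss near zero, while mismatched pairs approach the chance level of log(B) for a batch of size B, which is why only information shared between the two modalities survives training.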
Motivated by this, we propose a versatile CSL approach to learn representations that can generalize both to scenarios that require global representations (e.g., classification) and to scenarios that require local representations (e.g., localization) (see Fig. 1). Our approach, which we call global-local cross-modal (GLCM) contrastive learning, has four key properties that we assume to be important for our learning objective: 1) observations from the same time span of a video should reflect the same content regardless of modalities; 2) the same observations captured at different time scales can reflect both global and local information; 3) when learning on a local temporal scale, the contrasting

