ACTIVE CONTRASTIVE LEARNING OF AUDIO-VISUAL VIDEO REPRESENTATIONS

Abstract

Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing a lower bound on the mutual information (MI) between different views of an instance. However, obtaining a tight lower bound requires a sample size exponential in the MI, and thus a large set of negative samples. We can incorporate more samples by building a large queue-based dictionary, but there are theoretical limits to performance improvement even with a large number of negative samples. We hypothesize that random negative sampling leads to a highly redundant dictionary, resulting in suboptimal representations for downstream tasks. In this paper, we propose an active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items. This improves the quality of negative samples and boosts performance on tasks where there is high mutual information in the data, e.g., video classification. Our model achieves state-of-the-art performance on challenging audio and visual downstream benchmarks, including UCF101, HMDB51, and ESC50.

1. INTRODUCTION

Contrastive learning of audio and visual representations has delivered impressive results on various downstream scenarios (Oord et al., 2018; Hénaff et al., 2019; Schneider et al., 2019; Chen et al., 2020). This self-supervised training process can be understood as building a dynamic dictionary per mini-batch, where "keys" are typically randomly sampled from the data. The encoders are trained to perform dictionary look-up: an encoded "query" should be similar to the value of its matching key and dissimilar to all others. This training objective maximizes a lower bound on the mutual information (MI) between representations and the data (Hjelm et al., 2018; Arora et al., 2019). However, such lower bounds are tight only for sample sizes exponential in the MI (McAllester & Stratos, 2020), suggesting the importance of building a large and consistent dictionary across mini-batches. Recently, He et al. (2020) designed Momentum Contrast (MoCo), which builds a queue-based dictionary with momentum updates. It achieves a large and consistent dictionary by decoupling the dictionary size from the GPU/TPU memory capacity. However, Arora et al. (2019) showed that simply increasing the dictionary size beyond a threshold does not improve (and can sometimes even harm) performance on downstream tasks. Furthermore, we find that MoCo can suffer when there is high redundancy in the data, because only the relevant (and thus limited) parts of the dictionary are updated in each iteration, ultimately leading to a dictionary of redundant items (we show this empirically in Fig. 3). We argue that random negative sampling is largely responsible for this: a randomly constructed dictionary will contain more "biased keys" (similar keys that belong to the same class) and "ineffective keys" (keys that can be easily discriminated by the current model) than a carefully constructed one. Moreover, this issue is aggravated when the dictionary size is large.
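To make the dictionary look-up objective above concrete, the following is a minimal NumPy sketch of an InfoNCE-style loss for a single query scored against its matching key and a queue of random negative keys. All names, dimensions, and the temperature value are illustrative assumptions, not the paper's actual implementation, which uses learned deep encoders trained over mini-batches.

```python
import numpy as np

def info_nce_loss(query, pos_key, queue, temperature=0.07):
    """Cross-entropy over similarity logits, with the matching key as the
    single positive (index 0) and the queue entries as negatives.
    All vectors are assumed L2-normalized. Hypothetical sketch only."""
    pos_logit = query @ pos_key / temperature        # scalar: query vs. matching key
    neg_logits = queue @ query / temperature         # (K,): query vs. each negative key
    logits = np.concatenate([[pos_logit], neg_logits])
    # -log softmax probability of the positive = log-sum-exp minus positive logit
    return np.log(np.sum(np.exp(logits))) - pos_logit

rng = np.random.default_rng(0)
unit = lambda v: v / np.linalg.norm(v)

q = unit(rng.normal(size=128))                       # encoded "query"
k_pos = unit(q + 0.1 * rng.normal(size=128))         # matching key: a noisy view of q
queue = np.stack([unit(rng.normal(size=128)) for _ in range(4096)])  # random negatives
loss = info_nce_loss(q, k_pos, queue)
```

Minimizing this loss pushes the query toward its matching key and away from every item in the queue; the paper's point is that when the queue is filled by random sampling, many of those items are redundant or trivially easy, so the gradient signal they contribute is weak.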
In this paper, we focus on learning audio-visual representations of video data by leveraging the natural correspondence between the two modalities, which serves as a useful self-supervisory signal (Owens & Efros, 2018; Owens et al., 2016; Alwassel et al., 2019). Our starting point is contrastive learning (Gutmann & Hyvärinen, 2010; Oord et al., 2018) with momentum updates (He et al., 2020).
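The momentum update referenced above can be sketched in a few lines: the key encoder is maintained as an exponential moving average of the query encoder, which keeps the keys in the queue-based dictionary consistent across mini-batches. The weight matrices below are hypothetical stand-ins for the deep audio/visual encoders used in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
W_query = rng.normal(size=(16, 8))   # query-encoder weights (stand-in for a deep net)
W_key = W_query.copy()               # key encoder starts as a copy of the query encoder
m = 0.999                            # momentum coefficient (MoCo uses values near 1)

def momentum_update(w_query, w_key, m):
    """Move the key encoder a small step (1 - m) toward the query encoder."""
    return m * w_key + (1.0 - m) * w_query

# Simulate one gradient step on the query encoder, then update the key encoder.
W_key_before = W_key.copy()
W_query = W_query + 0.01 * rng.normal(size=W_query.shape)
W_key = momentum_update(W_query, W_key, m)
```

Because only the query encoder receives gradients, the slowly evolving key encoder changes little between iterations, so keys enqueued at different times remain comparable.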

