ACTIVE CONTRASTIVE LEARNING OF AUDIO-VISUAL VIDEO REPRESENTATIONS

Abstract

Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance. However, obtaining a tight lower bound requires a sample size exponential in MI and thus a large set of negative samples. We can incorporate more samples by building a large queue-based dictionary, but there are theoretical limits to performance improvement even with a large number of negative samples. We hypothesize that random negative sampling leads to a highly redundant dictionary that results in suboptimal representations for downstream tasks. In this paper, we propose an active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items, improving the quality of negative samples and the performance on tasks where there is high mutual information in the data, e.g., video classification. Our model achieves state-of-the-art performance on challenging audio and visual downstream benchmarks including UCF101, HMDB51, and ESC50.

1. INTRODUCTION

Contrastive learning of audio and visual representations has delivered impressive results on various downstream scenarios (Oord et al., 2018; Hénaff et al., 2019; Schneider et al., 2019; Chen et al., 2020). This self-supervised training process can be understood as building a dynamic dictionary per mini-batch, where "keys" are typically sampled randomly from the data. The encoders are trained to perform dictionary look-up: an encoded "query" should be similar to the value of its matching key and dissimilar to all others. This training objective maximizes a lower bound on the mutual information (MI) between representations and the data (Hjelm et al., 2018; Arora et al., 2019). However, such lower bounds are tight only for sample sizes exponential in the MI (McAllester & Stratos, 2020), suggesting the importance of building a large and consistent dictionary across mini-batches. Recently, He et al. (2020) designed Momentum Contrast (MoCo), which builds a queue-based dictionary with momentum updates; it achieves a large and consistent dictionary by decoupling the dictionary size from the GPU/TPU memory capacity.

However, Arora et al. (2019) showed that simply increasing the dictionary size beyond a threshold does not improve (and can even harm) performance on downstream tasks. Furthermore, we find that MoCo can suffer when there is high redundancy in the data, because only relevant (and thus limited) parts of the dictionary are updated in each iteration, ultimately leading to a dictionary of redundant items (we show this empirically in Fig. 3). We argue that random negative sampling is largely responsible for this: a randomly constructed dictionary will contain more "biased keys" (similar keys that belong to the same class) and "ineffective keys" (keys that are easily discriminated by the current model) than a carefully constructed one, and this issue is aggravated as the dictionary size grows.
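The prevalence of biased keys admits a back-of-the-envelope estimate. Assuming negatives are drawn uniformly over C semantic classes (a simplifying assumption made here purely for illustration), the expected number of same-class keys in a dictionary of size K is K/C, and the chance of drawing at least one is 1 - (1 - 1/C)^K:

```python
def expected_biased_keys(K: int, C: int) -> float:
    """Expected number of same-class ("biased") keys among K random
    negatives, assuming C equally likely classes."""
    return K / C

def p_at_least_one_biased(K: int, C: int) -> float:
    """Probability that a randomly sampled dictionary of size K
    contains at least one key from the query's class."""
    return 1.0 - (1.0 - 1.0 / C) ** K
```

For instance, with MoCo's default dictionary size K = 65536 and AudioSet's 527 sound classes, a random dictionary is expected to hold roughly 124 biased keys, and even at K = 256 the chance of drawing at least one is about 38%.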
In this paper, we focus on learning audio-visual representations of video data by leveraging the natural correspondence between the two modalities, which serves as a useful self-supervisory signal (Owens & Efros, 2018; Owens et al., 2016; Alwassel et al., 2019). Our starting point is contrastive learning (Gutmann & Hyvärinen, 2010; Oord et al., 2018) with momentum updates (He et al., 2020). However, as discussed above, there are both practical challenges and theoretical limits to the dictionary size. This issue is common to all natural data but is especially severe in video: successive frames contain highly redundant information, and from an information-theoretic perspective, the audio-visual channels of video contain higher MI than images because the higher dimensionality (i.e., temporal and multimodal) reduces the uncertainty between successive video clips. A dictionary of randomly sampled video clips would therefore contain highly redundant information, rendering contrastive learning ineffective.

To address this, we propose an actively sampled dictionary containing an informative and diverse set of negative instances. Our approach is inspired by active learning (Settles, 2009), which aims to identify and label only the maximally informative samples, so that one can train a high-performing classifier with minimal labeling effort. We adapt this idea to construct a non-redundant dictionary of informative negative samples. Our approach, Cross-Modal Active Contrastive Coding (CM-ACC), learns discriminative audio-visual representations and achieves substantially better results on video data with a high amount of redundancy (and thus high MI). We show that our actively sampled dictionary contains negative samples from a wider variety of semantic categories than a randomly sampled dictionary. As a result, our approach can benefit from large dictionaries even when randomly sampled dictionaries of the same size start to have a deleterious effect on model performance.
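This excerpt does not spell out CM-ACC's exact sampling criterion, but the underlying intuition, preferring negatives that are dissimilar to those already in the dictionary, can be sketched with a greedy farthest-point selection over key embeddings. This is an illustrative stand-in, not the paper's actual algorithm, and the function name is ours:

```python
import numpy as np

def select_diverse_negatives(candidates: np.ndarray, k: int, first: int = 0) -> list:
    """Greedily pick k mutually dissimilar keys from a pool of candidate
    embeddings (one per row), using farthest-point selection under cosine
    similarity. Illustrative sketch only; not CM-ACC's actual criterion."""
    X = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    chosen = [first]                            # seed with an arbitrary candidate
    while len(chosen) < k:
        sims = X @ X[chosen].T                  # similarity to the chosen set
        nearest = sims.max(axis=1)              # each candidate's closest chosen key
        nearest[chosen] = np.inf                # never re-pick a chosen key
        chosen.append(int(np.argmin(nearest)))  # least similar to all chosen
    return chosen
```

On a pool containing two tight clusters, the first two selections come from different clusters, whereas two random draws would frequently land in the same one; this is the redundancy that random sampling leaves in the dictionary.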
When pretrained on AudioSet (Gemmeke et al., 2017), our approach achieves new state-of-the-art classification performance on UCF101 (Soomro et al., 2012), HMDB51 (Kuehne et al., 2011), and ESC50 (Piczak, 2015b).

2. BACKGROUND

Contrastive learning optimizes an objective that encourages similar samples (positive pairs) to have more similar representations than dissimilar ones (called negative samples) (Oord et al., 2018):

$$\min_{\theta_f,\theta_h} \; \mathbb{E}_{x \sim p_X}\left[-\log \frac{e^{f(x;\theta_f)^\top h(x^+;\theta_h)}}{e^{f(x;\theta_f)^\top h(x^+;\theta_h)} + \sum_{x^-} e^{f(x;\theta_f)^\top h(x^-;\theta_h)}}\right] \quad (1)$$

The samples $x^+$ and $x^-$ are drawn from the same distribution as $x \in \mathcal{X}$, and are assumed to be similar and dissimilar to $x$, respectively. The objective encourages $f(\cdot)$ and $h(\cdot)$ to learn representations of $x$ such that $(x, x^+)$ has a higher similarity score than every pair $(x, x^-)$. We can interpret this as a dynamic dictionary look-up: given a "query" $x$, find the correct "key" $x^+$ among the irrelevant keys $x^-$ in a dictionary. Denoting the query by $q = f(x)$, the correct key by $k^+ = h(x^+)$, and the dictionary of $K$ negative samples by $\{k_i = h(x_i)\}, i \in [1, K]$ (with $k_0 = k^+$), we can express equation 1 in softmax form:

$$\min_{\theta_q,\theta_k} \; \mathbb{E}_{x \sim p_X}\left[-\log \frac{e^{q \cdot k^+/\tau}}{\sum_{i=0}^{K} e^{q \cdot k_i/\tau}}\right] \quad (2)$$

where $\theta_q$ and $\theta_k$ are the parameters of the query and key encoders, respectively, and $\tau$ is a temperature term that controls the shape of the probability distribution computed by the softmax function. Momentum Contrast (MoCo) decouples the dictionary size from the mini-batch size by implementing a queue-based dictionary: current mini-batch samples are enqueued while the oldest are dequeued (He et al., 2020). It then applies momentum updates to the key-encoder parameters $\theta_k$ with respect to the query-encoder parameters $\theta_q$: $\theta_k \leftarrow m\theta_k + (1-m)\theta_q$, where $m \in [0, 1)$ is a momentum coefficient. Only $\theta_q$ is updated by back-propagation, while $\theta_k$ is a moving average of $\theta_q$ with exponential smoothing. These two modifications allow MoCo to build a large and slowly-changing (and thus consistent) dictionary.

Theoretical Limitations of Contrastive Learning. Recent work provides theoretical analysis of the shortcomings of contrastive learning.
McAllester & Stratos (2020) show that lower bounds on the MI are tight only for sample sizes exponential in the MI, suggesting that a large amount of data is required to achieve a tight lower bound. He et al. (2020) empirically showed that increasing the number of negative samples improves the learned representations. However, Arora et al. (2019) showed that this does not always hold: excessive negative samples can sometimes hurt performance. Moreover, when the number of negative samples is large, the chance of sampling redundant instances increases, limiting the effectiveness of contrastive learning. One of our main contributions is to address this issue with active sampling of negative instances, which reduces redundancy and improves diversity, leading to improved performance on various downstream tasks.
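The dictionary look-up objective and momentum update described in this section can be sketched in a few lines of NumPy; the shapes and hyperparameter values below are illustrative, not the paper's settings:

```python
import numpy as np

def info_nce_loss(q, k_pos, queue, tau=0.07):
    """Dictionary look-up loss: negative log-softmax score of the positive key.
    q, k_pos: (D,) L2-normalized embeddings; queue: (K, D) negative keys."""
    logits = np.concatenate(([q @ k_pos], queue @ q)) / tau  # positive key first
    logits -= logits.max()                                   # numerical stability
    p_pos = np.exp(logits[0]) / np.exp(logits).sum()
    return -np.log(p_pos)

def momentum_update(theta_k, theta_q, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q (exponential moving average).
    In MoCo, only theta_q receives gradients; theta_k trails it slowly."""
    return m * theta_k + (1.0 - m) * theta_q
```

A query that aligns with its positive key yields a lower loss than one whose "positive" is an arbitrary queue entry, and with m = 0.999 the key encoder moves only about 0.1% toward the query encoder per step, which is what keeps the queued keys consistent with each other.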

