MULTI-MODAL SELF-SUPERVISION FROM GENERALIZED DATA TRANSFORMATIONS

Abstract

In the image domain, excellent representations can be learned by inducing invariance to content-preserving transformations, such as image distortions. In this paper, we show that, for videos, the answer is more complex, and that better results can be obtained by accounting for the interplay between invariance, distinctiveness, multiple modalities, and time. We introduce Generalized Data Transformations (GDTs) as a way to capture this interplay. GDTs reduce most previous self-supervised approaches to a choice of data transformations, even when this was not the case in the original formulations. They also allow one to choose whether the representation should be invariant or distinctive with respect to each effect and tell which combinations are valid, thus allowing us to explore the space of combinations systematically. We show in this manner that being invariant to certain transformations and distinctive to others is critical to learning effective video representations, improving the state of the art by a large margin, and even surpassing supervised pretraining. We demonstrate results on a variety of downstream video and audio classification and retrieval tasks, on datasets such as HMDB-51, UCF-101, DCASE2014, ESC-50 and VGG-Sound. In particular, we achieve new state-of-the-art accuracies of 72.8% on HMDB-51 and 95.2% on UCF-101.

1. INTRODUCTION

Recent works such as PIRL (Misra & van der Maaten, 2020), MoCo (He et al., 2019) and SimCLR (Chen et al., 2020) have shown that it is possible to pre-train state-of-the-art image representations without the use of any manually-provided labels. Furthermore, many of these approaches use variants of noise contrastive learning (Gutmann & Hyvärinen, 2010). Their idea is to learn a representation that is invariant to transformations that leave the meaning of an image unchanged (e.g. geometric distortion or cropping) and distinctive to changes that are likely to alter its meaning (e.g. replacing an image with another chosen at random). An analysis of such works shows that a dominant factor for performance is the choice of the transformations applied to the data. So far, authors have explored ad-hoc combinations of several transformations (e.g. random scale changes, crops, or contrast changes). Videos further allow us to leverage the time dimension and multiple modalities. For example, Arandjelovic & Zisserman (2017); Owens et al. (2016) learn representations by matching visual and audio streams, as a proxy for objects that have a coherent appearance and sound. Their formulation is similar to noise contrastive ones, but does not quite follow the pattern of expressing the loss in terms of data transformations. Others (Chung & Zisserman, 2016; Korbar et al., 2018; Owens & Efros, 2018) depart further from standard contrastive schemes by learning representations that can tell whether visual and audio streams are in sync or not; the difference here is that the representation is encouraged to be distinctive rather than invariant to a time shift. Overall, it seems that finding an optimal noise contrastive formulation for videos will require combining several transformations while accounting for time and multiple modalities, and understanding how invariance and distinctiveness should relate to the transformations.
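The invariance/distinctiveness idea behind these noise contrastive formulations can be made concrete with a minimal InfoNCE-style loss. The sketch below, in plain NumPy, assumes two batches of embeddings in which row i of each batch comes from two transformations of the same sample (a positive pair) and every other pairing is a negative; the function name, arguments, and temperature value are illustrative choices, not taken from any of the cited papers.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.07):
    """Noise-contrastive (InfoNCE-style) loss between two batches of
    embeddings, where z_a[i] and z_b[i] are two transformations of the
    same sample (positives) and all other pairs are negatives."""
    # L2-normalise so the dot product is a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N) similarity matrix
    # Row i should place all of its probability mass on column i,
    # i.e. attract its positive and repel the N-1 negatives.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Minimising this loss makes the representation invariant across the positive pair (the two views are pulled together) while remaining distinctive across samples (all other pairs are pushed apart).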
However, the ad-hoc nature of these choices in previous contributions makes a systematic exploration of this space rather difficult. In this paper, we propose a solution to this problem by introducing the Generalized Data Transformations (GDT; fig. 1) framework. GDTs reduce most previous methods, contrastive or not, to a noise contrastive formulation that is expressed in terms of data transformations only, making it simpler to systematically explore the space of possible combinations. This is true in particular for multi-modal data, where separating different modalities can also be seen as a transformation of an input video. The formalism also shows which combinations of different transformations are valid and how to enumerate them. It also clarifies how invariance and distinctiveness to different effects can be incorporated in the formulation and when doing so leads to a valid learning objective. These two aspects allow the search space of potentially optimal transformations to be significantly constrained, making it amenable to grid search or to more sophisticated methods such as Bayesian optimisation. By using GDTs, we make several findings. First, we find that, using our framework, most previous pretext representation learning tasks can be formulated in a noise-contrastive manner, unifying previously distinct domains. Second, we show that just learning representations that are invariant to more and more transformations is not optimal, at least when it comes to video data; instead, balancing invariance to certain factors with distinctiveness to others performs best. Third, we find that choosing what to be distinctive to can lead to large gains in downstream performance, for both visual and audio tasks. With this, we are able to set a new state of the art in audio-visual representation learning, with both small and large video pretraining datasets, on a variety of visual and audio downstream tasks.
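To illustrate how declaring invariance or distinctiveness per transformation constrains the search space, the following hypothetical sketch enumerates transformation tuples over three factors (source sample, modality, time shift) and builds a contrast matrix c(T, T′): a pair attracts (1) exactly when the two tuples agree on every factor marked distinctive, and may differ freely on factors marked invariant. The factor names, the FACTOR_SPEC marking, and the contrast function are our own illustrative choices, not the paper's implementation.

```python
from itertools import product

# Hypothetical factors: which source video, which modality, which time
# shift. For each factor we declare whether the representation should be
# invariant ("i") or distinctive ("d") to it.
FACTOR_SPEC = {"sample": "d", "modality": "i", "time_shift": "d"}

def contrast(t1, t2, spec=FACTOR_SPEC):
    """c(T, T') = 1 (attract) iff the two transformation tuples agree on
    every factor marked distinctive; invariant factors may differ."""
    return int(all(t1[f] == t2[f] for f, mode in spec.items() if mode == "d"))

# Enumerate all valid transformation tuples and the full contrast matrix.
transforms = list(product(range(2), ["video", "audio"], [0, 1]))
tuples = [dict(zip(FACTOR_SPEC, t)) for t in transforms]
matrix = [[contrast(a, b) for b in tuples] for a in tuples]
```

Under this marking, cross-modal pairs from the same clip and time shift attract, while pairs from different clips or different time shifts repel; flipping a single entry of FACTOR_SPEC yields a different point in the search space, which is what makes grid search over markings tractable.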
In particular, we achieve 95.2% and 72.8% on the standardized UCF-101 and HMDB-51 action recognition benchmarks.

2. RELATED WORK

Self-supervised learning from images and videos. A variety of pretext tasks have been proposed to learn representations from unlabelled images. Some tasks leverage the spatial context in images (Doersch et al., 2015; Noroozi & Favaro, 2016) to train CNNs, while others create pseudo classification labels via artificial rotations (Gidaris et al., 2018), or by clustering features (Asano et al., 2020b; Caron et al., 2018; 2019; Gidaris et al., 2020; Ji et al., 2018). Colorization (Zhang et al., 2016; 2017), inpainting (Pathak et al., 2016), and solving jigsaw puzzles (Noroozi et al., 2017), as well as the contrastive methods detailed below, have also been proposed for self-supervised image representation learning. Some of the tasks that use the space dimension of images have been extended to the space-time dimensions of videos by crafting equivalent tasks. These include jigsaw puzzles (Kim et al., 2019), and predicting rotations (Jing & Tian, 2018) or future frames (Han et al., 2019). Other tasks leverage the temporal dimension of videos to learn representations by predicting shuffled frames (Misra et al., 2016), the direction of time (Wei et al., 2018), motion (Wang et al., 2019), clip and sequence order (Lee et al., 2017; Xu et al., 2019), and playback speed (Benaim et al., 2020; Cho et al., 2020; Fernando et al., 2017). These pretext tasks can be framed as GDTs.

Multi-modal learning. Videos, unlike images, are a rich source of a variety of modalities such as speech, audio, and optical flow, and their correlation can be used as a supervisory signal. This



Fig. 1: Schematic overview of our framework. A: Hierarchical sampling process of generalized transformations T = t_M ∘ … ∘ t_1 for the multi-modal training case study. B: Subset of the c(T, T′) contrast matrix, which shows which pairs are repelling (0) and which are attracting (1) (see text for details). C: With generalized data transformations (GDTs), the network learns a meaningful embedding by learning desirable invariances and distinctiveness to transformations (realigned here for clarity) across modalities and time. The embedding is learned via noise contrastive estimation against clips of other source videos. Illustrative videos taken from YouTube (Google, 2020).
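The hierarchical sampling of T = t_M ∘ … ∘ t_1 in panel A can be mimicked in a few lines: each level draws one transformation (source clip, time shift, modality projection), and the composition is applied in order to a toy nested dataset. All names and the toy data layout below are hypothetical illustrations, not the paper's actual pipeline.

```python
import random

# Hypothetical hierarchical sampler: level 1 picks a source video, level 2
# a temporal shift, level 3 a modality projection. The returned callable
# applies the composed transformation T = t3 ∘ t2 ∘ t1 to a toy dataset
# laid out as dataset[source][shift][modality].
def sample_gdt(num_sources=4):
    i = random.randrange(num_sources)             # t1: choose source video
    shift = random.choice([0, 1])                 # t2: choose time shift
    modality = random.choice(["video", "audio"])  # t3: choose modality
    def T(dataset):
        return dataset[i][shift][modality]        # apply t1, then t2, then t3
    return T, (i, shift, modality)
```

Sampling many such transformations and evaluating a contrast function over their descriptor tuples yields the attract/repel structure shown in panel B.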

