MULTI-MODAL SELF-SUPERVISION FROM GENERALIZED DATA TRANSFORMATIONS

Abstract

In the image domain, excellent representations can be learned by inducing invariance to content-preserving transformations, such as image distortions. In this paper, we show that, for videos, the situation is more complex, and that better results can be obtained by accounting for the interplay between invariance, distinctiveness, multiple modalities, and time. We introduce Generalized Data Transformations (GDTs) as a way to capture this interplay. GDTs reduce most previous self-supervised approaches to a choice of data transformations, even when this was not the case in the original formulations. They also allow one to choose whether the representation should be invariant or distinctive w.r.t. each effect and tell which combinations are valid, thus allowing us to explore the space of combinations systematically. We show in this manner that being invariant to certain transformations and distinctive to others is critical to learning effective video representations, improving the state-of-the-art by a large margin, and even surpassing supervised pretraining. We demonstrate results on a variety of downstream video and audio classification and retrieval tasks, on datasets such as HMDB-51, UCF-101, DCASE2014, ESC-50 and VGG-Sound. In particular, we achieve new state-of-the-art accuracies of 72.8% on HMDB-51 and 95.2% on UCF-101.

1. INTRODUCTION

Recent works such as PIRL (Misra & van der Maaten, 2020), MoCo (He et al., 2019) and SimCLR (Tian et al., 2019) have shown that it is possible to pre-train state-of-the-art image representations without the use of any manually-provided labels. Furthermore, many of these approaches use variants of noise contrastive learning (Gutmann & Hyvärinen, 2010). Their idea is to learn a representation that is invariant to transformations that leave the meaning of an image unchanged (e.g. geometric distortion or cropping) and distinctive to changes that are likely to alter its meaning (e.g. replacing an image with another chosen at random). An analysis of such works shows that a dominant factor for performance is the choice of the transformations applied to the data. So far, authors have explored ad-hoc combinations of several transformations (e.g. random scale changes, crops, or contrast changes). Videos further allow one to leverage the time dimension and multiple modalities. For example, Arandjelovic & Zisserman (2017); Owens et al. (2016) learn representations by matching visual and audio streams, as a proxy for objects that have a coherent appearance and sound. Their formulation is similar to noise contrastive ones, but does not quite follow the pattern of expressing the loss in terms of data transformations. Others (Chung & Zisserman, 2016; Korbar et al., 2018; Owens & Efros, 2018) depart further from standard contrastive schemes by learning representations that can tell whether visual and audio streams are in sync or not; the difference here is that the representation is encouraged to be distinctive rather than invariant to a time shift. Overall, it seems that finding an optimal noise contrastive formulation for videos will require combining several transformations while accounting for time and multiple modalities, and understanding how invariance and distinctiveness should relate to the transformations.
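To make the invariance/distinctiveness trade-off concrete, the following is a minimal numpy sketch of the InfoNCE-style objective that these contrastive methods build on: the anchor embedding is pulled toward a "positive" (a transformed view of the same sample, encoding invariance) and pushed away from "negatives" (other samples, encoding distinctiveness). The function name, temperature value, and embedding dimensions are illustrative assumptions, not details from this paper.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for a single anchor embedding.

    `positive` is an embedding the representation should be invariant to
    (a content-preserving transformation of the anchor's sample);
    `negatives` (shape [K, D]) are embeddings it should be distinctive from.
    """
    def normalize(v):
        # Project embeddings onto the unit sphere (cosine similarity).
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a, p = normalize(anchor), normalize(positive)
    n = normalize(np.asarray(negatives))
    pos_sim = np.dot(a, p) / temperature   # similarity to the positive
    neg_sim = (n @ a) / temperature        # similarities to the negatives
    logits = np.concatenate(([pos_sim], neg_sim))
    # Cross-entropy with the positive treated as the correct "class".
    return -pos_sim + np.log(np.sum(np.exp(logits)))

rng = np.random.default_rng(0)
x = rng.normal(size=8)
# Positive = slightly perturbed view of x (invariance is easy to satisfy).
loss_good = info_nce_loss(x, x + 0.01 * rng.normal(size=8),
                          rng.normal(size=(16, 8)))
# Positive = unrelated random vector (invariance is violated).
loss_bad = info_nce_loss(x, rng.normal(size=8),
                         rng.normal(size=(16, 8)))
```

In this toy run the loss is far lower when the positive really is a near-duplicate view of the anchor, which is exactly the signal the representation is trained to exploit; generalized data transformations extend the same template by specifying, per transformation, whether a pair counts as positive or negative.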
However, the ad-hoc nature of these choices in previous contributions makes a systematic exploration of this space rather difficult. In this paper, we propose a solution to this problem by introducing the Generalized Data Transformations (GDT; fig. 1) framework. GDTs reduce most previous methods, contrastive or not, to a noise contrastive formulation that is expressed in terms of data transformations only, making it

