LEARNING SPATIOTEMPORAL FEATURES VIA VIDEO AND TEXT PAIR DISCRIMINATION

Abstract

Current video representations rely heavily on learning from manually annotated video datasets, which are time-consuming and expensive to acquire. We observe that videos are naturally accompanied by abundant text information, such as YouTube titles and Instagram captions. In this paper, we leverage this visual-textual connection to learn spatiotemporal features in an efficient weakly-supervised manner. We present a general cross-modal pair discrimination (CPD) framework to capture the correlation between a video and its associated text. We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate its effectiveness. Without further fine-tuning, the learnt models obtain competitive results for action classification on Kinetics under the linear classification protocol. Moreover, our visual model provides an effective initialization for fine-tuning on downstream tasks, yielding a remarkable performance gain for action recognition on UCF101 and HMDB51 compared with existing state-of-the-art self-supervised training methods. In addition, our CPD demonstrates that pre-training on a relatively small dataset can yield performance comparable to that of methods using an order of magnitude more data, which is meaningful and practical for scenarios with limited computational resources.

1. INTRODUCTION

Deep learning has made remarkable progress in visual recognition in both the image and video domains (Krizhevsky et al., 2012; He et al., 2016; Carreira & Zisserman, 2017; Feichtenhofer et al., 2018) by training powerful neural networks on large-scale manually annotated datasets (e.g., ImageNet (Deng et al., 2009) and Kinetics (Kay et al., 2017)). More importantly, it is well established that supervised pre-training on large-scale datasets benefits downstream tasks (e.g., object detection (Ren et al., 2015), pose estimation (He et al., 2017), and temporal action detection (Zhao et al., 2017)), in particular when the target datasets are relatively small. Yet, annotating a large-scale dataset for training such deep neural networks is costly and time-consuming, and even more challenging for video due to its varied temporal structure and complex semantics. As a result, existing video datasets are still smaller than ImageNet in terms of training samples and classes. On the other hand, videos typically contain richer structure with abundant side information such as motion (Diba et al., 2019; Ng et al., 2018), audio (Arandjelovic & Zisserman, 2017; Korbar et al., 2018), and text (Miech et al., 2019; Sun et al., 2019b). These associated modalities are thus expected to provide useful cues for learning video representations more efficiently. Language is probably the most natural and direct way to describe the semantic content of a video, and the associated textual information can be easily acquired when collecting video datasets (Rohrbach et al., 2017; Miech et al., 2019) from the Internet or from movies. We argue that the correlation between a clip and its associated text can serve as an alternative supervision signal for learning video representations from scratch.
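To make the pair-discrimination idea concrete: within a batch, a video embedding should be most similar to the embedding of its own associated text, and vice versa, which can be expressed as a contrastive objective over the batch similarity matrix. The sketch below is illustrative only, not the paper's exact formulation; the function names, the symmetric cross-entropy form, and the temperature value are our assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_modal_pair_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive pair-discrimination loss.

    For a batch of B (video, text) pairs, the B x B similarity matrix
    has the true pairs on its diagonal; the loss is cross-entropy that
    pushes each video toward its own text (and each text toward its
    own video) against the other in-batch samples as negatives.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature          # (B, B) scaled cosine similarities
    labels = np.arange(len(v))              # positives on the diagonal

    def ce(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # video-to-text and text-to-video directions, averaged
    return 0.5 * (ce(logits) + ce(logits.T))
```

When the two modalities are perfectly aligned the diagonal dominates and the loss approaches zero, while randomly paired embeddings yield a loss near log B; minimizing it therefore drives the video encoder to agree with the text encoder on true pairs.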
This differs from some recent works (Sun et al., 2019b; Miech et al., 2019), in which such abundant textual information has been used to learn a high-level visual-text embedding applied to text-to-video retrieval or video captioning. Intuitively, it is more challenging to learn a general visual representation solely from text information without any human annotation, owing to the heavy noise in text, the lack of a careful initialization, and the difficulty of designing an effective objective. In this paper, we aim to learn effective video representations from noisy and diverse textual information, which could serve as the basis for a variety of downstream tasks. Basically, we learn a

