LEARNING SPATIOTEMPORAL FEATURES VIA VIDEO AND TEXT PAIR DISCRIMINATION

Abstract

Current video representations heavily rely on learning from manually annotated video datasets, which are time-consuming and expensive to acquire. We observe that videos are naturally accompanied by abundant text information such as YouTube titles and Instagram captions. In this paper, we leverage this visual-textual connection to learn spatiotemporal features in an efficient weakly-supervised manner. We present a general cross-modal pair discrimination (CPD) framework to capture the correlation between a video and its associated text. We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate its effectiveness. Without further fine-tuning, the learnt models obtain competitive results for action classification on Kinetics under the linear classification protocol. Moreover, our visual model provides an effective initialization for fine-tuning on downstream tasks, yielding a remarkable performance gain for action recognition on UCF101 and HMDB51 compared with existing state-of-the-art self-supervised training methods. In addition, our CPD demonstrates that pre-training on a relatively small dataset can yield performance comparable to methods using orders of magnitude more data, which is meaningful and practical for scenarios with limited computational resources.

1. INTRODUCTION

Deep learning has made remarkable progress in visual recognition in both the image and video domains (Krizhevsky et al., 2012; He et al., 2016; Carreira & Zisserman, 2017; Feichtenhofer et al., 2018) by training powerful neural networks on large-scale manually annotated datasets (e.g., ImageNet (Deng et al., 2009) and Kinetics (Kay et al., 2017)). More importantly, it is well-established that supervised pre-training on large-scale datasets benefits downstream tasks (e.g., object detection (Ren et al., 2015), pose estimation (He et al., 2017), and temporal action detection (Zhao et al., 2017)), in particular when the target datasets are relatively small. Yet, annotating a large-scale dataset for training such deep neural networks is costly and time-consuming, and even more challenging for video due to its varied temporal structure and complex semantics. As a result, existing video datasets are still smaller than ImageNet in terms of training samples and classes. On the other hand, videos typically contain richer structure with abundant side information such as motion (Diba et al., 2019; Ng et al., 2018), audio (Arandjelovic & Zisserman, 2017; Korbar et al., 2018), and text (Miech et al., 2019; Sun et al., 2019b). These associated modalities are thus expected to provide useful cues for learning video representations more efficiently. Language is probably the most natural and easy way to describe the semantic content of a video, and the associated textual information can easily be acquired when collecting video datasets (Rohrbach et al., 2017; Miech et al., 2019) from the Internet or from movies. We argue that the correlation between a clip and its associated text can serve as an alternative supervision signal for learning video representations from scratch.
This is different from some recent works (Sun et al., 2019b; Miech et al., 2019), in which this abundant textual information has been used to learn a high-level visual-text embedding applied to text-to-video retrieval or video captioning. Intuitively, it is more challenging to learn a general visual representation solely from text information without any human annotation, due to the large amount of noise in the text, the lack of careful initialization, and the difficulty of designing an effective objective. In this paper, we aim to learn effective video representations from noisy and diverse textual information, which can serve as the basis for a variety of downstream tasks. Basically, we learn a mapping of text and video into a shared embedding space and leverage their correlation as a supervision signal. The technical difficulty is to design an effective objective function that is capable of modeling this complex visual-textual correlation and can also be easily optimized when training from scratch on noisy datasets. Inspired by unsupervised feature learning in images (Wu et al., 2018; Tian et al., 2019), we present a cross-modal pair discrimination (CPD) framework, which tries to classify each video-text pair into its own class via a non-parametric classifier. To address the computational issues imposed by the huge number of pair classes, we adapt the noise-contrastive estimation (NCE) technique (Gutmann & Hyvärinen, 2010) to approximate the original loss function. Specifically, we learn the CPD framework from web videos with their associated titles or captions, which can be directly crawled from web platforms such as YouTube (Kay et al., 2017) and Instagram (Duan et al., 2020). We utilize off-the-shelf language models such as BERT (Devlin et al., 2019) or Word2vec (Mikolov et al., 2013) and devise a curriculum learning strategy to progressively train the video models.
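The pair-discrimination idea can be illustrated with a small contrastive sketch: each video-text pair acts as a positive "class" of its own, while the other pairs in a batch serve as noise samples, roughly mirroring how NCE approximates the softmax over all pair classes. This is a simplified, hypothetical illustration (a batch-wise symmetric softmax rather than the exact non-parametric classifier used in the paper); the function names and temperature value are our own, not from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (avoids division by zero for all-zero input)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cpd_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric cross-modal contrastive loss over a batch of embedding pairs.

    Each (video_i, text_i) pair is its own positive class; every other
    text (resp. video) in the batch acts as a noise sample, which is the
    spirit of the NCE approximation to pair discrimination.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    loss = 0.0
    for i in range(n):
        # Exponentiated cosine similarities of video i vs. all texts, and text i vs. all videos.
        sims_vt = [math.exp(sum(a * b for a, b in zip(v[i], t[j])) / temperature) for j in range(n)]
        sims_tv = [math.exp(sum(a * b for a, b in zip(t[i], v[j])) / temperature) for j in range(n)]
        # Negative log-probability of picking the true partner out of the batch.
        loss += -math.log(sims_vt[i] / sum(sims_vt))
        loss += -math.log(sims_tv[i] / sum(sims_tv))
    return loss / (2 * n)
```

Under this sketch, a batch whose videos and texts are correctly paired yields a near-zero loss, while a shuffled pairing yields a large one, so gradient descent on the two encoders pulls matched clips and captions together in the shared space.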
We first test the generalization ability of the video representation learned by CPD on the Kinetics dataset (Kay et al., 2017) using shallow classifiers such as k-NN and a linear classifier. Our learned spatiotemporal features obtain promising results that are comparable to some supervised learning methods on Kinetics. Then, we investigate the transferability of the learned features by fine-tuning on the Kinetics (Kay et al., 2017), UCF101 (Soomro et al., 2012), and HMDB51 (Kuehne et al., 2011) datasets, demonstrating that our method obtains superior performance to previous state-of-the-art self-supervised methods and comparable performance to very recent methods that use orders of magnitude more videos (70M-100M vs. 0.3M).

2. RELATED WORK

Self/Weakly Supervised Representation Learning. Self-supervised representation learning has been popular in both the image and video domains through the design of various proxy tasks. In the image domain, for instance, these tasks include predicting image context (Doersch et al., 2015), counting objects (Noroozi et al., 2017), colorizing grayscale images (Zhang et al., 2016), and keeping global and local consistency (Hjelm et al., 2019). In the video domain, typical examples include frame prediction (Diba et al., 2019; Vondrick et al., 2016), optical flow estimation (Ng et al., 2018; Zhou et al., 2017; Jayaraman & Grauman, 2017), instance tracking (Wang & Gupta, 2015; Wang et al., 2019b), and temporal order or structure prediction (Misra et al., 2016; Fernando et al., 2017; Wei et al., 2018; Xu et al., 2019a). These learnt representations may capture some aspects of low-level image or video structure, but are generally outperformed by those using cross-modal information. Several cross-modal self-supervised tasks have been proposed to enhance single-modality representation power, a typical example being audio-visual representation learning (Aytar et al., 2016; Arandjelovic & Zisserman, 2017; Korbar et al., 2018). Meanwhile, some weakly-supervised methods utilize web supervision obtained automatically, such as query IDs (Chen & Gupta, 2015; Ghadiyaram et al., 2019) and hashtags (Mahajan et al., 2018). Concurrent work (Miech et al., 2020) learns video representations using narration as supervision with instructional videos (e.g., HowTo100M (Miech et al., 2019)), but is limited to this video type. Our CPD is applicable to more general video types: we experiment with a much smaller dataset (0.3M vs. 100M) of both professionally generated (PGC) and user-generated (UGC) videos, yet achieve similar performance on UCF101 and HMDB51. Concurrent work (Stroud et al., 2020) proposed a similar framework but required more training videos (0.3M vs.
70M) and richer textual information to obtain performance similar to ours.

Motion, Audio, and Text. Multi-modal information in videos provides natural cues for learning deep models. Motion or temporal information has been used to design proxy tasks for cross-modal learning, such as optical flow estimation or tracking (Ng et al., 2018; Wang & Gupta, 2015), frame prediction (Diba et al., 2019; Vondrick et al., 2016), and high-level temporal structure (Wei et al., 2018; Xu et al., 2019a; Fernando et al., 2017). As most videos contain synchronized audio and visual signals, audio has served as another common modality to supervise visual learning (Aytar et al., 2016; Arandjelovic & Zisserman, 2017; Korbar et al., 2018). However, both motion and audio are relatively low-level signals and may lack the high-level semantics needed for cross-modal learning. Speech or text has been widely studied as another cross-modal setting in video learning (Sun et al., 2019b; Miech et al., 2019; Dong et al., 2019; Miech et al., 2018; Pan et al., 2016; Plummer et al., 2017). These works mainly aim to learn a joint video-text embedding in which visual and textual cues lie close together if they are semantically related. However, these works focused on learning high-level visual-

