REVITALIZE REGION FEATURE FOR DEMOCRATIZING VIDEO-LANGUAGE PRE-TRAINING OF RETRIEVAL

Anonymous

Abstract

Recent dominant methods for video-language pre-training (VLP) learn transferable representations from raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval. Despite the impressive results, VLP research has become extremely expensive, requiring massive data and long training time, which prevents further exploration. In this work, we revitalize region features of sparsely sampled video clips to significantly reduce both spatial and temporal visual redundancy, democratizing VLP research while achieving state-of-the-art results. Specifically, to fully explore the potential of region features, we introduce a novel bidirectional region-word alignment regularization that properly optimizes the fine-grained relations between regions and certain words in sentences, eliminating the domain/modality disconnection between pre-extracted region features and text. Extensive results on downstream video-language retrieval tasks over four datasets demonstrate the superiority of our method in both effectiveness and efficiency, e.g., our method achieves competitive results with 80% fewer data and 85% less pre-training time compared to the most efficient VLP method so far (Lei et al., 2021).

1. INTRODUCTION

Video-language pre-training (VLP) (Lei et al., 2021; Li et al., 2020a; Miech et al., 2020), which jointly learns video and language representations in a self-supervised manner, has become the most popular practice for video-language retrieval (Lee et al., 2018; Liu et al., 2019a). Recently, end-to-end methods (Bain et al., 2021; Zellers et al., 2021) that learn video representations from raw pixels have dominated due to their strong performance on downstream tasks. Despite significant progress, these methods are quite data-hungry because of their large number of model parameters and uncurated raw inputs. The pre-training stage thus turns out to be inefficient and expensive, with massive pre-training data and long pre-training time, making it difficult for researchers to pursue VLP research. Previous work (Lei et al., 2021) attempts to lower the barrier for VLP by removing visual redundancy. It points out that video clips with sparsely sampled frames are sufficient to capture the key semantics for pre-training, since adjacent frames often contain similar scenes. This effort enables more efficient VLP with competitive downstream performance. Beyond the temporal visual redundancy, we argue that, in contrast to text with its highly abstract semantics, each frame of a video clip also carries heavy spatial redundancy. Towards this end, we further propose to remove the redundant spatial information in sparsely sampled video clips, based on the observation that a frame is actually worth around 30 objects (supported by experiments in Section 4.4). Specifically, we revitalize the offline region features that were all the rage in image-language tasks (Liu et al., 2019a) to encourage efficient VLP. Region features are generally pre-extracted by a pre-trained object detector (Anderson et al., 2018).
Rather than the dense, continuous visual signal of raw pixels, region features are sparsely distributed and carry compact information about the salient visual contents, which are the most useful for video-text understanding. This sparse sampling significantly reduces the complexity of the attention mechanism, which allows our model to have larger capacity with fewer FLOPs. We further advocate "less is more" as one more step towards democratizing VLP research. As is known, methods using off-the-shelf features (Lee et al., 2018) have been phased out in visual-language tasks due to their inferior downstream performance. Previous work (Lei et al., 2021) attributes the unsatisfactory pre-training performance of pre-extracted features to their disconnection from the current domain and the language modality. We clarify that such disconnections can be properly eliminated by imposing fine-grained cross-modality alignment regularization. Specifically, besides the common late-fusion regularization on the global visual-text representations (Bain et al., 2021), we introduce a novel bidirectional region-word alignment regularization based on the observation that objects extracted from video frames are naturally associated with certain words in the corresponding sentences. For instance, as demonstrated in Fig. 1, the keywords "people", "car" and "bicycle" share high-level semantics with the cropped regions (highlighted with bounding boxes), respectively. To model and promote such a detailed cross-modality relationship, we build bidirectional connections between extracted regions and words. In the Region→Word direction, we estimate the region-to-sentence similarity from the similarities between each region and all the words in a sentence. The average region-to-sentence similarity over all the regions of a video clip is treated as the video-to-sentence similarity, which is maximized for positive pairs.
Similarly, the Word→Region direction measures and optimizes the sentence-to-video similarity according to the similarities between each word and the corresponding regions. We surprisingly find that the proposed fine-grained region-word alignment constraints can also be seamlessly integrated into end-to-end VLP methods (Bain et al., 2021), yielding promising performance gains. In summary, our contributions are three-fold: (1) We revitalize region features towards democratizing VLP by removing both temporal and spatial visual redundancy. Specifically, our efficient VLP model maintains state-of-the-art performance on multiple downstream tasks with 80% fewer data and 85% less pre-training time than ClipBERT, the most efficient end-to-end VLP method so far. (2) We clarify that the inferior performance of off-the-shelf features in previous attempts (Li et al., 2020a; Zhu & Yang, 2020; Sun et al., 2019; Yu et al., 2018; Gabeur et al., 2020) lies in their sub-optimal learning regularization. We tackle this challenge with a newly proposed bidirectional region-word constraint, which optimizes fine-grained visual-text relations and properly eliminates the domain/modality disconnections of the region features. (3) Our method shows competitive results on four downstream video-language retrieval tasks. We further observe that the introduced region-word alignment regularization also effectively boosts the end-to-end method (Bain et al., 2021) with noticeable improvements.
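The bidirectional alignment above can be sketched in code. The following is a minimal illustration, not the paper's exact implementation: we assume cosine similarity between L2-normalized features and max-pooling over the opposite modality as the per-token aggregation (the paper does not specify the aggregation function here), and the function names are hypothetical.

```python
import torch
import torch.nn.functional as F


def region_to_sentence_sim(regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
    """Region->Word direction: score one video clip against one sentence.

    regions: (R, D) pre-extracted region features of a video clip
    words:   (W, D) word features of a sentence

    For each region, take its similarity to all words (here: the best-matching
    word), then average over all regions to obtain the video-to-sentence
    similarity, which is maximized for positive pairs during training.
    """
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    sim = regions @ words.t()                # (R, W) cosine similarities
    per_region = sim.max(dim=1).values       # best-matching word per region
    return per_region.mean()                 # average over regions


def sentence_to_video_sim(regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
    """Word->Region direction: the symmetric score, aggregating over words."""
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    sim = words @ regions.t()                # (W, R) cosine similarities
    per_word = sim.max(dim=1).values         # best-matching region per word
    return per_word.mean()                   # average over words
```

In practice both scores would feed a contrastive objective over a batch of video-sentence pairs, pulling positive pairs together and pushing negatives apart; the sketch only shows how the fine-grained region-word similarities are aggregated into clip-level scores.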

2. RELATED WORK

Video-Language Pre-training. Early VLP methods (Li et al., 2020a; Zhu & Yang, 2020; Sun et al., 2019; Yu et al., 2018; Gabeur et al., 2020) use models pre-trained on other tasks to pre-extract video representations. Some of them (Li et al., 2020a; Zhu & Yang, 2020; Sun et al., 2019) utilize action recognition backbones (Feichtenhofer et al., 2019; Hara et al., 2018), designed with 2D (He et al., 2016) and 3D (Hara et al., 2018) CNNs, to capture spatial and temporal information in videos. Others (Yu et al., 2018; Liu et al., 2019b; Gabeur et al., 2020; Wang et al., 2021b) fuse multiple "experts" trained on different modalities, such as audio classification (Hershey et al., 2017), OCR (Gabeur et al., 2020) and image classification (Huang et al., 2017), to fully exploit cross-modal high-level semantics in videos. Recently, end-to-end models (Miech et al., 2020; Lei et al., 2021; Bain et al., 2021; Zellers et al., 2021; Fu et al., 2021) have been proposed. Some (Miech et al., 2020; Lei et al., 2021; Zellers et al., 2021) utilize CNNs to extract video features, while others (Bain et al., 2021; Fu et al., 2021) replace CNNs with ViT (Dosovitskiy et al., 2021) to build pure Transformer-based VLP models.



Figure 1: Region-word alignment and results. (a) Region-word alignment reasons about the detailed correspondence between salient regions and words. (b) and (c) demonstrate that our method significantly improves text-to-video retrieval while substantially reducing pre-training time.

