CLIP-VIP: ADAPTING PRE-TRAINED IMAGE-TEXT MODEL TO VIDEO-LANGUAGE ALIGNMENT

Abstract

Pre-trained image-text models, like CLIP, have demonstrated the strong power of vision-language representations learned from large-scale web-collected image-text data. In light of the well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, adapting image-text pre-trained models to video-text pre-training (i.e., post-pretraining) has not demonstrated a significant advantage yet. In this paper, we tackle this challenge by raising and addressing two questions: 1) what are the factors hindering post-pretraining CLIP from improving performance on video-text tasks, and 2) how to mitigate the impact of these factors. Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have large impacts. Motivated by these observations, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model achieves state-of-the-art results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.

1. INTRODUCTION

In the past few years, vision-language pre-training has achieved great success in cross-modal representation learning from large-scale web-crawled data (Radford et al., 2021; Jia et al., 2021; Li et al., 2021; Wang et al., 2021b; Zellers et al., 2021; 2022; Bain et al., 2021). Among them, image-text pre-trained models (Radford et al., 2021; Jia et al., 2021) have shown powerful capability for various downstream tasks, including visual understanding (Gu et al., 2021; Wang et al., 2021a; Rao et al., 2022), image-text generation (Patashnik et al., 2021; Mokady et al., 2021) and so on (Guzhov et al., 2022; Zhang et al., 2022). In light of the well-learned and enriched visual representation, some works directly adapt image-text pre-trained models to video-text downstream tasks without further pre-training on video data (Luo et al., 2021; Fang et al., 2021; Gorti et al., 2022; Zhao et al., 2022), while still outperforming models pre-trained on video data (Xu et al., 2021b; Bain et al., 2021). Utilizing an existing powerful image-text pre-trained model for further video-text pre-training (i.e., post-pretraining) can reduce the required training cost by making good use of the knowledge learned from images. However, adapting image-text pre-trained models to video-text data for post-pretraining has not demonstrated a significant advantage yet, and thus is still under-explored. A preliminary study is conducted by CLIP4Clip (Luo et al., 2021), which adopts MeanPooling by averaging multiple frame features based on the CLIP model on a subset of Howto100M (Miech et al., 2019). However, the improvement over directly using the image-text pre-trained model is marginal in both zero-shot and fine-tuning settings. In this paper, we aim to explore how to effectively adapt the image-text pre-trained model (e.g., CLIP) to video-language representation learning for video-text tasks (e.g., text-to-video retrieval).
To unleash the power of video data to adapt image-text pre-trained models for post-pretraining, we conduct several preliminary experiments to figure out the challenges that hinder post-pretraining. First, we explore post-pretraining an image-text pre-trained model (i.e., CLIP) with MeanPooling on video-text datasets with different scales, including WebVid-2.5M (Bain et al., 2021) and HD-VILA-100M (Xue et al., 2022). The result shows that the scale of data is critical for video-text post-pretraining. Small-scale data makes the model prone to over-fitting the new data, while the knowledge learned from image-text pairs is suppressed and the performance is reduced. Second, we investigate the language domain gap between pre-training data and downstream data. By calculating the Normalized Mutual Information (NMI) on clusters of text features, we find that there is a large domain gap between the subtitles that are used in large-scale video-text pre-training data and the descriptive texts in downstream tasks. To mitigate the impact of the above factors, we propose CLIP-ViP to adapt the pre-trained image-text model CLIP for video-text pre-training. First, we introduce auxiliary captions that have a smaller language domain gap with downstream data into existing large-scale video-text data. A video captioning model may cause data leakage because it is trained on the same datasets as the video-text downstream tasks, and image captioning models offer better captioning capability. We therefore adopt an image captioning model to generate an auxiliary caption for the middle frame of each video. In order to adapt a Transformer-based vision encoder to process both images and videos with minimal modification, we then propose video proxy tokens and design a proxy-guided video attention mechanism for the Vision Transformer (ViT).
Specifically, during attention computation in each block, video proxy tokens can interact with all tokens, while patch tokens only interact with video proxy tokens and patch tokens within the same frame. Our vision encoder adds only negligible parameters and computation compared to the vanilla Vision Transformer while increasing generality and extendability. To facilitate cross-modal representation learning from both caption-frame and video-subtitle data types at the same time, we propose an Omnisource Cross-modal Learning (OCL) method for pre-training and study a series of variants to find the best fusion strategy. Our experimental results show that our approach improves the performance of CLIP on text-to-video retrieval tasks by a large margin. We also conduct ablation studies to verify the effectiveness of each part of our approach. Our contributions are summarized as follows: (1) We are one of the first to explore factors that hinder video post-pretraining on pre-trained image-text models; (2) We propose CLIP-ViP, which can effectively leverage an image-text pre-trained model for post-pretraining; (3) We conduct extensive experiments to verify the effectiveness of our method. Our model outperforms the state-of-the-art results by a large margin on four widely-used benchmarks.

2. RELATED WORK

Vision-Language Pre-Training. End-to-end models (Lei et al., 2021; Xue et al., 2022; Zellers et al., 2021; Fu et al., 2021; Huang et al., 2020; 2021b; Xue et al., 2021; Li et al., 2021; Kim et al., 2021; Huang et al., 2021a; Sun et al., 2022) for vision-language pre-training are replacing the traditional approach of using visual features pre-extracted by off-the-shelf models (Sun et al., 2019; Xu et al., 2021b; Zhu & Yang, 2020; Li et al., 2020b; a; Chen et al., 2020). Training end-to-end models on large-scale web-collected data has also gradually demonstrated big advantages (Radford et al., 2021; Jia et al., 2021; Xue et al., 2022; Zellers et al., 2021; 2022). Unlike images that have alt-texts, large-scale video datasets suitable for pre-training usually use subtitles as text sources (Miech et al., 2019; Xue et al., 2022). Subtitles are much noisier than alt-texts. According to Miech et al. (2019), typical examples of incoherence include the content producer asking viewers to subscribe to their channel, talking about something unrelated to the video, or describing something before or after it happens. Bain et al. collect a video dataset WebVid (Bain et al., 2021) with textual description annotations. Their texts are well aligned with the video and do not suffer from ASR errors. However, the vast majority of WebVid videos are sourced from a stock footage website, so scaling up is limited. The video-subtitle data is more easily accessible on the web and thus suitable for scaling up. In this paper, we investigate the unfavorable factors of video-subtitle data and explore how to mitigate the impact of these factors.
Pre-trained Models for Video-Text Retrieval. The great success of CLIP has demonstrated its unprecedented power on various downstream tasks, including vision understanding (Gu et al., 2021; Wang et al., 2021a; Rao et al., 2022), image-text generation (Patashnik et al., 2021; Mokady et al., 2021) and so on (Guzhov et al., 2022; Zhang et al., 2022). By contrastive learning on large-scale image-text pairs, CLIP learns enriched visual concepts for images. Recently, some works directly transfer CLIP to video-text retrieval without further pre-training on video data (post-pretraining) (Luo et al., 2021; Fang et al., 2021; Gorti et al., 2022; Zhao et al., 2022; Wang et al., 2022c). These works take the performance of video-text retrieval to a new level, outperforming existing models pre-trained on video data (Xu et al., 2021b; Bain et al., 2021; Xue et al., 2022; Ge et al., 2022; Wang et al., 2022a). They transfer CLIP from the perspective of feature aggregation (Luo et al., 2021; Zhao et al., 2022; Fang et al., 2021; Gorti et al., 2022) or representation alignment (Fang et al., 2021; Gorti et al., 2022; Wang et al., 2022c). In parallel with these works, we study how to post-pretrain effectively with video data on top of CLIP, and our model can be combined with these approaches.

3. PRELIMINARY ANALYSIS

In this section, we first study the impact of the data scale for adapting image-text pre-training to video-text post-pretraining, and then investigate how the language domain gap affects the adaption.

3.1. POST-PRETRAINING WITH DIFFERENT DATA SCALES

To study the effectiveness of different data scales, we use the CLIP-ViT-B/32 model (Radford et al., 2021) as the base image-text pre-trained model and adopt MeanPooling for video adaption like CLIP4Clip (Luo et al., 2021) by averaging multiple frame features as the video feature. Two open-domain video-text datasets are used: WebVid-2.5M (Bain et al., 2021) with 2.5 million pairs and HD-VILA-100M (Xue et al., 2022) with 100M pairs. We also adopt a subset of HD-VILA-100M containing a random 10% of the data (namely HD-VILA-10M) as a middle setting. We run the same number of steps on all settings, equivalent to one epoch on HD-VILA-100M. We uniformly sample 12 frames from each video and apply the same hyper-parameters as described in Section 5 for all settings. During post-pretraining, we evaluate the pre-trained models by fine-tuning on the MSR-VTT text-to-video retrieval task. Figure 1 shows the performance trend. We observe an overfitting phenomenon: continued post-pretraining leads to a performance drop, and the drop is more significant for smaller data (e.g., WebVid-2.5M and HD-VILA-10M). As CLIP is pre-trained on 400 million image-text pairs, further training on small data makes the model tend to overfit the new data while the implicit knowledge learned from the image-text pairs degrades. As a consequence, the performance drops, even below that of using CLIP directly. Thus we adopt HD-VILA-100M due to its large scale and diverse categories.
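The MeanPooling adaptation used in this study can be illustrated with a minimal sketch. It assumes per-frame embeddings are already available as plain lists of floats. The function names and the final L2 normalization (convenient for cosine-similarity retrieval) are illustrative assumptions, not details taken from CLIP4Clip or from our released code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool_video_feature(frame_features):
    """Average per-frame embeddings into one video embedding.

    frame_features: list of T frame embeddings, each a list of D floats.
    The L2 normalization at the end is an assumption made for this sketch.
    """
    if not frame_features:
        raise ValueError("need at least one frame feature")
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    pooled = [sum(f[d] for f in frame_features) / num_frames for d in range(dim)]
    return l2_normalize(pooled)


# Toy example: 3 frames with 2-dimensional features (real CLIP features are much wider).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_feature = mean_pool_video_feature(frames)
```

Because the frames are simply averaged, their order has no effect on the result, which is why this baseline cannot model temporality.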

3.2. LANGUAGE DOMAIN GAP WITH DOWNSTREAM DATA

It is intuitive that pre-training on data with the same domain as downstream data can benefit downstream tasks. For most video-text tasks like video-text retrieval, texts are descriptive sentences of videos (i.e., captions). In contrast, for HD-VILA-100M, which we will use for pre-training, the texts are auto-transcribed subtitles, whose relevance to the visual content is very different from that of descriptive texts. Meanwhile, auto-transcribed subtitles suffer from irrelevance, misalignment, and ASR errors (Miech et al., 2019). To better explore the language domain gap between pre-training data and downstream data, we measure the inconsistency by calculating the dissimilarity between their language features. For downstream language data, we choose two typical video-text retrieval datasets: MSR-VTT (Xu et al., 2016) and DiDeMo (Anne Hendricks et al., 2017). For pre-training language data, we select four types: video subtitles of HD-VILA-100M (HD-VILA_sub), video captions of WebVid-2.5M, image captions of MS-COCO (Lin et al., 2014), and web-collected alt-texts of Conceptual Caption 12M (Changpinyo et al., 2021). In addition, we analyze auto-generated captions of HD-VILA-100M (HD-VILA_cap), which will be introduced in Section 4.


We use a Transformer encoder initialized from CLIP (Radford et al., 2021) to extract text features. To quantify the language domain gap between pre-training and downstream data, we first mix their text features and then use K-means to get two clusters. Then we calculate the Normalized Mutual Information (NMI) between the cluster labels and the ground-truth labels indicating whether a text comes from pre-training or downstream data. A larger NMI value means that the two types of features are easier to distinguish, and thus there is a larger domain gap. For each comparison, we randomly sample 1000 texts from each type of data 10 times and report the average of the 10 results. We report the results in Table 1. Comparing the values of all pre-training data types, we find that the NMI score between HD-VILA_sub and downstream data is much larger than the others, especially for the MSR-VTT downstream dataset. This indicates that direct training with subtitles may introduce inconsistency with downstream tasks.
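The NMI measurement can be sketched as follows. This is an illustrative sketch, not the script used for Table 1. It assumes the two-way K-means cluster assignments have already been computed and only shows the NMI step. Normalizing by the arithmetic mean of the two entropies is an assumption, since the text does not state which normalization is used, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (natural log) between two labelings of the same items."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )


def normalized_mutual_information(source_labels, cluster_labels):
    """NMI with arithmetic-mean normalization (an assumption of this sketch).

    Returns 0.0 when either labeling is constant, a convention chosen here.
    """
    if len(source_labels) != len(cluster_labels) or not source_labels:
        raise ValueError("labelings must be non-empty and of equal length")
    h_a, h_b = entropy(source_labels), entropy(cluster_labels)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0
    return mutual_information(source_labels, cluster_labels) / ((h_a + h_b) / 2)


# 0 = pre-training text, 1 = downstream text (toy data).
source = [0, 0, 0, 1, 1, 1]
# Clusters that separate the two sources perfectly: large domain gap.
separable = normalized_mutual_information(source, [1, 1, 1, 0, 0, 0])
# Clusters that mix the two sources: small domain gap.
mixed = normalized_mutual_information(source, [0, 1, 0, 1, 0, 1])
```

NMI is invariant to how the cluster ids are named, so no matching between cluster ids and source labels is needed before computing it.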

4. APPROACH

In this section, we will introduce the proposed CLIP-ViP video pre-training framework. To bridge language domain gaps between image and video datasets, we first introduce an in-domain auxiliary data generation method. Then, we propose a novel Video Proxy mechanism to enable the Vision Transformer (ViT) model for both image and video encoding. We further present an Omnisource Cross-modal Learning (OCL) method which can jointly learn cross-modal representation from video-text and image-text pairs.

4.1. IN-DOMAIN AUXILIARY DATA GENERATION

Motivated by the analysis in Section 3, we introduce auxiliary captions into large-scale video-subtitle data to reduce the language domain gap between pre-training and downstream data. We adopt an image captioning model for two reasons. 1) Most SOTA video captioning models are trained with video-text datasets (e.g., MSR-VTT, ActivityNet) which are also used for downstream tasks. By not using them, we avoid data leakage and keep pre-training agnostic to downstream data. 2) The performance of existing video captioning models lags far behind that of image captioning models. Thus, we choose a powerful image captioning model, OFA-Caption (Wang et al., 2022b), to generate one caption for the middle frame of each video in HD-VILA-100M. We use the default setting of the OFA-Caption model. As a result, we generate 100M sentences with a max length of 16 words. This method can be applied to any video data and we will release the generated captions to facilitate future research.
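The generation procedure can be illustrated with a small sketch. Here `caption_fn` is a hypothetical placeholder for any image captioning model (OFA-Caption in our case) and is not a real API. Choosing index `num_frames // 2` as the middle frame and truncating the output to 16 words after generation are assumptions of this sketch. In practice the length limit may instead be enforced inside the captioning model.

```python
def middle_frame_index(num_frames):
    """Index of the middle frame (upper middle when the frame count is even)."""
    if num_frames <= 0:
        raise ValueError("video must contain at least one frame")
    return num_frames // 2


def auxiliary_caption(frames, caption_fn, max_words=16):
    """Caption the middle frame of a video and cap the caption length.

    frames: sequence of decoded frames (any objects accepted by caption_fn).
    caption_fn: hypothetical callable mapping one frame to a caption string.
    """
    frame = frames[middle_frame_index(len(frames))]
    words = caption_fn(frame).split()
    return " ".join(words[:max_words])


# Toy usage with string stand-ins for frames and a dummy captioner.
toy_frames = ["f0", "f1", "f2", "f3", "f4"]
toy_caption = auxiliary_caption(toy_frames, lambda frame: f"caption of {frame}")
```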

4.2. VIDEO PROXY MECHANISM

Since video is an ordered sequence of frames, it is critical to learn the frame aggregation and temporality when transferring to the video domain. Meanwhile, to keep the high generality and extendability of the Vision Transformer (ViT) backbone, we aim to find a simple but effective way to transfer ViT to enable both image and video encoding with minimal modifications. Given a video containing $T$ frames $\{f_1, f_2, ..., f_T\}$, we follow CLIP to divide each frame into $N$ patches: $\{f_t^1, f_t^2, ..., f_t^N \mid t \in [1, T]\}$. Then we add spatio-temporal positional embeddings to each flattened 2D patch:

$$g(f_t^n) = \mathrm{Linear}(f_t^n) + \mathrm{Pos}_s(n) + \mathrm{Pos}_t(t), \quad (1)$$

where $\mathrm{Linear}(\cdot)$ is a linear layer, and $\mathrm{Pos}_s(n)$ and $\mathrm{Pos}_t(t)$ are the learnable spatial and temporal positional embeddings, respectively. The whole video can be divided into $T \times N$ patch tokens. To model spatial information from multiple frames, one simple way is to directly feed all tokens into CLIP's vision encoder and conduct attention across all tokens. However, this method introduces significant conflicts with CLIP. As CLIP is pre-trained on image and text pairs, it has difficulty handling interactions of tokens between images/frames during training. We also verify this by experiments, reported as the Full Attention setting in Table 2. Instead, we introduce a Video Proxy token to act as a proxy that helps each local patch perceive video-level temporal information. Before feeding into CLIP, we concatenate patch tokens with a set of learnable parameters called video proxy tokens: $P = \{p_1, p_2, ..., p_M\}$, where $M$ is the number of video proxy tokens. Then all $T \times N + M$ tokens are fed into the ViT of CLIP. The output of the first video proxy token is regarded as the video's representation. We also design a proxy-guided attention mechanism for the vanilla ViT.
In the attention score calculation of each block, video proxy tokens attend to all tokens, while patch tokens only attend to tokens in the same frame plus video proxy tokens. This mechanism can be formulated as an attention mask $M_{ViP}$:

$$M_{ViP}(q, k) = \begin{cases} 1 & \text{if } q \in P \text{ or } k \in P \text{ or } (q, k) \text{ in the same frame}, \\ 0 & \text{otherwise}, \end{cases}$$

where $q$ and $k$ are the query and key in the attention calculation. Patch tokens can obtain global information from video proxy tokens while reducing inconsistencies with the original CLIP's calculation. Our experiment in Section 5 demonstrates the superiority of this mechanism. When the input is an image or a single frame, we use linear interpolation to get a middle temporal positional embedding, then treat the image/frame as a special single-frame video. This method enables joint training on both videos and images in the same batch, as our proxy-guided attention mechanism reduces the difference in calculations between video and image.
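The proxy-guided attention mask can be made concrete with a short sketch. The token layout (the M proxy tokens first, followed by the patch tokens frame by frame) and the function name are assumptions made for illustration. The released implementation may order tokens differently and would typically build the mask as a tensor.

```python
def build_vip_attention_mask(num_frames, patches_per_frame, num_proxies):
    """Build the 0/1 mask M_ViP as a nested list; mask[q][k] == 1 lets query q attend to key k.

    Assumed token order: num_proxies proxy tokens, then
    num_frames * patches_per_frame patch tokens laid out frame by frame.
    """
    total = num_proxies + num_frames * patches_per_frame

    def frame_of(index):
        # Proxy tokens belong to no frame; patch tokens map to their frame id.
        if index < num_proxies:
            return None
        return (index - num_proxies) // patches_per_frame

    mask = [[0] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            frame_q, frame_k = frame_of(q), frame_of(k)
            # Allowed if either side is a proxy token or both patches share a frame.
            if frame_q is None or frame_k is None or frame_q == frame_k:
                mask[q][k] = 1
    return mask


# Toy example: T = 2 frames, N = 2 patches per frame, M = 1 proxy token.
toy_mask = build_vip_attention_mask(num_frames=2, patches_per_frame=2, num_proxies=1)
for row in toy_mask:
    print(row)
```

With a single frame the mask contains only ones, so an image is processed with ordinary full attention, which matches the treatment of an image as a single-frame video.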

4.3. OMNISOURCE CROSS-MODAL LEARNING

To learn rich video-language alignment from video-subtitle pairs and reduce the language domain gap with downstream data through the corresponding auxiliary frame-caption pairs, we study joint cross-modal learning on the omnisource input. Following most works that learn multimodal alignment with dual encoders (Radford et al., 2021; Xue et al., 2022; Li et al., 2021; Luo et al., 2020; Xu et al., 2021b; Luo et al., 2021), we use the info-NCE loss to perform contrastive learning. There are two formats of visual source (video sequences and single frames) and two types of text source (subtitles and captions) in our work. We denote them by $V$, $F$, $S$, and $C$, respectively. We define a source-wise info-NCE loss by:

$$L_{v2t} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{e^{v_i^\top t_i / \tau}}{\sum_{j=1}^{B} e^{v_i^\top t_j / \tau}}, \qquad L_{t2v} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{e^{t_i^\top v_i / \tau}}{\sum_{j=1}^{B} e^{t_i^\top v_j / \tau}}, \quad (3)$$

where $v_i$ and $t_j$ are the normalized embeddings of the $i$-th visual feature in $X \in \{V, F\}$ and the $j$-th text feature in $Y \in \{S, C\}$ in a batch of size $B$, and $\tau$ is a learnable temperature. The overall alignment loss $L_{X \leftrightarrow Y}$ is the average of $L_{v2t}$ and $L_{t2v}$. For example, $L_{V \leftrightarrow S}$ represents the info-NCE loss within video-subtitle pairs in a batch, which pulls aligned pairs together in the embedding space while pushing apart misaligned pairs.
We study the reasonable variants of OCL: (a) $L_{V \leftrightarrow S} + L_{F \leftrightarrow C}$: a simple combination of two source-wise losses on video-subtitle and frame-caption pairs; (b) $L_{V \leftrightarrow S} + L_{V \leftrightarrow C}$: as there is also content correlation between a video and its middle-frame caption, we explore adding a loss on video-caption pairs to the baseline loss $L_{V \leftrightarrow S}$; (c) $L_{V \leftrightarrow S} + L_{V \leftrightarrow C} + L_{F \leftrightarrow C}$: the combination of (a) and (b); (d) $L_{V \leftrightarrow S,C} + L_{F \leftrightarrow C}$: a video corresponds to both a subtitle and an auxiliary caption. Compared to (c), the number of negative pairs in $L_{v2t}$ can be expanded.
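The $L_{v2t}$ in $L_{V \leftrightarrow S,C}$ is rewritten as:

$$L_{v2t} = -\frac{1}{2B} \sum_{i=1}^{B} \left( \log \frac{e^{v_i^\top s_i / \tau}}{\sum_{j=1}^{B} e^{v_i^\top s_j / \tau} + e^{v_i^\top c_{j \neq i} / \tau}} + \log \frac{e^{v_i^\top c_i / \tau}}{\sum_{j=1}^{B} e^{v_i^\top c_j / \tau} + e^{v_i^\top s_{j \neq i} / \tau}} \right),$$

where $s_i \in S$ and $c_i \in C$. The $L_{t2v}$ in $L_{V \leftrightarrow S,C}$ is the same as in (c). We compare all variants with the baseline $L_{V \leftrightarrow S}$ and report results in Section 5.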
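The losses of this section can be illustrated with a pure-Python sketch on small lists of normalized embeddings. It is not the training code. The function names are hypothetical, and the joint loss follows one reading of the formula for $L_{V \leftrightarrow S,C}$: each positive pair is contrasted against all $B$ texts of its own type plus the $B - 1$ non-matching texts of the other type.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(x - m) for x in values))


def info_nce_one_direction(queries, keys, tau):
    """-(1/B) * sum_i log( exp(q_i.k_i/tau) / sum_j exp(q_i.k_j/tau) )."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / tau for k in keys]
        total += logits[i] - log_sum_exp(logits)
    return -total / len(queries)


def source_wise_loss(visual, text, tau):
    """L_{X<->Y}: average of the v2t and t2v info-NCE terms."""
    v2t = info_nce_one_direction(visual, text, tau)
    t2v = info_nce_one_direction(text, visual, tau)
    return 0.5 * (v2t + t2v)


def v2t_loss_joint(videos, subtitles, captions, tau):
    """v2t term of L_{V<->S,C} with negatives drawn from both text sources."""
    batch = len(videos)
    total = 0.0
    for i, v in enumerate(videos):
        s_logits = [dot(v, s) / tau for s in subtitles]
        c_logits = [dot(v, c) / tau for c in captions]
        s_others = [x for j, x in enumerate(s_logits) if j != i]
        c_others = [x for j, x in enumerate(c_logits) if j != i]
        # Subtitle positive: all subtitles plus non-matching captions as candidates.
        total += s_logits[i] - log_sum_exp(s_logits + c_others)
        # Caption positive: all captions plus non-matching subtitles as candidates.
        total += c_logits[i] - log_sum_exp(c_logits + s_others)
    return -total / (2 * batch)


# Toy batch of two unit-norm embeddings per source, with a fixed temperature.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = source_wise_loss(aligned, aligned, tau=0.1)
loss_swapped = source_wise_loss(aligned, swapped, tau=0.1)
loss_joint = v2t_loss_joint(aligned, aligned, aligned, tau=0.1)
```

In the toy batch, correctly aligned pairs give a loss close to zero while swapped pairs give a loss of about 10. The joint term is slightly larger than the single-source term on the same inputs because each positive competes with more negatives.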

5.1. EXPERIMENTAL DETAILS

Video-Text Post-Pretraining. To pre-train the proposed CLIP-ViP model, we uniformly sample 12 frames from video clips with an average length of 13.4 seconds and resize all frames to 224×224. The sampled frames can well cover the semantics conveyed by a video clip. For text, we adopt CLIP's tokenizer to split a sentence into word tokens with a max length of 70. We use the AdamW optimizer (Loshchilov & Hutter, 2019), and empirically set the initial learning rate to 5e-6 and a fixed weight decay of 5e-2. For the learning rate schedule, we adopt cosine decay with a warmup strategy. We train our model on 32 NVIDIA Tesla V100 GPUs with a batch size of 1024. The contrastive similarity is calculated on gathered features from all GPUs. We set training steps to one epoch on HD-VILA-100M for all ablation studies and three epochs for the full setting.
Fine-tuning Training. To better adapt CLIP-ViP to downstream tasks, we reuse most hyper-parameters of post-pretraining in fine-tuning with some exceptions. 1) Batch size: we fine-tune our model with a batch size of 128 for all downstream tasks for a fair comparison. 2) Learning rate and weight decay: we empirically set them to 1e-6 and 0.2, respectively. 3) Number of epochs: due to the various scales of downstream datasets, we set epoch numbers to 5, 20, 10, and 20 for MSR-VTT, DiDeMo, LSMDC, and ActivityNet, respectively. 4) Frame number: for a fair comparison, we set the frame number to 12 except for ActivityNet Captions (set to 32), as its videos are much longer (180 seconds on average). Note that the hyper-parameters of downstream training are the same in all settings in the ablation study.
Downstream Datasets. To evaluate the performance of video pre-training models, we conduct text-to-video retrieval experiments on four typical datasets. (a) MSR-VTT (Xu et al., 2016) contains 10K YouTube videos with 200K descriptions. We follow previous works (Yu et al., 2018; Liu et al., 2019) to train models on 9K videos, and report results on the 1K-A test set. (b) DiDeMo (Anne Hendricks et al., 2017) consists of 10K Flickr videos annotated with 40K sentences. We follow (Liu et al., 2019; Zhang et al., 2018) to evaluate paragraph-to-video retrieval and concatenate all descriptions of a video as one query. (c) LSMDC (Rohrbach et al., 2016) consists of 118,081 video clips sourced from 202 movies with one caption corresponding to each clip. Evaluation is conducted on a test set of 1,000 videos from movies disjoint from the train and validation sets. (d) ActivityNet Captions (Krishna et al., 2017a) contains 20K YouTube videos annotated with 100K sentences. We follow the paragraph-to-video retrieval setting (Zhang et al., 2018; Liu et al., 2019) to train models on 10K videos and report results on the val1 set with 4.9K videos.

Table 4: Ablation study of post-pretrain data. We report text-to-video results of models finetuned on MSR-VTT and DiDeMo. Mean ↑ indicates an average of Recall@1, 5 and 10. For all results, the model is designed with 4 video proxy tokens and pre-trained on CLIP-ViT-B/32. All post-pretraining steps are equivalent to one epoch on HD-VILA-100M.
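The retrieval metrics reported in the tables (Recall@K and the Mean of Recall@1, 5 and 10) can be computed as in the following sketch. It assumes a square text-to-video similarity matrix in which query i matches video i, as in the one-query-per-video test protocols above. The function names are hypothetical, and ties are resolved in favor of the ground truth, which is a choice made for this sketch.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose ground-truth video is ranked in the top k.

    similarity[i][j] is the score between text query i and video j;
    the ground-truth video of query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Number of videos scored strictly higher than the ground truth.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


def mean_recall(similarity, ks=(1, 5, 10)):
    """Average of Recall@K over the given cut-offs (the Mean column)."""
    return sum(recall_at_k(similarity, k) for k in ks) / len(ks)


# Toy example with 3 queries and 3 videos; query 1 ranks its video second.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
toy_r1 = recall_at_k(toy_similarity, 1)
toy_mean = mean_recall(toy_similarity)
```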

5.2. ABLATION STUDIES

Video Proxy Mechanism. For the vision encoder, we evaluate our proposed Video Proxy (ViP) mechanism with different numbers of proxies and compare it with different model structures (i.e., MeanPool, SeqTransformer, Full Attention) by fine-tuning the same pre-trained model on the MSR-VTT retrieval task. MeanPool simply takes the average of frame features as the representation of the whole video. For SeqTransformer, we follow the seqTransf type in CLIP4Clip (Luo et al., 2021) and the residual connection in their implementation. The Full Attention setting takes all patch tokens as the input of the vision encoder, and attention is conducted across all tokens. All models are initialized with CLIP-ViT-B/32. The results are shown in Table 2. Compared to the MeanPool baseline, which completely disregards temporality, SeqTransformer improves the average of Recall@1, 5, 10 by 0.8%. The Full Attention type leads to a significant performance drop, and we observe a worse initial status and slower convergence than in other settings during the experiment. This is consistent with our analysis in Section 4.2 that directly using CLIP for attention computation over all patches diminishes the advantage of CLIP. In our method, all tested numbers of video proxy tokens result in a significant performance gain on R@1 (e.g., 3.1% with 4 proxies), while only adding negligible parameters: 3K compared to 86M for the ViT backbone. Our method yields the largest improvement among the compared structures for every number of proxies, which indicates that our proposed video proxy mechanism can effectively leverage the image-text pre-trained model for video-text post-pretraining.
Omnisource Cross-modal Learning. To verify the effectiveness of the proposed Omnisource Cross-modal Learning (OCL) and compare its variants, we set up a post-pretraining and fine-tuning pipeline and adopt the same hyper-parameters for all experiments.
$L_{V \leftrightarrow S}$ is the baseline contrastive loss on video-subtitle pairs. After introducing auxiliary captions, we study four variants of the OCL loss: (a) $L_{V \leftrightarrow S} + L_{F \leftrightarrow C}$; (b) $L_{V \leftrightarrow S} + L_{V \leftrightarrow C}$; (c) $L_{V \leftrightarrow S} + L_{V \leftrightarrow C} + L_{F \leftrightarrow C}$; (d) $L_{V \leftrightarrow S,C} + L_{F \leftrightarrow C}$, as explained in Section 4.3. We pre-train models with each loss function for only one epoch due to the costly training, then finetune on two video-text retrieval datasets: MSR-VTT and DiDeMo. We compare the results with CLIP-MeanPool and with CLIP using the proposed Video Proxy mechanism without post-pretraining (i.e., CLIP-ViP). The results are listed in Table 3. On the MSR-VTT dataset, we find that $L_{V \leftrightarrow S}$ brings very little improvement: 0.4% on the average of Recall@1, 5, 10. This is due to the large domain gap between MSR-VTT and post-pretraining data. Combined with auxiliary captions, all four variants of the OCL loss bring significant improvements: over 3% on Recall@1 and over 2.3% on the average of Recall@1, 5, 10. On the DiDeMo dataset, on top of the improvement brought by $L_{V \leftrightarrow S}$, OCL further improves the results by a large margin: 8% on the average of Recall@1, 5, 10. Finally, $L_{V \leftrightarrow S,C} + L_{F \leftrightarrow C}$ performs best and is applied as our final setting.

Table 5: Comparison with SOTA models in MSR-VTT (Xu et al., 2016) and ActivityNet (Krishna et al., 2017a) text-to-video retrieval tasks. * and † respectively denote that the method uses DSL (Cheng et al., 2021) and QB-Norm (Bogolin et al., 2022) as post-processing operations.
Method                              | MSR-VTT Retrieval            | ActivityNet Captions Retrieval
                                    | R@1 ↑  R@5 ↑  R@10 ↑  MdR ↓  | R@1 ↑  R@5 ↑  R@10 ↑  MdR ↓
ClipBERT (Lei et al., 2021)         | 22.0   46.8   59.9    6.0    | 21.3   49.0   63.5    6.0
VLM (Xu et al., 2021a)              | 28.1   55.5   67.4    4.0    | -      -      -       -
MMT (Gabeur et al., 2020)           | 26.6   57.1   69.6    4.0    | 28.7   61.4   -       3.3
Support Set (Patrick et al., 2021)  | 30.1   58.5   69.3    3.0    | 29.2   61.6   -       3.0
Frozen (Bain et al., 2021)          | 31.0   59.5   70.5    3.0    | 28.8   60.9   -       3.0
VideoCLIP (Xu et al., 2021b)        | 30.9   55.4   66.

Auxiliary Data. In this part, we ablate the contribution of large-scale noisy data and auxiliary data.
For uni-source, we use video-subtitle pairs and video-caption data for post-pretraining with the vanilla contrastive loss. For data combination, we apply OCL under the $L_{V \leftrightarrow S,C} + L_{F \leftrightarrow C}$ setting to post-pretrain on the combined data. From Table 4, Omnisource post-pretraining results are much better than the two uni-source results. On MSR-VTT, both uni-source post-pretraining settings show limited improvement: 67.4% and 66.9% compared with 67.0%. In contrast, the Omnisource post-pretraining brings a significant improvement of 2.6%. On DiDeMo, the benefit of data combination is also obvious, with nearly double the improvements brought by uni-source. These results show that the auxiliary data together with our designed joint learning method can effectively adapt the image-text pre-trained model to video-text post-pretraining. As the generation of auxiliary captions is based on OFA-Caption (Wang et al., 2022b), a powerful image-text pre-trained model, we also explore only including existing data in post-pretraining. We choose image-text pairs of several widely-adopted datasets: MS-COCO, Visual Genome (VG) (Krishna et al., 2017b), Flickr-30K (Young et al., 2014), SBU (Ordonez et al., 2011), CC3M (Sharma et al., 2018) and CC12M as our auxiliary data (namely ImageCaption). To ablate the contribution of these data, we add experiments of post-pretraining on ImageCaption alone and on HD-VILA-100M combined with ImageCaption. From Table 4, post-pretraining on ImageCaption alone results in performance degradation on MSR-VTT and marginal improvement on DiDeMo. In contrast, ImageCaption yields significant performance gains on both datasets when used as auxiliary data for HD-VILA-100M. This further illustrates the importance of the combination of large-scale noisy data and auxiliary data.

Table 6: Comparison with SOTA models in DiDeMo (Anne Hendricks et al., 2017) and LSMDC (Rohrbach et al., 2016) text-to-video retrieval tasks. * denotes using post-processing DSL (Cheng et al., 2021).

5.3. COMPARISON TO STATE-OF-THE-ART MODELS

We compare our model under the full setting (three epochs) with state-of-the-art works on the text-to-video retrieval task. The results of fine-tuning on four datasets (i.e., MSR-VTT, DiDeMo, ActivityNet Captions, LSMDC) are shown in Tables 5 and 6. We clarify the backbone for CLIP-based works. We add results with DSL (Cheng et al., 2021) only to make a fair comparison with methods that use post-processing operations, e.g., DSL (Cheng et al., 2021) or QB-Norm (Bogolin et al., 2022). Our model achieves the best results on all datasets with both CLIP-ViT-B/32 and CLIP-ViT-B/16. Note that some existing methods are also applicable on top of our models, as our modification to the CLIP model is minimal. Even without post-processing (e.g., DSL), our results still surpass methods using post-processing operations on most datasets. Besides, adding DSL greatly improves the performance of our model since our model has good bidirectional vision-language correspondence. The good results on the ActivityNet Captions dataset also indicate that our model can generalize well to long videos. Overall, the improvements on different datasets demonstrate the superiority of the video-language representation learned by our CLIP-ViP model.

6. CONCLUSION

In this paper, we study further pre-training (post-pretraining) image-text models like CLIP on large-scale video data. We first conduct a preliminary analysis to reveal the factors hindering video post-pretraining. Motivated by these findings, we propose CLIP-ViP, which includes an Omnisource Cross-modal Learning method and a Video Proxy mechanism. The Video Proxy mechanism can better model videos containing temporal information while reducing conflicts with the pre-trained CLIP model. The Omnisource Cross-modal Learning alleviates the problem caused by the domain gap between video-subtitle and downstream data. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin and also achieves new state-of-the-art results on four widely-used video-language benchmarks.



Figure 1: The curve of finetuning results during post-pretraining. The x-axis indicates the percentage of training steps. The y-axis indicates average value of Recall@1, 5 and 10. [Best viewed in color]

Figure 2: The framework of CLIP-ViP with a text encoder and a vision encoder. Taking features $V$, $F$, $S$, $C$ of videos, frames, subtitles, and captions as input, a series of Omnisource cross-modal learning variants are studied to explore better representation learning losses: (a) $L_{V \leftrightarrow S} + L_{F \leftrightarrow C}$; (b) $L_{V \leftrightarrow S} + L_{V \leftrightarrow C}$; (c) $L_{V \leftrightarrow S} + L_{V \leftrightarrow C} + L_{F \leftrightarrow C}$; (d) $L_{V \leftrightarrow S,C} + L_{F \leftrightarrow C}$. In the vision encoder, video proxy tokens and the ViP-guided attention mechanism are proposed to transfer CLIP into the video domain. [Best viewed in color]


MSR-VTT text-to-video retrieval results of finetuning CLIP by different settings. Mean ↑ indicates the average value of Recall@1, 5, and 10. All results are based on CLIP-ViT-B/32.

Ablation study of different losses. We report text-to-video results of models finetuned on MSR-VTT and DiDeMo. Mean ↑ means the average of Recall@1, 5 and 10. All results are based on CLIP-ViT-B/32. All post-pretraining steps are equivalent to one epoch on HD-VILA-100M.

Comparison with SOTA models in MSR-VTT

Comparison with SOTA models in DiDeMo (Anne Hendricks et al., 2017) and LSMDC

funding

* Equal contribution. This work was performed when Hongwei Xue and Yuchong Sun were visiting Microsoft Research as research interns.

annex

From the results in Table 8, we can see that there is a huge performance gap. Without being initialized from CLIP, CLIP-ViP cannot leverage the rich knowledge contained in CLIP, which leads to much lower training efficiency.

