CLIP-VIP: ADAPTING PRE-TRAINED IMAGE-TEXT MODEL TO VIDEO-LANGUAGE ALIGNMENT

Abstract

Pre-trained image-text models such as CLIP have demonstrated the power of vision-language representations learned from large-scale web-collected image-text data. In light of these well-learned visual features, some works transfer image representations to the video domain and achieve good results. However, adapting image-text pre-trained models to video-text pre-training (i.e., post-pretraining) has not yet demonstrated a significant advantage. In this paper, we tackle this challenge by raising and addressing two questions: 1) what factors hinder post-pretraining CLIP from improving performance on video-text tasks, and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have large impacts. Based on these observations, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on top of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves state-of-the-art results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.

1. INTRODUCTION

In the past few years, vision-language pre-training has achieved great success in cross-modal representation learning from large-scale web-crawled data (Radford et al., 2021; Jia et al., 2021; Li et al., 2021; Wang et al., 2021b; Zellers et al., 2021; 2022; Bain et al., 2021). Among them, image-text pre-trained models (Radford et al., 2021; Jia et al., 2021) have shown powerful capability on various downstream tasks, including visual understanding (Gu et al., 2021; Wang et al., 2021a; Rao et al., 2022), image-text generation (Patashnik et al., 2021; Mokady et al., 2021), and more (Guzhov et al., 2022; Zhang et al., 2022). In light of the well-learned and enriched visual representation, some works directly adapt image-text pre-trained models to video-text downstream tasks without further pre-training on video data (Luo et al., 2021; Fang et al., 2021; Gorti et al., 2022; Zhao et al., 2022), yet still outperform models pre-trained on video data (Xu et al., 2021b; Bain et al., 2021). Building on an existing powerful image-text pre-trained model for further video-text pre-training (i.e., post-pretraining) can reduce the required training cost by making good use of the knowledge learned from images. However, adapting image-text pre-trained models to video-text data for post-pretraining has not yet demonstrated a significant advantage, and thus remains under-explored. A preliminary study is conducted by CLIP4Clip (Luo et al., 2021), which adopts MeanPooling by averaging multiple frame features on top of the CLIP model, trained on a subset of HowTo100M (Miech et al., 2019). However, the improvement over directly using the image-text pre-trained model is marginal in both zero-shot and fine-tuning settings. In this paper, we aim to explore how to effectively adapt an image-text pre-trained model (e.g., CLIP) to video-language representation learning for video-text tasks (e.g., text-to-video retrieval).
To unleash the power of video data for adapting image-text pre-trained models, we conduct several preliminary experiments to identify the challenges that hinder post-pretraining. First, we post-pretrain an image-text pre-trained model (i.e., CLIP) with MeanPooling on video-text datasets of different scales, including WebVid-2.5M (Bain et al., 2021) and HD-VILA-100M (Xue et al., 2022). The results show that data scale is critical for video-text post-pretraining: on small-scale data, the model is prone to over-fitting the new data, which suppresses the knowledge learned from image-text pairs and reduces performance. Second, we investigate the language domain gap between pre-training data and downstream data. By computing the Normalized Mutual Information (NMI) on clusters of text features, we find a large domain gap between the subtitles used in large-scale video-text pre-training data and the descriptive texts in downstream tasks.

To mitigate the impact of the above factors, we propose CLIP-ViP, which adapts the pre-trained image-text model CLIP for video-text pre-training. First, we introduce auxiliary captions, which have a smaller language domain gap with downstream data, into existing large-scale video-text data. Instead of using a video captioning model, which may cause data leakage because such models are trained on the same datasets as video-text downstream tasks, and given that image captioning models currently offer stronger captioning quality, we adopt an image captioning model to generate an auxiliary caption for the middle frame of each video. Then, to adapt a Transformer-based vision encoder to process both images and videos with minimal modification, we propose video proxy tokens and design a proxy-guided video attention mechanism for the Vision Transformer (ViT).
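The NMI-based domain-gap analysis above can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: we assume pooled text features from two language sources are clustered jointly (here with k-means), and NMI is computed between the cluster assignments and the source labels, so that a high NMI indicates the two sources are easily separable, i.e., a large domain gap.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def domain_gap_nmi(feats_a, feats_b, n_clusters=2, seed=0):
    """Cluster text features from two corpora jointly and measure how
    well the clusters align with the corpus labels. NMI near 1 means
    the two language sources are easily separable (large domain gap);
    NMI near 0 means they are well mixed (small gap)."""
    feats = np.concatenate([feats_a, feats_b], axis=0)
    labels = np.array([0] * len(feats_a) + [1] * len(feats_b))
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(feats)
    return normalized_mutual_info_score(labels, clusters)

# Toy check with synthetic features: two well-separated Gaussian
# "domains" (standing in for subtitle vs. caption embeddings).
rng = np.random.default_rng(0)
subs = rng.normal(0.0, 0.1, size=(100, 8))
caps = rng.normal(5.0, 0.1, size=(100, 8))
gap = domain_gap_nmi(subs, caps)
```

In this toy setting the two feature clouds are trivially separable, so the NMI is close to 1; real subtitle and caption embeddings would yield intermediate values.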
Specifically, during attention computation in each block, video proxy tokens can interact with all tokens, while patch tokens only interact with video proxy tokens and the patch tokens within the same frame. Our vision encoder adds only a negligible number of parameters and computations compared to the vanilla Vision Transformer, while improving generality and extensibility. To facilitate cross-modal representation learning from both caption-frame and video-subtitle data at the same time, we propose an Omnisource Cross-modal Learning (OCL) method for pre-training and study a series of variants to find the best fusion strategy. Our experimental results show that our approach improves the performance of CLIP on text-to-video retrieval tasks by a large margin. We also conduct ablation studies to verify the effectiveness of each component of our approach. Our contributions are summarized as follows: (1) We are among the first to explore the factors that hinder video post-pretraining of pre-trained image-text models; (2) We propose CLIP-ViP, which effectively leverages an image-text pre-trained model for post-pretraining; (3) We conduct extensive experiments to verify the effectiveness of our method. Our model outperforms the state-of-the-art results by a large margin on four widely-used benchmarks.
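The proxy-guided attention pattern described above can be expressed as an attention mask over the token sequence. The sketch below is an illustrative reconstruction, not the released implementation: it assumes the sequence is laid out as `num_proxy` video proxy tokens followed by the patch tokens of each frame in order, and builds a boolean mask in which proxy tokens attend everywhere while patch tokens attend only to the proxies and to patches of their own frame.

```python
import torch

def proxy_guided_attention_mask(num_proxy, num_frames, patches_per_frame):
    """Boolean mask (True = attention allowed) for proxy-guided video
    attention: proxy tokens interact with all tokens; patch tokens
    interact only with proxy tokens and same-frame patch tokens."""
    n = num_proxy + num_frames * patches_per_frame
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_proxy, :] = True            # proxies attend to everything
    mask[:, :num_proxy] = True            # all tokens attend to proxies
    for f in range(num_frames):
        start = num_proxy + f * patches_per_frame
        end = start + patches_per_frame
        mask[start:end, start:end] = True  # intra-frame patch attention
    return mask

# Example: 4 proxy tokens, 12 frames, 7x7 = 49 patches per frame.
mask = proxy_guided_attention_mask(num_proxy=4, num_frames=12,
                                   patches_per_frame=49)
# The mask can be passed to torch.nn.MultiheadAttention via its
# attn_mask argument (which masks where True, hence the inversion):
#   out, _ = mha(x, x, x, attn_mask=~mask)
```

Because cross-frame interaction flows only through the few proxy tokens, the mechanism keeps the attention pattern close to the per-frame ViT that CLIP was trained with, which matches the paper's claim of minimal modification.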

2. RELATED WORK

Vision-Language Pre-Training End-to-end models (Lei et al., 2021; Xue et al., 2022; Zellers et al., 2021; Fu et al., 2021; Huang et al., 2020; 2021b; Xue et al., 2021; Li et al., 2021; Kim et al., 2021; Huang et al., 2021a; Sun et al., 2022) for vision-language pre-training are replacing the traditional approach of using visual features pre-extracted by off-the-shelf models (Sun et al., 2019; Xu et al., 2021b; Zhu & Yang, 2020; Li et al., 2020b; a; Chen et al., 2020). Training end-to-end models on large-scale web-collected data has also gradually demonstrated big advantages (Radford et al., 2021; Jia et al., 2021; Xue et al., 2022; Zellers et al., 2021; 2022). Unlike images, which have alt-texts, large-scale video datasets suitable for pre-training usually use subtitles as text sources (Miech et al., 2019; Xue et al., 2022). Subtitles are much noisier than alt-texts; according to Miech et al. (2019), typical examples of incoherence include the content producer asking viewers to subscribe to their channel, talking about something unrelated to the video, or describing something before or after it happens. Bain et al. collect a video dataset, WebVid (Bain et al., 2021), with textual description annotations. Its texts are well aligned with the videos and do not suffer from ASR errors. However, the vast majority of WebVid videos are sourced from a stock footage website, so scaling it up is limited. Video-subtitle data, in contrast, is more easily accessible on the web and thus suitable for scaling up. In this paper, we investigate the unfavorable factors of video-subtitle data and explore how to mitigate their impact.


* Equal contribution. This work was performed when Hongwei Xue and Yuchong Sun were visiting Microsoft Research as research interns.

