CLIP-VIP: ADAPTING PRE-TRAINED IMAGE-TEXT MODEL TO VIDEO-LANGUAGE ALIGNMENT

Abstract

Pre-trained image-text models, such as CLIP, have demonstrated the strong power of vision-language representation learned from large-scale web-collected image-text data. In light of these well-learned visual features, some works transfer image representations to the video domain and achieve good results. However, adapting image-text pre-trained models to video-text pre-training (i.e., post-pretraining) has not demonstrated a significant advantage yet. In this paper, we tackle this challenge by raising and addressing two questions: 1) what are the factors hindering post-pretraining CLIP from improving performance on video-text tasks, and 2) how to mitigate the impact of these factors. Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have large impacts. Based on these observations, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model achieves state-of-the-art results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.

1. INTRODUCTION

In the past few years, vision-language pre-training has achieved great success on cross-modal representation learning from a large scale of web-crawled data (Radford et al., 2021; Jia et al., 2021; Li et al., 2021; Wang et al., 2021b; Zellers et al., 2021; 2022; Bain et al., 2021). Among them, image-text pre-trained models (Radford et al., 2021; Jia et al., 2021) have shown powerful capability for various downstream tasks, including visual understanding (Gu et al., 2021; Wang et al., 2021a; Rao et al., 2022), image-text generation (Patashnik et al., 2021; Mokady et al., 2021) and so on (Guzhov et al., 2022; Zhang et al., 2022). In light of the well-learned and enriched visual representation, some works directly adapt image-text pre-trained models to video-text downstream tasks without further pre-training on video data (Luo et al., 2021; Fang et al., 2021; Gorti et al., 2022; Zhao et al., 2022), while still outperforming models pre-trained on video data (Xu et al., 2021b; Bain et al., 2021). Utilizing an existing powerful image-text pre-trained model for further video-text pre-training (i.e., post-pretraining) can reduce the required training cost by making good use of the knowledge learned from images. However, adapting image-text pre-trained models to video-text data for post-pretraining has not demonstrated a significant advantage yet, and thus is still under-explored. A preliminary study was conducted by CLIP4Clip (Luo et al., 2021), which adopts MeanPooling by averaging multiple frame features based on the CLIP model on a subset of HowTo100M (Miech et al., 2019). While the improvement over directly using the image-text pre-trained model is marginal for
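The MeanPooling adaptation used by CLIP4Clip can be sketched as follows: per-frame embeddings from CLIP's image encoder are averaged into a single video embedding, which is then L2-normalized so that dot products with text embeddings give the cosine similarities used for retrieval. The random vectors below are stand-ins for real CLIP features (the embedding dimension and frame count are illustrative, not prescribed by the paper).

```python
import numpy as np

def mean_pool_video(frame_embeds: np.ndarray) -> np.ndarray:
    """Aggregate per-frame image embeddings into one video embedding
    by simple averaging (the MeanPooling strategy of CLIP4Clip)."""
    video = frame_embeds.mean(axis=0)
    # L2-normalize so the dot product with a normalized text embedding
    # equals cosine similarity, matching CLIP's contrastive setup.
    return video / np.linalg.norm(video)

# Toy example: 12 sampled frames, 512-dim embeddings (stand-ins for CLIP features).
rng = np.random.default_rng(0)
frames = rng.standard_normal((12, 512)).astype(np.float32)
text = rng.standard_normal(512).astype(np.float32)
text /= np.linalg.norm(text)

video = mean_pool_video(frames)
similarity = float(video @ text)  # cosine score used to rank video-text pairs
```

Because averaging discards temporal order entirely, this baseline treats a video as a bag of frames, which is part of the motivation for the Video Proxy mechanism proposed in this paper.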


* Equal contribution. This work was performed when Hongwei Xue and Yuchong Sun were visiting Microsoft Research as research interns.

