ADVL: ADAPTIVE DISTILLATION FOR VISION-LANGUAGE TASKS

Abstract

Large-scale image-text pairs, such as image-caption and image-phrase data, enable the strong representations of vision-language (VL) models. Nevertheless, such data lack diversity and complexity because of the constraints involved in collecting them. Meanwhile, models pre-trained with image-only or text-only data (we call them unimodal pre-trained models) continue to flourish and impress the community. Compared to image-text pairs, unimodal data face fewer constraints during collection, resulting in more diverse styles. A natural question is: how can we leverage unimodal pre-trained models to benefit downstream VL tasks? Most existing works focus on fusing VL information in the expensive pre-training stage. They directly plug unimodal pre-trained encoders into a VL framework and redo an additional pre-training step on paired image-text data. This incurs additional computational expense, and the unimodal pre-trained knowledge might be forgotten. In this paper, we take a different route and investigate how to fuse VL information in the finetuning stage only. To directly transfer pre-trained knowledge from unimodal models to downstream VL tasks, we propose ADVL, which avoids redoing any pre-training step and generalizes across various VL base models. To comprehensively demonstrate the effectiveness of ADVL, we evaluate on three widely recognized, semantically demanding VL benchmarks, VCR, VQA, and SNLI-VE, under three settings: low-shot, full-shot, and domain-shifted. Results show that ADVL consistently improves performance with different VL base models across all settings. It even achieves state-of-the-art (SOTA) performance on VCR among models pre-trained with image-text data and delivers competitive results on VQA and SNLI-VE. Based on our analysis, we also find that ADVL improves the robustness of VL models and encourages them to make better use of visual information.

1. INTRODUCTION

Recently, Vision-Language (VL) models (Radford et al., 2021; Jia et al., 2021b; Pham et al., 2021; Wang et al., 2021; Chen et al., 2020; Su et al., 2020; Gan et al., 2020) pre-trained on paired image-text data have achieved great success on many VL tasks (Zellers et al., 2019; Xie et al., 2019; Antol et al., 2015; Deng et al., 2009). These paired data generally fall into two categories: curated image-caption data (Sharma et al., 2018; Lin et al., 2014) or noisy online image-text data (Radford et al., 2021; Jia et al., 2021b; Pham et al., 2021; Wang et al., 2021; Yu et al., 2022). However, relying primarily on paired image-text data for pre-training restricts the knowledge a model can learn. Image-text data are challenging to collect and their styles are limited. Image captions are typically short, template-like, and use a limited vocabulary (Chen et al., 2015; Sharma et al., 2018). The left part of Fig. 1 shows an example of an image caption, "A glass of beer on a table". This description differs from the texts used in many downstream applications, such as VCR (Zellers et al., 2019) and SNLI-VE (Xie et al., 2019), where event inference and spatial/temporal scene understanding are essential. On the other hand, unimodal data, such as text, are relatively easier to collect and cover a wider range of domains. For example, web-crawled text corpora (Raffel et al., 2020; Gokaslan & Cohen, 2019) contain longer sentences, paragraphs, stories, and articles with full context. In fact, many unimodal models pre-trained on billion-scale data are readily available.
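The general mechanism of transferring pre-trained knowledge from one model to another at finetuning time can be illustrated with a standard soft-target knowledge-distillation loss (Hinton et al., 2015). The sketch below is generic and is not the specific ADVL objective; the function names, temperature value, and use of NumPy are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target distillation: KL(teacher || student) on
    temperature-softened distributions, averaged over the batch.

    This is the generic formulation from Hinton et al. (2015),
    not the paper's ADVL objective.
    """
    p_t = softmax(teacher_logits / T)   # soft targets from a frozen (e.g. unimodal) teacher
    p_s = softmax(student_logits / T)   # predictions of the VL student being finetuned
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)))
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return float(kl / student_logits.shape[0] * T * T)
```

In a finetuning loop, this term would typically be added to the task loss with a weighting coefficient, so the student fits the downstream labels while staying close to the teacher's pre-trained predictive distribution.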

