ADVL: ADAPTIVE DISTILLATION FOR VISION-LANGUAGE TASKS

Abstract

Large-scale image-text pairs, such as image captions and image-phrase pairs, enable strong representations in vision-language (VL) models. Nevertheless, such data lack diversity and complexity due to constraints in the collection process. Meanwhile, models pre-trained on image-only or text-only data (which we call unimodal pre-trained models) continue to flourish and impress the community. Compared to image-text pairs, unimodal data are subject to fewer constraints during collection, resulting in more diverse styles. A natural question is: how can we leverage unimodal pre-trained models to benefit downstream VL tasks? Most existing works focus on fusing VL information in the expensive pre-training stage: they plug unimodal pre-trained encoders into a VL framework and redo an additional pre-training step on paired image-text data. This incurs additional computational expense, and the unimodal pre-trained knowledge may be forgotten. In this paper, we take a different route and investigate how to fuse VL information in the fine-tuning stage only. To directly transfer pre-trained knowledge from unimodal models to downstream VL tasks, we propose ADVL, which avoids redoing any pre-training step and generalizes across various VL base models. To comprehensively demonstrate the effectiveness of ADVL, we evaluate on three widely recognized, semantically demanding VL benchmarks, VCR, VQA, and SNLI-VE, under three settings: low-shot, full-shot, and domain-shifted. Results show that ADVL consistently improves performance with different VL base models across all settings. It even achieves state-of-the-art (SOTA) performance on VCR among models pre-trained with image-text data and delivers competitive results on VQA and SNLI-VE. Our analysis further shows that ADVL can improve the robustness of VL models and regulate them to make better use of vision information.

1. INTRODUCTION

Recently, Vision-Language (VL) models (Radford et al., 2021; Jia et al., 2021b; Pham et al., 2021; Wang et al., 2021; Chen et al., 2020; Su et al., 2020; Gan et al., 2020) pre-trained on paired image-text data have achieved great success on many VL tasks (Zellers et al., 2019; Xie et al., 2019; Antol et al., 2015; Deng et al., 2009). These paired data generally fall into two categories: curated image-caption data (Sharma et al., 2018; Lin et al., 2014) or noisy online image-text data (Radford et al., 2021; Jia et al., 2021b; Pham et al., 2021; Wang et al., 2021; Yu et al., 2022). However, relying primarily on paired image-text data for pre-training restricts the knowledge a model can learn. Image-text pairs are challenging to collect, and their styles are limited: image captions are typically short, template-like, and use a limited vocabulary (Chen et al., 2015; Sharma et al., 2018). The left part of Fig. 1 shows an example caption, "A glass of beer on a table". Such descriptions differ from the texts used in many downstream applications, such as VCR (Zellers et al., 2019) and SNLI-VE (Xie et al., 2019), for which event inference and spatial/temporal scene understanding are essential. On the other hand, unimodal data, such as text, are relatively easy to collect and cover a wider range of domains. For example, web-crawled text corpora (Raffel et al., 2020; Aaron Gokaslan*) contain longer sentences, paragraphs, stories, and articles with full context. In fact, many unimodal models pre-trained on billion-scale data are readily available. Considering this, we posit that the diverse styles and sources of the text and image data used in unimodal pre-training can benefit downstream VL tasks, in addition to paired image-text data. Inspired by this, we study how to effectively and efficiently transfer pre-trained knowledge from unimodal encoders to improve downstream VL models.
In this paper, we propose ADVL, Adaptive Distillation that leverages unimodal pre-trained models to improve performance on VL tasks. ADVL takes unimodal encoders as teacher models and distills their knowledge into a student VL model that is pre-trained on multimodal data. To allow distillation from vision and text encoders drawn from different pre-training sources, ADVL includes separate distillation pathways for vision and language. Further, two adaptive mechanisms, Adaptive Confidence-based Weighting and Adaptive Text Token Selection, are proposed at the instance level and token level to dynamically adjust the significance of the distilled knowledge. Finally, an adaptive two-step fine-tuning strategy is introduced to mitigate the domain gap in distillation. Prior works (Shen et al., 2021; Tan & Bansal, 2019; Li et al., 2019) have proposed to initialize weights from unimodal vision or text encoders when pre-training VL models with paired image-text data. Other studies (Li et al., 2020; Zhou et al., 2022) have focused on joint pre-training with both multimodal and unimodal data. In contrast to these works, ADVL leverages unimodal pre-trained models when fine-tuning a VL model for a downstream task. This setting has the following benefits: (1) Computation and energy efficiency: pre-training VL models is resource-hungry and may not be practical when computational resources are limited; ADVL avoids redoing the pre-training step. (2) Modularization and future-proofing: ADVL allows researchers to integrate new pre-trained vision, text, and multimodal encoders as they become available. This is essential as more and more large unimodal and multimodal pre-trained models are released. (3) Flexibility and generalization: the flexibility of selecting various combinations of pre-trained encoders as teacher models enables researchers to explore and find the best-suited teacher models (e.g., from the same domain) for a specific downstream task.
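To make the instance-level mechanism concrete, the sketch below shows one plausible form of confidence-weighted distillation: a temperature-scaled KL divergence between teacher and student distributions, scaled per instance by the teacher's confidence so that uncertain teacher predictions contribute less. The specific confidence measure (the teacher's maximum probability) and all function names are our own illustrative assumptions, not the paper's exact formulation.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution at the given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def adaptive_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Per-instance distillation loss weighted by teacher confidence.

    NOTE: using the teacher's max probability as the confidence score is an
    illustrative assumption, not ADVL's exact weighting scheme.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    confidence = max(softmax(teacher_logits))  # confidence at temperature 1
    return confidence * kl_divergence(teacher_probs, student_probs)
```

Under this scheme, an instance where the teacher agrees with the student contributes zero loss, while a confidently wrong student is penalized most; a separate weighting of the same form could be applied along each of the vision and language distillation pathways.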
To verify the effectiveness of ADVL, we implement it on top of several high-performing VL models and evaluate on three popular VL tasks, VCR (Zellers et al., 2019), SNLI-VE (Xie et al., 2019), and VQA (Antol et al., 2015), under low-shot, full-data, and domain-shifted settings. Extensive experiments show that ADVL improves the performance of VL models on various tasks across all settings by leveraging unimodal encoders. Moreover, it achieves SOTA performance on VCR Q2A compared to other non-ensembled models pre-trained with large-scale image-text data, and it achieves competitive performance on SNLI-VE and VQA. Furthermore, we find that existing VL models tend to under-utilize vision information; with ADVL, the vision information is better used, and as a result the model is more robust in the domain-shifted setting. We plan to release the code to facilitate future research in the community.

2. RELATED WORK

Pre-training with image and text data: Existing pre-training frameworks leveraging image and text data fall into three essential categories: (1) unimodal pre-training, including image-only pre-training, e.g., (Dosovitskiy et al., 2020; Liu et al., 2021b; Dai et al., 2021), or text-only pre-



Figure 1: Examples of paired, unpaired, and downstream image and text data. Unpaired text and images have more diverse distributions, in terms of both content and structure, than paired data.

