CONTRASTIVE ALIGNMENT OF VISION TO LANGUAGE THROUGH PARAMETER-EFFICIENT TRANSFER LEARNING

Abstract

Contrastive vision-language models (e.g. CLIP) are typically created by updating all the parameters of a vision model and a language model through contrastive training. Can such models instead be created by a small number of parameter updates to an already-trained language model and vision model? The literature describes techniques that can create vision-language models by updating a small number of parameters in a language model, but these require already-aligned visual representations and are non-contrastive, making them unusable for latency-sensitive applications such as neural search. We explore the feasibility and benefits of parameter-efficient contrastive vision-language alignment through transfer learning: creating a model such as CLIP by minimally updating an already-trained vision model and language model. We find that a minimal set of parameter updates (<7%) can achieve the same performance as full-model training, and that updating specific components (<1% of parameters) can match 75% of full-model training. We describe a series of experiments: we show that existing knowledge is conserved more strongly by parameter-efficient training, and that its benefits hold as model and dataset size scale. Where paired image-text data is scarce but strong multilingual language models exist (e.g. for low-resource languages), parameter-efficient training is even preferable to full-model training. Given a fixed compute budget, parameter-efficient training allows larger models to be trained on the same hardware, achieving equivalent performance in less time. Parameter-efficient training hence constitutes an energy-efficient and effective training strategy for contrastive vision-language models that may be preferable to the full-model training paradigm for common use cases. Code and weights at https://github.com/codezakh/LilT.

1. INTRODUCTION

Advances in transfer learning within natural language processing (Houlsby et al., 2019b; Ben Zaken et al., 2022) have shown that when adapting to a novel task, updating a small percentage of the parameters (<1%) of a large, pretrained transformer-based language model can achieve nearly the same results as finetuning the entire model. Sung et al. (2021) showed that, given already-aligned visual representations (e.g. CLIP's visual encoder), only a small fraction (4%) of the parameters of a pretrained language model need to be updated for the language model to complete tasks such as visual question answering using those representations. However, creating aligned vision and language representations typically involves updating all the parameters of a language model and a vision model, which are often randomly initialized (Radford et al., 2021). Zhai et al. (2021) find that if the weights of a pretrained vision model are used as an initialization, only the parameters of the language model need to be updated to align the visual and language representations and match or exceed the performance of full-model training, resulting in a 50% reduction in trainable parameters. We take this line of investigation to its natural conclusion and ask: given that strong pretrained vision and language models both exist, can we minimally update the parameters of both to align their representations? Answering this question is valuable for two reasons. First, from a practical perspective, contrastive vision-language alignment constitutes a form of large-scale pretraining and hence a heavy energy expenditure. Methods for parameter-efficient transfer learning significantly reduce GPU memory requirements and can therefore lower energy costs.
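For concreteness, the contrastive alignment objective referenced throughout (the CLIP-style symmetric InfoNCE loss over in-batch pairs) can be sketched as follows. This is an illustrative sketch, not the paper's training code; the function name and temperature value are our own choices.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    The i-th image and i-th text form a positive pair; every other
    in-batch combination serves as a negative.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (B, B) cosine-similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
loss = clip_style_loss(img, txt)
```

Because this objective yields a single embedding per image and per caption, similarity search reduces to a dot product, which is what makes contrastive models suitable for latency-sensitive retrieval.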
Second, collecting millions of images with textual annotations is prohibitively expensive when image-text pairs cannot be scraped from the internet at scale, such as for low-resource languages or for image domains that require expert descriptions. In these cases, transfer learning that maximally preserves knowledge from strong unimodal pretraining becomes compelling. Our contributions can be summarized as follows.

• We show that contrastive vision-language models can be created by updating a relatively small (<7%) set of parameters in pretrained vision and language models, an approach we dub LilT (Locked image-language Tuning) for brevity.

• We conduct a detailed empirical study of various methods for parameter-efficient transfer learning, their combinations, and their interactions.

• We show that contrastive vision-language models created with parameter-efficient transfer learning conserve useful existing knowledge from their initializations better than full-model finetuning does, and that this has benefits in realistic scenarios.

Limitations Similar to Desai & Johnson (2021), we conduct most of our experiments on the COCO dataset, with additional scaling experiments on a larger dataset of 1.5M pairs; our conclusions may not hold beyond this range. Second, we focus on zero-shot classification and information retrieval tasks, so our conclusions may not hold for other uses of image-text embeddings, such as serving as input for downstream vision-language tasks. Finally, we explicitly limit the scope of the study to transformer-based contrastive vision-language models, so our conclusions may not apply to models based on other architectures. Despite these limitations, we believe our conclusions are useful because realistic settings often provide far fewer than 1.5M image-text pairs (e.g. low-resource languages).
Outline First, we cover background material (§2.1), then introduce our approach of parameter-efficient transfer learning for contrastive vision-language alignment (§2). We then describe experiments and a discussion of experimental results (§3), followed by related work (§4).

2. METHODS

The basic idea of our approach is to align a vision model and a language model by updating a small percentage of their parameters through gradient descent. This involves four main elements. First, the vision and language models must be initialized from strong, pretrained vision and language models rather than randomly. Second, we lock all the parameters in each model. Third, we selectively unlock critical parameters. Fourth, we insert small trainable modules into each model to aid adaptation. There are multiple ways of implementing these strategies, which we cover in this section.
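The last three elements can be sketched in PyTorch as below. This is a minimal illustration under our own assumptions: the bottleneck width of the inserted module and the choice to unlock LayerNorm affine terms and bias vectors are one plausible configuration, not the paper's definitive recipe.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small trainable bottleneck module with a residual connection (illustrative sizes)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, dim)    # project back up
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the adapted model close to its initialization.
        return x + self.up(self.act(self.down(x)))

def make_parameter_efficient(model: nn.Module) -> nn.Module:
    # Lock every parameter of the pretrained backbone.
    for p in model.parameters():
        p.requires_grad = False
    # Selectively unlock "critical" parameters; here we pick LayerNorm
    # affine terms and all bias vectors (one choice among several).
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            for p in module.parameters():
                p.requires_grad = True
    for name, p in model.named_parameters():
        if name.endswith("bias"):
            p.requires_grad = True
    return model

# Toy stand-in for one block of a pretrained transformer encoder.
block = make_parameter_efficient(nn.TransformerEncoderLayer(d_model=128, nhead=4))
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
total = sum(p.numel() for p in block.parameters())
```

In this toy block, the unlocked LayerNorm and bias terms amount to well under 1% of the parameters; any inserted `Adapter` modules would be trained from scratch on top of the frozen backbone.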



Figure 1: A conceptual diagram. After unimodal pretraining, parameter-efficient transfer to contrastive vision-language alignment is achieved by changing as few as 0.3% of the parameters from initialization, matching the performance of full model training.

