CONTRASTIVE ALIGNMENT OF VISION TO LANGUAGE THROUGH PARAMETER-EFFICIENT TRANSFER LEARNING

Abstract

Contrastive vision-language models (e.g. CLIP) are typically created by updating all the parameters of a vision model and language model through contrastive training. Can such models be created by a small number of parameter updates to an already-trained language model and vision model? The literature describes techniques that can create vision-language models by updating a small number of parameters in a language model, but these require already aligned visual representations and are non-contrastive, hence unusable for latency-sensitive applications such as neural search. We explore the feasibility and benefits of parameter-efficient contrastive vision-language alignment through transfer learning: creating a model such as CLIP by minimally updating an already-trained vision and language model. We find that a minimal set of parameter updates (<7%) can achieve the same performance as full-model training, and updating specific components (<1% of parameters) can match 75% of full-model training. We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training and that parameter-efficient training scales with model and dataset size. Where paired image-text data is scarce but strong multilingual language models exist (e.g. for low-resource languages), parameter-efficient training is even preferable to full-model training. Given a fixed compute budget, parameter-efficient training allows training larger models on the same hardware, achieving equivalent performance in less time. Parameter-efficient training hence constitutes an energy-efficient and effective training strategy for contrastive vision-language models that may be preferable to the full-model training paradigm for common use cases. Code and weights at https://github.com/codezakh/LilT.

1. INTRODUCTION

Advances in transfer learning within the field of natural language processing (Houlsby et al., 2019b; Ben Zaken et al., 2022) have shown that when adapting to a novel task, updates to a small percentage of neurons (<1%) in large, pretrained transformer-based language models can achieve nearly equivalent results to finetuning the entire model. Sung et al. (2021) showed that given the existence of already-aligned visual representations (e.g. CLIP's visual encoder), only a small number (4%) of parameters in a pretrained language model need to be updated for the language model to complete tasks such as visual question answering using the already-aligned visual representations. However, the creation of aligned vision and language representations typically involves updating all the parameters of a language model and a vision model, which are often randomly initialized (Radford et al., 2021). Zhai et al. (2021) find that if the weights of a pretrained vision model are used as an initialization, only the parameters of the language model need to be updated to align the visual and language representations and match or exceed the performance of full-model training, resulting in a 50% reduction in trainable parameters. We take this line of investigation to its natural conclusion, asking: given that strong, pretrained vision and language models both exist, can we minimally update both of their parameters to align their representations? Answering this question is valuable for two reasons. First, from a practical perspective, contrastive vision-language alignment constitutes a form of large-scale pretraining and hence a heavy energy expenditure. Methods for parameter-efficient transfer learning significantly reduce GPU memory requirements, and can therefore lower energy costs.
Second, collecting millions of images with textual annotations is prohibitively expensive when millions of image-text pairs cannot be scraped from the internet, such as in the case of low-resource languages or images from domains that require expert descriptions. In these cases, transfer learning that maximally preserves knowledge from strong, unimodal pretraining becomes compelling. Our contributions can be summarized as follows.
• We show contrastive vision-language models can be created by updates to a relatively small (<7%) set of parameters in pretrained vision and language models, which we dub LilT (Locked image-language tuning) for brevity.
• We conduct a detailed empirical study of combinations and interactions of various methods for parameter-efficient transfer learning.
• We show that contrastive vision-language models created with parameter-efficient transfer learning conserve useful existing knowledge from their initializations better than full-model finetuning, and that this has benefits in realistic scenarios.
Limitations Similar to Desai & Johnson (2021), we conduct most of our experiments on the COCO dataset, and conduct additional scaling experiments with a larger dataset of 1.5M pairs. There is a possibility that our conclusions may not hold beyond this range. Second, we choose to focus on zero-shot classification and information retrieval tasks. Our conclusions may not hold for other uses of image-text embeddings, such as using them as input for downstream vision-language tasks. Finally, we explicitly limit the scope of the study to transformer-based contrastive vision-language models. Thus, our conclusions may not apply to models based on other architectures. Despite these limitations, we believe our conclusions are useful because there are realistic situations in which far fewer than 1.5M image-text pairs are available (e.g. low-resource languages).
Outline First, we introduce our approach of parameter-efficient transfer learning for contrastive vision-language alignment (§2), covering the necessary background material (§2.1). We then describe experiments and a discussion of experimental results (§3), followed by related work (§4).

2. METHODS

The basic idea of our approach is to align a vision model and a language model by updating a small percentage of their parameters by gradient descent. This involves four main elements. First, the vision and language models must be initialized from strong, pretrained vision and language models, rather than randomly. Second, we lock all the parameters in each model. Third, we selectively unlock critical parameters. Fourth, we insert small trainable modules into each model to aid adaptation. There are multiple ways of implementing these strategies, which we cover in this section.
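As a concrete illustration, the locking and selective unlocking steps can be sketched as follows. This is a minimal sketch in the style of PyTorch, not the released LilT code; `configure_for_lilt` and its keyword matching are our own illustrative names, and a full implementation would additionally insert the trainable adapter modules.

```python
# Illustrative sketch (ours, not the authors' released code) of the
# lock-then-selectively-unlock steps, written against a PyTorch-style
# interface: `model.named_parameters()` yields (name, param) pairs and
# each param exposes a `requires_grad` flag and `numel()`.
def configure_for_lilt(model, unlock_keywords=("LayerNorm", "bias")):
    """Lock every parameter, then unlock only those whose name matches
    an unlock keyword, e.g. layer-norm scales/biases (LN unlocking) or
    all bias terms (BitFit). Returns the trainable parameter fraction."""
    trainable = total = 0
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in unlock_keywords)
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable / total
```

The returned fraction makes it easy to verify that a configuration stays within the sub-1% budget of the parameter unlocking strategies described below.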

2.1. BACKGROUND

In this section, we briefly cover the mechanics of contrastive language-image alignment as used by Radford et al. (2021), as well as the common "two-tower" (Zhai et al., 2021) dual transformer encoder architecture employed by CLIP-style models. Contrastive language-image alignment pulls representations of matched image-text pairs together, while pushing those of unmatched pairs apart. The goal is to learn an image encoder f_θ and a text encoder g_ϕ such that given an image-text pair (x^I, x^T), the encoded representations f_θ(x^I) and g_ϕ(x^T) are close under a distance metric if they are semantically similar and far apart if not. Let {(x^I_k, x^T_k)}_{k=1}^b be a batch of b image-text pairs. For each image x^I_k in an image-text pair (x^I_k, x^T_k), the matched text x^T_k is the positive, while all other texts within the batch are used as negatives. The image-to-text contrastive loss for x^I_k is then

L^I_k(x^I_k, {x^T_j}_{j=1}^b) = -(1/b) log [ exp(s^I_{k,k}) / Σ_j exp(s^I_{k,j}) ],

where s^I_{k,j} is the similarity of the k-th image to the j-th text. The similarity function is usually taken to be the cosine similarity, which can be computed as f_θ(x^I) · g_ϕ(x^T) if the representations are normalized to unit length. Conversely, the text-to-image contrastive loss for x^T_k is

L^T_k(x^T_k, {x^I_j}_{j=1}^b) = -(1/b) log [ exp(s^T_{k,k}) / Σ_j exp(s^T_{j,k}) ].

The complete training loss is then L = (1/2) Σ_{k=1}^b (L^I_k + L^T_k). Architectures for contrastive language-image alignment must encode both texts and images into vector representations. This is usually implemented using separate text and image encoders.
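The symmetric loss above can be sketched in a few lines of NumPy. This is our illustrative implementation, not the paper's code: the per-sample 1/b factor is folded into a final mean, and the learnable temperature that CLIP-style models apply to the similarities is omitted for brevity.

```python
# Minimal NumPy sketch (ours) of the symmetric contrastive loss.
import numpy as np

def clip_loss(img_emb, txt_emb):
    """img_emb, txt_emb: (b, d) arrays of L2-normalized embeddings, so
    s[k, j] = <f(x^I_k), g(x^T_j)> is a plain matrix product."""
    b = img_emb.shape[0]
    s = img_emb @ txt_emb.T                               # (b, b) similarities
    # image-to-text: in row k, entry k is the positive
    log_p_i2t = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
    # text-to-image: in column k, entry k is the positive
    log_p_t2i = s - np.log(np.exp(s).sum(axis=0, keepdims=True))
    diag = np.arange(b)
    return 0.5 * (-log_p_i2t[diag, diag] - log_p_t2i[diag, diag]).mean()
```

Matched pairs (high diagonal similarity) yield a lower loss than mismatched ones, which is what the training loop minimizes.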
A variety of choices are possible for these encoders, but we restrict ourselves to the popular (Radford et al., 2021; Li et al., 2021a; b; Yao et al., 2021; Khan et al., 2022; Zhai et al., 2021; Yang et al., 2022; Wang et al., 2021) choice of transformer (Vaswani et al., 2017) architectures, specifically, the BERT (Devlin et al., 2019) family of language models for the text encoder, and the ViT (Dosovitskiy et al., 2021) family for the image encoder. Let t(•) denote an arbitrary architecture from one of the above families. After consuming an input x, the transformer t(•) produces a sequence of vectors t(x) = {z cls , z 1 , . . . , z N }, where z cls is the embedding of the [CLS] token, which is taken to be the representation of the input x following dimensionality reduction by a trainable linear projection.

2.2. ADDING ADAPTERS

Aligning the representations of a language transformer and a vision transformer is typically done by updating 100% of the parameters in one (Zhai et al., 2021) or both (Radford et al., 2021) of the transformers. By freezing the transformers, we exclude full-model training, and must use an alternative strategy to align the image and text representations. A promising approach is to insert a small (relative to each transformer) trainable module into the frozen, pretrained transformers that can learn to modify the internal representations of the transformer it is placed within, such that the representation spaces of the frozen vision and language transformers become aligned while leaving the pretrained parameters untouched. We explore two such modules: layerwise adapters (Houlsby et al., 2019a; He et al., 2021) and "deep" adapters. Layerwise adapters (Houlsby et al., 2019a) have been used to adapt pretrained transformer-based language models to new tasks while only updating 2-3% of model parameters. A layerwise adapter is inserted before each layer normalization (Ba et al., 2016) layer in a transformer, and consists of a weight matrix that downsamples the input, followed by an activation function (we use GELU (Hendrycks & Gimpel, 2016)), a weight matrix that restores the input to the original dimensionality, and finally, a residual connection. We depict the architecture and placement of layerwise adapters in Fig. 3. Another solution is to treat the frozen encoders as feature extractors, and learn trainable adapters that align the frozen image and text features. Transformer architectures can be seen as a stack of identically structured transformer encoder layers, so a natural solution to the problem of designing a trainable adapter atop a stack of frozen transformer encoder layers is to grow the stack, and keep the newly added layers trainable. This yields a generic approach (Fig. 2) for adding a trainable deep adapter to a frozen transformer from any of the standardized families (e.g. BERT (Devlin et al., 2019), ViT (Dosovitskiy et al., 2021)) that only requires a small number of parameters to receive gradients (≈7% for bert-base).

2.3. UNLOCKING PARAMETERS

We try two strategies for selectively unlocking parameters in a frozen transformer: unlocking the layer normalization (Ba et al., 2016) parameters, and BitFit (Ben Zaken et al., 2022). Standard transformers (Vaswani et al., 2017) have two layer normalization (Ba et al., 2016) modules in each transformer encoder layer, and these are known to play an important role (§4). Each layer normalization layer has a learnable scale γ and bias β that apply an elementwise scale and shift to its input. In the first strategy, we allow the layer normalization layers to remain unlocked and receive gradient updates. In BitFit (Bias-term Finetuning) (Ben Zaken et al., 2022), the additive bias terms of every module in a transformer encoder layer are allowed to remain unlocked and receive gradient updates. These strategies unlock a small percentage of parameters (0.24% and 0.31% respectively, in a 12-layer base transformer).
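The layerwise-adapter computation described above can be written compactly. This is our own sketch: bias terms are omitted for brevity, and `W_down` / `W_up` are the only new (trainable) parameters the adapter introduces.

```python
# Illustrative adapter forward pass: down-project, GELU, up-project,
# residual connection (our sketch, not the authors' code).
import numpy as np

def gelu(x):
    # tanh approximation of GELU (Hendrycks & Gimpel, 2016)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layerwise_adapter(x, W_down, W_up):
    """x: (n, d) token representations; W_down: (d, r), W_up: (r, d),
    with bottleneck width r << d."""
    return x + gelu(x @ W_down) @ W_up
```

Note that with a zero-initialized `W_up` the adapter computes the identity function at the start of training, a common trick so that inserting it does not perturb the pretrained model's representations.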

2.4. IMPLEMENTATION DETAILS

Datasets We draw 591,753 image-text pairs from the training set of COCO2014 (Lin et al., 2014), following the split of Karpathy & Fei-Fei (2017). The weights of the vision encoders are initialized from DeiT (Touvron et al., 2021), and the text encoders are initialized from SimCSE (Gao et al., 2021). We train each model with a batch size of 512 on 4x NVIDIA A6000 GPUs for 15 epochs, using the AdamW optimizer (Loshchilov & Hutter, 2019).

3. EXPERIMENTS

We conduct experiments on zero-shot multimodal classification, image-text retrieval, and multilingual image-text retrieval to investigate the following research questions.
1. Can contrastive vision-language models be created through parameter-efficient transfer learning?
2. How do different methods for parameter-efficient transfer learning interact with each other?
3. Do contrastive vision-language models created through parameter-efficient transfer learning conserve useful knowledge from their initializations better than full-model finetuning?
4. Does parameter-efficient transfer learning scale with respect to model size and dataset size?
We evaluate all models on five tasks: zero-shot natural-language guided image classification (Radford et al., 2021), image-to-text retrieval (TR), text-to-image retrieval (IR), and zero-shot TR/IR. For zero-shot classification, we use the ImageNetV2 (Recht et al., 2019) test set. For IR/TR, we use the COCO2014 test split of Karpathy & Fei-Fei (2017), containing 5k images and 25k captions. For zero-shot IR/TR, we use the test set of Flickr30k (Plummer et al., 2015), containing 1k images and 5k captions.
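As a concrete illustration of how the retrieval metrics can be scored, the sketch below (our own helper, not the paper's evaluation code) computes Recall@K from a query-target similarity matrix, assuming a one-to-one pairing where the correct target for query i is target i; COCO actually pairs each image with five captions, which requires a small generalization of this sketch.

```python
# Recall@K for retrieval from a similarity matrix (illustrative).
import numpy as np

def recall_at_k(sim, k):
    """sim: (n_queries, n_targets) similarity matrix. Returns the
    fraction of queries whose correct target is among the top-k."""
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())
```

For TR the queries are images and the targets captions; for IR the roles are swapped, i.e. `recall_at_k(sim.T, k)`.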

3.1. ABLATION STUDY

The results of the study are displayed in Table 1.
Discussion First, it is clear that creating contrastive vision-language models through parameter-efficient transfer learning is feasible, and there are clear differences between the model capabilities induced by different parameter-efficient transfer learning methods. Layerwise adapters stand out as the parameter-efficient transfer learning strategy capable of matching or exceeding full-model training. However, in cases where the language distribution is sufficiently simple (e.g. a list of singular words), parameter unlocking is sufficient, and easier to implement. Deep adapters stand out for their ability to achieve better performance than full-model training when combined with LiT (m).

3.2. CONSERVATION OF KNOWLEDGE FROM INITIALIZATION

We hypothesize that parameter-efficient transfer learning preserves more knowledge from initialization than full-model finetuning, and that this is beneficial in some realistic scenarios. Low-resource languages likely do not have large-scale image-text pairs available to train a multimodal CLIP-like model for that language. However, unimodal, multilingual language models that have been trained on a dataset containing sentences from a given low-resource language often exist. A possible solution in this situation is to train a CLIP-like model on available image-text pairs from a high-resource language, while using a multilingual language model as the text encoder. The resulting model may be able to generalize to image-text retrieval tasks in a language unseen during vision-language alignment due to the multilinguality of the pretrained text encoder. We simulate this setting by aligning a pretrained multilingual BERT-base model with an ImageNet-pretrained ViT-B/16 on English-only image-text pairs, and evaluate it on image-text pairs in seven different languages for which the model was never provided paired images. If parameter-efficient training preserves more knowledge from initialization, and that knowledge is useful, we expect that the retrieval model created through parameter-efficient transfer learning should retain more of its multilingual language ability, and hence display greater accuracy on non-English languages. We reuse the English training data from §2.4, and evaluate each model on the test set of Aggarwal & Kale (2020), which contains 1400 image-text pairs, split equally between Russian, Polish, Turkish, Chinese, Korean, Italian, and Spanish. We summarize results in Table 2. LilT LwA outperforms CLIP on 12/14 tasks (5.3% absolute improvement), while LilT DA achieves better performance than CLIP on 11/14 tasks (1.4% absolute improvement).
This suggests that parameter-efficient transfer learning conserves more information from initialization, and that information is useful for multimodal tasks.

3.3. SCALING WITH RESPECT TO DATA AND MODEL SIZE

Can parameter-efficient transfer learning take advantage of larger models and larger amounts of data? We test the performance of parameter-efficient transfer learning as the number of image-text pairs is increased from 591k to 1500k (Table 4) and as model size is increased (Table 3).

We attempt to understand how alignment changes the language and vision models by studying their layer normalization layers. Let f_θ be an image encoder and g_ϕ a text encoder. We initialize f_θ with weights from DeiT (Touvron et al., 2021), and g_ϕ with weights from SimCSE (Gao et al., 2021). We then lock all parameters except the layer normalization layers (configuration (c) in Tab. 1), and train the model following the standard CLIP training procedure, resulting in a pair of aligned encoders (f̄_θ, ḡ_ϕ). In total, we have four different models: the unaligned and aligned image encoders (f_θ, f̄_θ) and the unaligned and aligned text encoders (g_ϕ, ḡ_ϕ). Without loss of generality, we describe our procedure for the text encoder pair (g_ϕ, ḡ_ϕ). Let LN^1_i(γ, β) and LN^2_i(γ, β) denote the two normalization sublayers of the i-th layer in the transformer encoder stack. For each layer i ∈ {1, 2, ..., N}, we plot the L1 norm of the difference between the trainable layer normalization parameters γ, β of the aligned and unaligned encoders. We plot the results in Fig. 4. Surprisingly, the text and image encoders display clearly opposite patterns (negative Pearson's r). In the text encoder, the difference between the aligned and unaligned layer normalization parameters decreases with depth: layer normalization parameters in the deeper layers of the text encoder change less as a result of alignment training. The image encoder shows the opposite pattern: its deeper layer normalization parameters change more.
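The drift analysis above can be sketched as follows. This is our own illustration (the dict-of-arrays layout for LN parameters is an assumption for clarity, not the paper's data format): for each depth we take the L1 norm of the change in the LN parameters between the unaligned and aligned encoder, then correlate drift with depth.

```python
# Layer-normalization drift per layer, plus Pearson's r (illustrative).
import numpy as np

def ln_drift_per_layer(unaligned, aligned):
    """unaligned/aligned: per-layer lists of {'gamma': ..., 'beta': ...}.
    Returns the L1 norm of the LN parameter change at each depth."""
    return [float(np.abs(a["gamma"] - u["gamma"]).sum()
                  + np.abs(a["beta"] - u["beta"]).sum())
            for u, a in zip(unaligned, aligned)]

def pearson_r(x, y):
    """Pearson correlation coefficient, e.g. between drift and depth."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))
```

A negative `pearson_r(drift, depth)` for the text encoder and a positive one for the image encoder would reproduce the opposite patterns shown in Fig. 4.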

Discussion

The patterns in the layer normalization layers may indicate that during alignment, the language and image modalities undergo changes at different semantic levels. The shallowest three layer normalization layers of the ViT-B/16 experience a ≈70% lower magnitude shift than the deepest three layers. The shallow layers of a vision transformer attend more to local information (Raghu et al., 2021), while the deeper layers attend more to global context. Intuitively, this makes sense: we should expect an asymmetry between the amount of information in a short image caption and that in a dense image, since simple natural language concepts are often visually complex. Interestingly, this asymmetry has already been exploited by certain vision-language models: Khan et al. (2022) and Li et al. (2021a) align the lower half of their text encoder to the visual encoder, while using the top half for a different purpose. This is consistent with our finding that the lower layers of the text encoder change the most during alignment.

4. RELATED WORK

Vision-Language Pretraining The dual-encoder CLIP (Radford et al., 2021) (400m pairs) and ALIGN (Jia et al., 2021) (1b+ pairs) architectures were the first attempts at large-scale contrastive image-language alignment, using the InfoNCE (van den Oord et al., 2018) loss to maximize the mutual information between matched image and text pairs. Subsequent work (Pham et al., 2021; Li et al., 2021b; Yao et al., 2021; Cui et al., 2022; Yang et al., 2022; Khan et al., 2022; Li et al., 2021a) has improved on the training tasks, dataset, and architecture of CLIP. While systems utilizing a multimodal encoder and cross attention (Li et al., 2022; Khan et al., 2022; Wang et al., 2022; Lu et al., 2022; Zhu et al.) perform better on benchmarks, their multimodal encoder makes them unsuitable for latency-sensitive search applications, because rather than learning separate but aligned image and text embeddings, they learn a single multimodal embedding for an image-text pair. Thus, neural search remains the domain of contrastive vision-language models. Frozen Language Models Tsimpoukelli et al. (2021) demonstrated that pretrained large language models are capable of quickly adapting to image understanding. They use a frozen autoregressive transformer-based language model, and train a ResNet (He et al., 2016) to transform images into input the frozen transformer can understand, by backpropagating the loss through the frozen transformer. MAGMA (Eichenberg et al., 2021), FROMAGE (Koh et al., 2023) and FLAMINGO (Alayrac et al., 2022) scaled the conceptual approach of Tsimpoukelli et al. (2021) to billions of parameters, and recently, Merullo et al. (2022) have shown that a simple linear mapping is enough to allow a frozen large language model to (roughly) understand visual input, as long as the visual encoder has been trained to represent visual concepts aligned to language (e.g. CLIP). Furthermore, emerging approaches such as BLIP-2 (Li et al., 2023) show that by combining soft prompting with a frozen LLM and a trainable visual encoder, an LLM can achieve state-of-the-art accuracy on visuolinguistic understanding tasks such as visual question answering. Lu et al.
(2021) propose the idea that transformers trained on language are capable of a form of universal computation, and can adapt to new tasks even while frozen, and do so better than fine-tuned models. However, Rothermel et al. (2021) find that these findings may be reversed under certain hyperparameter settings. Interestingly, both note that the normalization layers seem to play an important role in this adaptation. Parameter-Efficient Finetuning Many forms of adapters (Houlsby et al., 2019b; Karimi Mahabadi et al., 2021; Mahabadi et al., 2021) have been explored in natural language processing. VL-Adapter (Sung et al., 2021) investigates adapters in vision-language settings, but assumes aligned visual representations. Lester et al. (2021) find that for very large language models, parameter-efficient adaptation approaches such as soft prompting are equivalent to finetuning the large language model. Liu et al. (2021) extend this finding, showing that combining soft prompting with adapters can often exceed finetuning on a given downstream task. Both prefix (Li & Liang, 2021) and prompt (Lester et al., 2021) tuning can also be understood as exploiting the knowledge in frozen transformers, as their optimization loops involve freezing the language model, effectively turning it into a part of the loss. Zhang & He (2020) develop a training scheme that progressively unfreezes and freezes layers of a transformer language model, and see significant improvements in training speed. Progressive growth approaches (Gu et al., 2021) slowly increase the depth of a transformer as training proceeds. Layer Normalization in Transformers Kovaleva et al. (2021) find that the representations of transformers contain outlier dimensions that disrupt the quality of the learned embedding, and trace them to high-magnitude parameters in the layer normalization layers. A variety of techniques targeting layer normalization in transformers have been proposed, with various benefits. Xiong et al.
(2020) prove that the placement of layer normalization layers relative to the residual connection in the transformer block contributes to learning instability under large learning rates, and propose an alternate placement. In contrast, FixUp (Huang et al., 2020) develops a novel initialization scheme for transformers that enables removing the normalization layers entirely. ReZero (Bachlechner et al., 2021) adds a learnable gate parameter to each residual connection before layer normalization, and demonstrates that extremely deep transformers can be trained quickly.

5. CONCLUSION & FUTURE WORK

We show that the performance of full-model training for contrastive vision-language alignment can be matched by updating a small number of parameters in existing vision and language models, combined with the insertion of small trainable modules. This suggests that the current paradigm of full-model training for contrastive vision-language alignment involves significant unnecessary computation, and can be replaced by parameter-efficient transfer learning when the downstream use cases are natural-language classification or image-text retrieval. Current alignment strategies align representations from the top of each encoder stack. We find that in the text encoder, alignment changes the normalization parameters of the shallowest layers the most, while the opposite holds for the image encoder. Investigating and exploiting this asymmetry between vision and language could yield further benefits for multimodal understanding or more efficient training strategies. For future work, it would be interesting to analyze whether CLIP-like models created through parameter-efficient transfer learning are similar to CLIP in ways other than performance: for example, are they more or less biased? Or more or less robust to distribution shift? Another useful line of investigation would be probing vision-language models further to understand how alignment training affects the ability of the model to understand language. In summary, we believe that existing training methods do not fully exploit the knowledge that exists in their initializations. Our approach presents one simple but effective way to use that knowledge.

6.4. EFFECT OF PRETRAINING

We compare pretrained and randomly initialized models across three dataset sizes, training three models, one for each dataset size. The results can be seen in Fig. 6. Compared to the randomly initialized model, the pretrained model is substantially better across all three datasets and all three model sizes. However, it is likely that the benefit of unimodal pretraining will diminish as the number of training pairs available for multimodal vision-language pretraining increases, although we do not explore this. We also compare combinations of unimodal pretraining methods (Fig. 7), training all models on 591k pairs following §2.4. The choice of unimodal pretraining method does affect the performance of the vision-language model. The combination of SimCSE and DeiT appears to be consistently better than other combinations, although on ImageNetV2, BERT-DeiT performs better.

6.5. ZERO-SHOT PROMPTS

Although CLIP (Radford et al., 2021) uses a prompt ensemble, we use only a single prompt for all datasets except SVHN: a photo of { }. For SVHN, we use the prompt a photo of the number { }.
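The single-prompt zero-shot procedure amounts to filling the template for every class name, embedding the prompts with the text encoder, and picking the class whose prompt embedding is closest to the image embedding. The sketch below is our own illustration (embeddings are assumed L2-normalized, and `build_prompts` / `zero_shot_classify` are illustrative names).

```python
# Single-prompt zero-shot classification (illustrative sketch).
import numpy as np

def build_prompts(class_names, template="a photo of {}"):
    """Fill the prompt template for each class name."""
    return [template.format(name) for name in class_names]

def zero_shot_classify(image_emb, class_text_embs):
    """image_emb: (d,); class_text_embs: (C, d). Returns the index of
    the class whose prompt embedding has highest cosine similarity."""
    return int(np.argmax(class_text_embs @ image_emb))
```

A prompt ensemble, as used by CLIP, would average several template embeddings per class before the argmax.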

6.6. ENCODER SYMMETRY

Which encoder matters more? We train three configurations of CLIP on 5k, 50k, and 591k pairs (Fig. 8). One is the symmetric CLIP-base, while the two asymmetric configurations have their text encoder or image encoder replaced with the "tiny" version. Across all three dataset scales, the model with the smaller text encoder performs worse. Zhai et al. (2021) find that on large-scale data (10m+ pairs), the opposite holds true: a larger image encoder is better than a larger language model.

6.7. DOES LILT WORK WITH SMALLER MODELS AND LESS DATA?

We test LilT and full-model training on smaller versions of transformers, corresponding to 'bert-base', 'bert-small', and 'bert-tiny', with decreasing amounts of image-text pairs (5k, 50k). The results are depicted in Figure 9 (LilT DA) and Figure 10 (full-model training). There are no idiosyncratic results: as model size decreases, performance decreases for both full-model training and parameter-efficient transfer learning. Similarly, as the amount of data decreases, performance decreases. This holds true for all tested combinations of dataset size and model size.



Figure 1: A conceptual diagram. After unimodal pretraining, parameter-efficient transfer to contrastive vision-language alignment is achieved by changing as few as 0.3% of the parameters from initialization, matching the performance of full model training.

Figure 2: Growing the transformer encoder stack to add a trainable deep adapter to a locked model. The deep adapter is architecturally identical to a layer from the encoder stack.

Figure 3: The architecture and placement of layerwise adapters combined with a layernorm unlocking strategy.

Figure 4: The depth of the layer normalization layers affects how much they are changed by alignment training, and the pattern is reversed between the image and text encoders. ρ is the Pearson correlation coefficient, and the translucent blue/yellow shading indicates 95% confidence intervals.

Figure 5: We freeze all parameters except the LN parameters, then progressively lock LN parameters by layer. Fig. 4 suggests that freezing the LN parameters in the deepest layers of the language model and the shallowest layers of the vision model (Pattern A) should have a smaller effect on performance than the opposite pattern (Pattern B), relative to the baseline (LNs in every layer unlocked), which is what we observe.


Figure6: The effect of pretraining on model performance.

Figure 7: A comparison of different kinds of pretraining on LilT performance. Each model is trained on 591k pairs.

Figure 9: LilT's performance scales with increasing model size and dataset size -it is not limited to a specific model size or dataset size. LilT DA is pictured.

Figure 10: The performance of full-model training on smaller models and with less data.

Table 1: An ablation study with bert-base as the text encoder and ViT-B/16 as the image encoder. A locked component does not receive gradient updates, while an unlocked component does. LN(T locked / I unlocked) indicates the layer normalization weights of the text encoder were locked while those of the image encoder received gradient updates, and vice versa for LN(T unlocked / I locked). θ is the trainable linear projection. TR and IR are the mean text retrieval and image retrieval scores across ranks 5 and 10. Deep (Fig. 2) and layerwise (Fig. 3) adapters are detailed in §2.2, and BitFit in §2.3.

Table 2: Cross-lingual zero-shot retrieval. A multilingual bert-base model is aligned with a ViT-B/16 on English image-text pairs from COCO, and evaluated on image-text pairs in languages unseen during alignment.

It is possible to align only one of the encoders in a parameter-efficient manner. While LiT (k) excels at image classification, it suffers from a similar problem as parameter unlocking strategies: it is relatively poor at image-text retrieval.



Table 3: Zero-shot task performance of base/large models after parameter-efficient training. LwA/DA indicates the adapter type, corresponding to rows h/f in Table 1.

Table 4: Zero-shot performance of base models after larger-scale pretraining (1.5M pairs).

ACKNOWLEDGMENTS

This work was supported by a faculty award from NEC Laboratories America.

6. APPENDIX

6.1. ADDITIONAL DATASETS

We conduct zero-shot classification experiments on three further datasets (Table 5): CIFAR-100 (Krizhevsky, 2009), SVHN (Netzer et al., 2011), and ImageNet-A (Hendrycks et al., 2021). As CIFAR-100 and SVHN are both standard datasets, we only briefly describe them here. The CIFAR-100 dataset consists of 60k 32x32 colour images divided into 100 classes containing 600 images per class. Each class has 500 training and 100 test images, for a total of 50k training and 10k test images. We use the CIFAR-100 test set for the evaluations. SVHN is a harder version of MNIST (Deng, 2012), consisting of natural images of digits cropped from street-level pictures. We use the 26k test images for evaluation. ImageNet-A consists of natural adversarial examples from the ImageNet1k distribution: natural, correctly labeled images that classifiers incorrectly classify with high confidence. We use the 7k test images.

Table 5: Evaluation on additional zero-shot classification tasks. First place is in bold and second place is in red. LilT models are boxed in green. Acc-1 stands for top-1 accuracy, and Acc-5 is top-5 accuracy. Higher is better.

6.2. NATURAL ADVERSARIAL EXAMPLES

Vision-language models display impressive performance on ImageNet-A, which can be considered a "hard slice" of the ImageNet distribution, containing samples that are problematic for supervised classifiers. Surprisingly, the zero-shot classification performance of self-supervised vision-language models on ImageNet-A matches and sometimes exceeds the performance of supervised classifiers (ResNet-50 (He et al., 2016) and VGG-19 (Simonyan & Zisserman, 2015)). This may be partially due to parameter count: most of the vision-language models have more total parameters than the supervised CNNs. However, considering that the vision-language models are facing a harder problem (zero-shot classification), their performance relative to supervised CNNs is notable.

6.3. WHERE DO THE MODELS FAIL?

On the SVHN dataset, performance is poor. The large models perform worse than random chance (< 10%), and the smaller the model, the better it performs. One explanation could be that there is no way for the models to learn a correspondence between images of digits and the name of each digit, as nothing similar appears in the COCO training distribution, which only contains common objects.

