ASIF: COUPLED DATA TURNS UNIMODAL MODELS TO MULTIMODAL WITHOUT TRAINING

Abstract

Aligning the visual and language spaces requires training deep neural networks from scratch on giant multimodal datasets; CLIP (Radford et al., 2021) trains both an image and a text encoder, while LiT (Zhai et al., 2022) manages to train just the latter by taking advantage of a pretrained vision network. In this paper, we show that sparse relative representations are sufficient to align text and images without training any network. Our method relies on readily available single-domain encoders (trained with or without supervision) and a modest (in comparison) number of image-text pairs. ASIF redefines what constitutes a multimodal model by explicitly disentangling memory from processing: here the model is defined by the embedded pairs of all the entries in the multimodal dataset, in addition to the parameters of the two encoders. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions on their data efficiency and on the role of retrieval in machine learning.

1. INTRODUCTION

Figure 1: ASIF is a simple recipe to align the representations of two frozen pre-trained models by exploiting the fact that relative distances are preserved across different modes: the captions of similar images are themselves similar.

Large multimodal models such as CLIP (Radford et al., 2021) are rapidly becoming the standard for foundation models (Bommasani et al., 2021) in computer vision. This is largely due to their zero-shot and open-world capabilities, which enable diverse suites of downstream tasks, from classification to detection and visual search. Overall, Radford et al. (2021) demonstrated that scale is the key ingredient for building a common latent space for images and text, and is sufficient to convincingly solve a multitude of tasks without training explicitly for them.

Training models at such scale presents several challenges besides the obvious infrastructure and training costs. Notably, it requires collecting massive training sets, making it difficult to interpret the predictions of the model in light of their training data. Additionally, the training assets are often not owned by the institution training the model (Sun et al., 2017). This introduces several additional challenges, from reproducibility to the difficulty of ensuring that an asset owner can remove their data from the model (Golatkar et al., 2021; 2020a;b; Ginart et al., 2019; Guo et al., 2020). Overall, these considerations make large multimodal models relatively inaccessible to researchers and practitioners until checkpoints are released or access to a demo is granted.

Figure 2: The ASIF construction. An ASIF model is defined by two unimodal pretrained encoders and a collection of coupled embeddings (in turquoise). This is sufficient to compare elements from different modes through their relative representations: rr(ŷj) is more similar to rr(x*) than rr(ŷi).
In this paper, we present ASIF, a simple procedure that turns pre-trained unimodal image and text models into a multimodal model using a relatively small¹ multimodal dataset and no additional training, as shown in Figure 1. The resulting model aligns latent representations of images and text, behaving as if it were contrastively trained on multimodal data like CLIP or LiT. The key intuition is that captions of similar images should themselves be similar, and therefore a representation crafted using just similarities to ground-truth multimodal pairs is quasi mode-invariant.

Our results are surprising and raise several questions. Despite (1) the simplicity of the approach, (2) a multimodal dataset that is up to 250 times smaller than in prior work, and (3) the lack of any actual training of the model on multimodal data, ASIF achieves zero-shot classification accuracy on downstream datasets that is roughly in the same ballpark as CLIP (Radford et al., 2021; Zhai et al., 2022). This raises important questions about data efficiency in foundation models, making ASIF a very powerful and cheap baseline for future work, and opening new doors for data-centric AI (Ng, 2022). In fact, ASIF comes with several interesting properties by construction. The absence of training makes the model editable: adding or removing image-text pairs and deploying a new multimodal model is a matter of seconds. Moreover, the representations are highly interpretable, since every dimension corresponds to the similarity of the input to a unique entry in the multimodal dataset.

In summary, we:

• Introduce the ASIF procedure, which turns two pretrained unimodal black-box encoders into an interpretable multimodal model without tuning a neuron, using a "limited" amount of multimodal data.

• Demonstrate the effectiveness of ASIF models on zero-shot image classification tasks, where they achieve performance in the same ballpark as CLIP with significantly fewer image-text pairs.
• Discuss key properties of ASIF, its implications for the role of memory and retrieval in machine learning, and the new opportunities it opens.

2. ALIGNING PRE-TRAINED MODELS VIA RELATIVE REPRESENTATIONS

In the following, we present how a collection of captioned pictures implicitly defines a common space for images and texts through relative representations, allowing us to build a multimodal model without training. Before that, we briefly discuss existing techniques for building this common space.
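To make the construction concrete, the following is a minimal NumPy sketch of the idea, not the paper's implementation: every name, dimension, and the sparsification level k here are illustrative assumptions. Each input is mapped to the vector of its cosine similarities to the embedded anchor pairs of its own mode, the vector is sparsified by keeping only the k largest entries, and inputs from different modes are then compared directly in this shared relative space, even though their absolute embedding spaces differ.

```python
import numpy as np

def relative_rep(z, anchors, k=800):
    """Map an absolute embedding z to its relative representation:
    cosine similarities to the anchor embeddings, sparsified by
    keeping only the k largest entries (k=800 is a hypothetical choice)."""
    z = z / np.linalg.norm(z)
    anchors = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    sims = anchors @ z                       # one similarity per anchor pair
    if k < len(sims):
        cutoff = np.partition(sims, -k)[-k]  # k-th largest similarity
        sims = np.where(sims >= cutoff, sims, 0.0)
    return sims / (np.linalg.norm(sims) + 1e-8)

# Hypothetical toy setup: n anchor image-text pairs embedded by two
# frozen unimodal encoders (random stand-ins; dimensions need not match).
rng = np.random.default_rng(0)
n, d_img, d_txt = 1000, 512, 384
img_anchors = rng.normal(size=(n, d_img))   # image embeddings of the pairs
txt_anchors = rng.normal(size=(n, d_txt))   # caption embeddings of the pairs

# A query image and candidate captions are compared in the shared
# relative space, despite living in different absolute spaces.
query_img = rng.normal(size=d_img)
captions = rng.normal(size=(5, d_txt))
rr_img = relative_rep(query_img, img_anchors)
scores = [relative_rep(c, txt_anchors) @ rr_img for c in captions]
best = int(np.argmax(scores))               # index of the best-matching caption
```

Zero-shot classification then amounts to using one caption per class (e.g. "a photo of a dog") as the candidate texts and picking the class whose relative representation scores highest against the image's.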



¹ Compared to the datasets used to train state-of-the-art multimodal models, our experiments use 1.6M captioned images.



In fact, CLIP's zero-shot classification accuracy on Imagenet (Deng et al., 2009) drops from 76.2 to 31.3 when using a public dataset of "just" 15M pairs (a curated, subsampled version of YFCC100m (Thomee et al., 2016)) as opposed to the original private dataset of 400M pairs. Recently, Zhai et al. (2022) showed improvements in data efficiency by training only the text encoder, achieving Imagenet accuracy comparable to CLIP with 10M samples and outperforming it with a larger 901M dataset.

