ASIF: COUPLED DATA TURNS UNIMODAL MODELS TO MULTIMODAL WITHOUT TRAINING

Abstract

Aligning the visual and language spaces requires training deep neural networks from scratch on giant multimodal datasets; CLIP (Radford et al., 2021) trains both an image and a text encoder, while LiT (Zhai et al., 2022) manages to train just the latter by taking advantage of a pretrained vision network. In this paper, we show that sparse relative representations are sufficient to align text and images without training any network. Our method relies on readily available single-domain encoders (trained with or without supervision) and a modest (in comparison) number of image-text pairs. ASIF redefines what constitutes a multimodal model by explicitly disentangling memory from processing: here the model is defined by the embedded pairs of all the entries in the multimodal dataset, in addition to the parameters of the two encoders. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions about their data efficiency and about the role of retrieval in machine learning.

1. INTRODUCTION

Figure 1: ASIF is a simple recipe to align the representations of two frozen pre-trained models exploiting the fact that relative distances are preserved across different modes: the captions of similar images are themselves similar.

Large multimodal models such as CLIP (Radford et al., 2021) are rapidly becoming the standard for foundation models (Bommasani et al., 2021) in computer vision. This is largely due to their zero-shot and open-world capabilities that enable diverse suites of downstream tasks, from classification to detection and visual search. Overall, Radford et al. (2021) demonstrated that scale is the key ingredient for building a common latent space for images and text, and is sufficient to convincingly solve a multitude of tasks without training explicitly for them. In fact, CLIP's zero-shot classification accuracy on Imagenet (Deng et al., 2009) drops from 76.2 to 31.3 when using a public dataset of "just" 15M pairs (a curated subsampled version of YFCC100m (Thomee et al., 2016)) as opposed to the original private dataset of 400M pairs. Recently, Zhai et al. (2022) showed improvements in data efficiency by only training the text encoder, achieving Imagenet accuracy comparable to CLIP with 10M samples and outperforming it with a larger 901M data set.

Training models at such scale presents several challenges besides the obvious infrastructure and training costs. Notably, it requires collecting massive training sets, making it difficult to interpret the predictions of the model in light of their training data. Additionally, the training assets are often not owned by the institution training the model (Sun et al., 2017). This introduces several additional challenges, from reproducibility to the difficulty of ensuring that an asset owner can remove their data from the model (Golatkar et al., 2021; 2020a;b; Ginart et al., 2019; Guo et al., 2020). Overall, these considerations make large multimodal models relatively inaccessible to researchers and practitioners until checkpoints are released or access to a demo is granted.
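The idea that relative distances are preserved across modes can be made concrete with a minimal sketch. The snippet below is an illustrative approximation, not the paper's implementation: it assumes pre-computed, L2-normalized embeddings of the anchor image-text pairs (from any frozen unimodal encoders), represents a new image by its top-k similarities to the anchor images, represents each candidate caption by its top-k similarities to the anchor captions, and classifies by comparing these sparse relative representations. The function names and the choice of k are placeholders.

```python
import numpy as np

def relative_rep(z, anchors, k=8):
    # Similarities of one embedding to every anchor embedding of the
    # same modality, keeping only the k largest (sparsification),
    # then re-normalizing. All vectors are assumed L2-normalized.
    sims = anchors @ z                    # cosine similarities, shape (n_anchors,)
    top = np.argpartition(sims, -k)[-k:]  # indices of the k largest entries
    sparse = np.zeros_like(sims)
    sparse[top] = sims[top]
    return sparse / (np.linalg.norm(sparse) + 1e-8)

def asif_classify(image_emb, caption_embs, img_anchors, txt_anchors, k=8):
    # Zero-shot classification: the image's relative representation
    # (w.r.t. anchor images) is compared with each candidate caption's
    # relative representation (w.r.t. the paired anchor captions).
    r_img = relative_rep(image_emb, img_anchors, k)
    scores = [r_img @ relative_rep(c, txt_anchors, k) for c in caption_embs]
    return int(np.argmax(scores))
```

Because anchor image i and anchor caption i come from the same pair, coordinate i of the two relative representations refers to the same underlying concept, which is what makes the dot product across modalities meaningful.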

