LINEARLY MAPPING FROM IMAGE TO TEXT SPACE

Abstract

The extent to which text-only language models (LMs) learn to represent features of the non-linguistic world is an open question. Prior work has shown that pretrained LMs can be taught to caption images when a vision model's parameters are optimized to encode images in the language space. We test a stronger hypothesis: that the conceptual representations learned by frozen text-only models and vision-only models are similar enough that this can be achieved with a linear map. We show that the image representations from vision models can be transferred as continuous prompts to frozen LMs by training only a single linear projection. Using these to prompt the LM achieves competitive performance on captioning and visual question answering tasks compared to models that tune both the image encoder and text decoder (such as the MAGMA model). We compare three image encoders with increasing amounts of linguistic supervision seen during pretraining: BEIT (no linguistic information), NF-ResNET (lexical category information), and CLIP (full natural language descriptions). We find that all three encoders perform equally well at transferring visual property information to the language model (e.g., whether an animal is large or small), but that image encoders pretrained with linguistic supervision more saliently encode category information (e.g., distinguishing hippo vs. elephant) and thus perform significantly better on benchmark language-and-vision tasks. Our results indicate that LMs encode conceptual information structurally similarly to vision-based models, even those that are solely trained on images.

1. INTRODUCTION

Much recent work in NLP has revolved around studying the limits on representational capacity incurred by training on form-only text data, as discussed in Bender & Koller (2020). Tied to this argument is the idea that without explicit grounding, language models are not inclined to learn conceptual representations of language that reflect the rich conceptual knowledge that humans gain from interacting with the physical, non-linguistic world. Despite this, there have been remarkable findings in large language models' abilities to generalize to and reason about non-linguistic phenomena (Tsimpoukelli et al., 2021; Eichenberg et al., 2021; Li et al., 2021; Patel & Pavlick, 2022). In this work, we test a stronger hypothesis about the structural similarity between language model and image encoder representations: that these conceptual representations can be approximately mapped to one another through a linear transformation. To do this, we train a single linear layer to project from the representation space of images into the language space of a generative LM without tuning any other model parameters, which we call LiMBeR: Linearly Mapping Between Representation spaces. That is, we linearly transform an image representation into "soft prompts"-vector(s) in the embedding space that do not correspond to discrete language tokens (Lester et al., 2021). The weights of this linear projection are tuned for an image captioning task (illustrated in Figure 1). We can then evaluate its performance on vision-language (VL) tasks at test time by examining the text the LM generates. Because of the simplicity of the linear transformation, we would expect that if the conceptual representation spaces of the two models are structured similarly, this transfer will be successful and the LM will have little trouble describing the contents of images.
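The projection described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the dimensions, the number of soft prompts, and all variable names are hypothetical, and the single weight matrix W stands in for the only trainable parameters, with the image encoder and LM assumed frozen.

```python
import numpy as np

D_IMG = 2048   # image encoder output dim (hypothetical)
D_LM = 4096    # LM token-embedding dim (hypothetical)
K = 4          # number of soft-prompt vectors produced per image (hypothetical)

rng = np.random.default_rng(0)
# W and b are the ONLY trained parameters in a LiMBeR-style setup;
# everything upstream (image encoder) and downstream (LM) stays frozen.
W = rng.normal(scale=0.02, size=(K * D_LM, D_IMG))
b = np.zeros(K * D_LM)

def image_to_soft_prompts(image_feats: np.ndarray) -> np.ndarray:
    """Linearly map a (D_IMG,) image feature vector to K soft prompts,
    each living in the LM's (D_LM,) embedding space."""
    return (W @ image_feats + b).reshape(K, D_LM)

# A frozen image encoder would produce image_feats; here we fake one.
image_feats = rng.normal(size=D_IMG)
prompts = image_to_soft_prompts(image_feats)
print(prompts.shape)  # (4, 4096)
```

The resulting (K, D_LM) array would be prepended to the LM's input embeddings like ordinary token embeddings; during training, gradients from the captioning loss flow back only into W and b.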
We use three different image encoders with increasing levels of linguistic supervision in pretraining: BEIT (Bao et al., 2021), Normalizer Free ResNet50 (NFRN50) (Brock et al., 2021), and CLIP (Radford et al., 2021) to train different projections into the LM. By linguistic supervision, we refer to the extent to which the image encoder was exposed to language data during its pretraining, which we expect to influence the representational similarity between it and an LM. While CLIP was pretrained to align images with full natural language captions in a shared image-text representation space, BEIT had no exposure to language and was trained by predicting the contents of masked-out sections of images. NFRN50 falls between these extremes: it was pretrained on an image classification task, identifying the subject of an image over the set of classes in ImageNet1k (Russakovsky et al., 2015). Although this task involves no natural language, the pretraining objective encourages the model to map visual features onto lexical categorical concepts (the image classes) derived from the WordNet hierarchy (Miller, 1995). We show that prompting an LM with any of the three image encoders effectively transfers semantic content from the image, which the LM describes in natural language. However, performance also appears proportional to the strength of the linguistic supervision the image encoder received. While CLIP and NFRN50 perform competitively with models whose components are tuned freely (e.g., Tsimpoukelli et al. (2021); Eichenberg et al. (2021)), BEIT appears to transfer mostly coarse-grained visual properties and struggles to elicit exact lexical categories from the LM. We interpret this as evidence that models trained on either language or vision data learn conceptual spaces that are structurally similar to each other, but that the exact degree of similarity depends on the type of supervision the image encoder receives.
In summary, we show: (1) that visual semantic information can be linearly mapped to language models in the form of soft prompts without tuning any model parameters; (2) that this mapping allows generative models to describe images and answer questions about images at a level comparable to multimodal models that tune image and language representations jointly; and (3) that, by training our prompting pipeline with different image encoder backbones, linguistic supervision in pretraining plays a key role in concept formation in models and thus in the transferability of visual features from vision to text spaces.

2. RELATED WORK

Our approach takes inspiration from recent work adapting pretrained language models to accept representations of images as inputs, particularly the Frozen and MAGMA models (Tsimpoukelli et al., 2021; Eichenberg et al., 2021), which prompt a generative LM with image representations to perform vision-language tasks.



Figure 1: We train linear projections from image representations into the input space of a language model to produce captions describing images. We find that LMs can describe the contents of most image representations, but performance varies based on the type of image encoder used.

These approaches either fine-tune the pretrained models or train non-linear MLP projection/fusion networks between modalities, making the learned representations harder to interpret than in our approach. Scialom et al. (2020) show that a learned linear transformation is sufficient for BERT to encode image region representations, which are then fed to a text decoder to generate questions about the image; however, it is not well understood what abstractions LMs can transfer from a transformation of this type, or whether a text decoder can operate directly on linear transformations of visual encodings.

AVAILABILITY

https://github.com/jmerullo/limber 

