LEARNING THE VISUALNESS OF TEXT USING LARGE VISION-LANGUAGE MODELS
Anonymous authors
Paper under double-blind review

Abstract

Visual text evokes an image in a person's mind, while non-visual text fails to do so. A method to automatically detect visual text would unlock the ability to augment text with relevant images, as neural text-to-image generation and retrieval models operate on the implicit assumption that the input text is visual in nature. We curate a dataset of 3,620 English sentences and their visualness scores provided by multiple human annotators. Additionally, we use documents that contain text and visual assets to create a distantly supervised corpus of document text and associated images. We also propose a fine-tuning strategy that adapts large vision-language models like CLIP, which assume a one-to-one correspondence between text and image, to the task of scoring text visualness from text input alone. Our strategy modifies the model's contrastive learning objective to map text identified as non-visual to a common NULL image while matching visual text to its corresponding images in the document. We evaluate the proposed approach on its ability to (i) classify visual and non-visual text accurately, and (ii) attend over words that are identified as visual in psycholinguistic studies. Empirical evaluation indicates that our approach performs better than several heuristics and baseline models on the proposed task. Furthermore, to highlight the importance of modeling the visualness of text, we conduct qualitative analyses of text-to-image generation systems like DALL-E. We release the curated dataset and code.¹

1. INTRODUCTION

People typically communicate knowledge and information textually, but most prefer visually rich content. Text-to-image generation and retrieval models could augment text with appropriate associated images, aiding the creation of appealing and easy-to-understand documents. Recent models like DALL-E (Ramesh et al., 2021a; 2022) and Stable Diffusion (Rombach et al., 2022) work phenomenally well for input text that is carefully constructed to elicit images. However, they cannot handle long text that may or may not evoke a visual image. We introduce the task of quantifying sentence visualness, a term we use interchangeably with imageability, as a necessary first step toward connecting textual documents with visual assets. Consider the following two examples: "The flowerheads of Haemanthus coccineus ..., with scarlet spathe valves on them like bright shaving brushes, make it a striking plant" (V) and "A copyright notice is a notice of statutorily prescribed form that informs users of the underlying claim to copyright ownership in a published work" (NV). While V evokes an image in the reader's mind, NV will be considered non-visual by most. Vision-language models like ViLBERT (Lu et al., 2019), CLIP (Radford et al., 2021), and UNITER (Chen et al., 2020) have achieved remarkable performance on tasks like Visual Question Answering (VQA) (Antol et al., 2015), cross-modal retrieval (Wang et al., 2016), and Visual Commonsense Reasoning (VCR) (Zellers et al., 2019), but it is not clear how well these models can distinguish visual text from non-visual text. Text-to-image generation models like Stable Diffusion, DALL-E, and Imagen (Saharia et al., 2022) would benefit from inferring text visualness before generating images to embellish textual documents.
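To make the notion of "scoring text visualness from text alone" concrete, the sketch below assumes a fine-tuned text encoder in which non-visual sentences have been pulled toward a shared, learned NULL embedding; cosine distance from that NULL embedding then serves as a proxy for visualness. The function name, the rescaling, and the `null_emb` vector are illustrative assumptions, not part of the released code.

```python
import numpy as np

def visualness_score(text_emb: np.ndarray, null_emb: np.ndarray) -> float:
    """Map a sentence embedding to a visualness score in [0, 1].

    Hypothetical scoring rule: after fine-tuning, non-visual sentences sit
    close to the shared NULL embedding, so cosine distance from NULL acts
    as a proxy for visualness (0 = non-visual, 1 = maximally visual).
    """
    t = text_emb / np.linalg.norm(text_emb)
    n = null_emb / np.linalg.norm(null_emb)
    cos_sim = float(t @ n)          # cosine similarity in [-1, 1]
    return (1.0 - cos_sim) / 2.0    # rescale distance to [0, 1]
```

A sentence whose embedding coincides with the NULL embedding scores 0, an orthogonal one scores 0.5, and an anti-aligned one scores 1.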
In Figure 1a, we demonstrate this need with some examples: text identified to have low visualness leads to irrelevant generations from DALL-E, while text identified to have high visualness leads to the generation of relevant images. Prior approaches to quantifying the visualness of text operate at the word or phrase level (Deschacht & Moens, 2007; Jeong et al., 2012) and leverage lexicons that contain human-assigned word-level imageability ratings. In contrast, we operate at the sentence level. We curate a corpus of 3,260 sentences in English paired with their human ratings for visualness, as well as a noisy-but-large corpus of 48,077 automatic alignments between text and visual assets in documents, including a NULL non-visual image. The textual part of the resulting alignment pairs can be used as examples of visual and non-visual sentences. We propose a fine-tuning strategy for vision-language models like CLIP that allows classification inferences over text-only inputs. Our proposed objective also ensures that the learned embeddings remain usable for downstream tasks like text-to-image retrieval. We compare the performance of our proposed approach against several heuristic and model-based baselines. Our extensive evaluation suggests that our fine-tuning strategy leads to the most accurate classifier of visual and non-visual text. Finally, we conduct several analyses to glean insights into the model's learned attention mechanism, text-to-image retrieval abilities, and downstream text-to-image generation capabilities.
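The modified contrastive objective, mapping non-visual text to a common NULL image while matching visual text to its document image, can be sketched as follows. This is a minimal NumPy illustration of the idea, not the paper's implementation: each non-visual sentence's positive image is replaced with a shared NULL embedding before a standard InfoNCE-style loss is computed, and the function names and `temperature` value are assumptions.

```python
import numpy as np

def log_softmax(z: np.ndarray) -> np.ndarray:
    """Row-wise log-softmax, numerically stabilized."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def visualness_contrastive_loss(text_emb, image_emb, is_visual, null_emb,
                                temperature=0.07):
    """InfoNCE-style loss where non-visual text is matched to a NULL image.

    text_emb:  (n, d) sentence embeddings
    image_emb: (n, d) embeddings of each sentence's document image
    is_visual: (n,) boolean mask; False marks non-visual sentences
    null_emb:  (d,) shared NULL image embedding
    """
    # Replace the positive image of every non-visual sentence with NULL.
    targets = np.where(is_visual[:, None], image_emb, null_emb[None, :])
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    v = targets / np.linalg.norm(targets, axis=-1, keepdims=True)
    logits = (t @ v.T) / temperature      # (n, n) similarity matrix
    # The i-th sentence should match the i-th (possibly NULL) image.
    return float(-np.mean(np.diag(log_softmax(logits))))
```

Note that when several non-visual sentences appear in one batch they share the same NULL target, so duplicate NULL columns act as identical negatives; a faithful implementation would need to handle this consistently, a detail the sketch ignores.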

2. RELATED WORK

There are two research themes related to our work: (i) large vision-language models and their adaptation to downstream multimodal tasks, and (ii) understanding and quantifying the visualness of words.

Fine-tuning Vision-Language Models for Downstream Tasks: Vision-language models aim to process and relate information across the visual and language modalities (Baltrušaitis et al., 2018; Yuan et al., 2021; Radford et al., 2021; Lu et al., 2019; Tan & Bansal, 2019). Large models like CLIP (Radford et al., 2021), UNITER (Chen et al., 2020), and ALIGN (Jia et al., 2021) have demonstrated remarkable performance on downstream tasks via transfer learning or fine-tuning. However, such downstream tasks assume both text and image as input, whether to determine cross-modal similarity or to generate/retrieve one modality from the other; examples include visual question answering (Antol et al., 2015), caption generation (Xu et al., 2015), and cross-modal retrieval (Wang et al., 2016). Fine-tuning large vision-language models on such downstream tasks involves adding components to the encoders' architecture and training additional parameters on a task-specific dataset; the additional components could be fusion layers with cross-attention for multimodal classification (Mittal et al., 2022), or a Transformer-based generation module for caption generation (Sarto et al., 2022). Transferability and reusability of models and their learned representations across downstream tasks and domains are also desirable properties (Yosinski et al., 2014; Long et al., 2015), especially in light of catastrophic forgetting (Goodfellow et al., 2013). Our work differs from existing work in that the input is only text, requiring us to adapt large vision-language models to not rely on both modalities during inference.
We propose a fine-tuning strategy that does not involve additional architectural components (and parameters) on top of a pre-trained CLIP architecture and yet effectively adapts CLIP for learning text visualness. Our task can be



¹ Project webpage: redacted for anonymization




