VISUALLY-AUGMENTED PRETRAINED LANGUAGE MODELS FOR NLP TASKS WITHOUT IMAGES

Abstract

Although pre-trained language models (PLMs) achieve impressive performance through text-only self-supervised training, they have been found to lack visual semantics or commonsense, e.g., the sizes, shapes, and colors of commonplace objects. Existing solutions often rely on explicit images for visual knowledge augmentation (requiring time-consuming retrieval or generation), and they apply the augmentation to the whole input text, without considering whether it is actually needed for specific inputs or tasks. To address these issues, we propose VAWI, a novel visually-augmented fine-tuning approach that can be generally applied to various PLMs and NLP tasks without using any retrieved or generated images. Specifically, we first identify the visually-hungry words (VH-words) in the input text via a token selector, for which we propose three different methods: syntax-, attention-, and learning-based strategies. Then, we adopt a fixed CLIP text encoder to generate visually-augmented representations of these VH-words. Since it has been pre-trained on a large-scale corpus with a vision-language alignment task, the CLIP text encoder is capable of injecting visual semantics into the aligned text representations. Finally, the visually-augmented features are fused and transformed into pre-designed visual prompts based on the VH-words, which can be inserted into PLMs to enrich the visual semantics of word representations. We conduct extensive experiments on ten NLP tasks, i.e., the GLUE benchmark, CommonsenseQA, CommonGen, and SNLI-VE. Experimental results show that our approach consistently improves the performance of BERT, RoBERTa, BART, and T5 at different scales, and significantly outperforms several competitive baselines. Moreover, the visual prompts generated by our framework can also be used for parameter-efficient tuning, which boosts the performance of T5-3B. We will make our code, data, and models publicly available.

1. INTRODUCTION

Recent years have witnessed the success of pre-trained language models (PLMs), such as GPT-3 (Brown et al., 2020) and T5 (Raffel et al., 2020), on a variety of natural language processing (NLP) tasks. Since these PLMs are mostly trained on text-only corpora via self-supervised pre-training, they have been shown to lack visual commonsense (Liu et al., 2022) and real-world knowledge (Zhang et al., 2022). As a result, PLMs cannot well solve visually-related language tasks¹, e.g., answering questions about the color and size of common things, especially those requiring complex commonsense knowledge. To alleviate this problem, existing works mainly enhance PLMs by infusing visual information. Typically, given a text input, these studies first augment the visual information from retrieved or generated images about the input, and then leverage the visual representations of these images to improve PLMs on NLP tasks. This approach leads to visually-augmented pre-trained language models (VaLMs), which adopt either visually-augmented pre-training (Tan & Bansal, 2020; Wang et al., 2022) or visually-augmented fine-tuning (Lu et al., 2022). Despite their effectiveness, these methods have two major shortcomings. First, it is very costly to retrieve or generate high-quality images that are paired with the input. These methods often rely on pre-learned complementary retrievers or generators, and also require time-consuming inference to obtain the images, which largely limits their applicability. Second, since the augmented visual information comes from retrieved or generated images, it inevitably contains irrelevant or redundant visual information. If this information is simply integrated, the original text representations might be affected or even "polluted". Increasing evidence shows that visual information is not always useful for NLP tasks (Dai et al., 2022), and sometimes leads to performance degradation.
Considering these issues, we aim to develop a more efficient and effective way to visually augment PLMs, and our solution is twofold:
• First, we do not explicitly produce (retrieve or generate) images, but instead generate visually-aligned representations of the text on-the-fly. Recent studies (Radford et al., 2021; Jia et al., 2021) have shown that vision-language pre-trained models (VL-PTMs) can learn the alignment between the representations of texts and images from large-scale text-image pairs. Thus, our idea is to employ the output representations of a VL-PTM's text encoder as a surrogate for the visual representations of related images. This is simple and efficient: we only need to keep the text encoder of a VL-PTM to produce the visually-aligned representations of texts, getting rid of the complicated image retrieval or generation process. It is widely recognized that there is a large semantic gap between different modalities (Liang et al., 2022). Our method can alleviate this issue to some extent, since the visual augmentations are derived from the text representation itself, i.e., visually-aligned text representations from VL-PTMs.
• Second, instead of directly feeding visual augmentations into the PLM, we propose to use the augmented visual information only when it is actually needed. In fact, for the text input of an NLP task, PLMs are not always hungry for visual background knowledge to understand it effectively, especially for visually-irrelevant expressions. Unlike previous works that inject visual information into a text as a whole (Tan & Bansal, 2020; Wang et al., 2022), we identify visually-hungry words (those that require visual knowledge to derive complete semantics) in the text input, and only infuse the visual augmentations through these trigger words.
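The first idea above can be sketched as follows. This is a minimal toy illustration in PyTorch: `FrozenTextEncoder` is a randomly initialized stand-in for a VL-PTM text encoder such as CLIP's (in practice one would load a pre-trained model, e.g., `CLIPTextModel` from the `transformers` library); its frozen outputs play the role of visually-aligned text representations, so no image retrieval or generation is involved.

```python
import torch
import torch.nn as nn

class FrozenTextEncoder(nn.Module):
    """Toy stand-in for a VL-PTM text encoder such as CLIP's.
    Its parameters are frozen; its pooled outputs serve as visually-aligned
    text representations, replacing retrieved/generated image features."""
    def __init__(self, vocab_size=1000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)
        for p in self.parameters():
            p.requires_grad = False  # the encoder stays fixed throughout

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> one pooled vector per input text
        return self.proj(self.embed(token_ids)).mean(dim=1)

encoder = FrozenTextEncoder()
token_ids = torch.randint(0, 1000, (2, 4))  # two inputs, 4 tokens each
visual_feats = encoder(token_ids)           # visually-aligned surrogates
print(visual_feats.shape)  # torch.Size([2, 512])
```

Because the encoder is frozen, producing these surrogate visual features costs a single forward pass per input, with no retriever, generator, or image pipeline in the loop.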
We conduct visual augmentation at the word level because it is more flexible and controllable, considering that the augmented information is often irrelevant or noisy. To this end, in this paper, we propose a general Visually-Augmented fine-tuning approach to improving PLMs for NLP tasks Without Images, namely VAWI. Our approach consists of three ingredients: visually-hungry words extraction, visual knowledge augmentation, and visually-enhanced fine-tuning. Given the text input of an NLP task, we first extract the visually-hungry words (VH-words) from the input sentence. As annotations of VH-words are generally unavailable, we propose three strategies to extract them automatically, relying on syntax trees, attention distributions of the VL-PTM's text encoder, and an adaptive learnable module, respectively. Then, based on the extracted VH-words, we leverage the text encoder of CLIP (Radford et al., 2021), a VL-PTM pre-trained on millions of text-image pairs (kept fixed in our approach), to encode the VH-words and obtain their visually-aligned representations. Finally, we design visually-enhanced fine-tuning strategies to infuse the visually-aligned representations into PLMs. For small PLMs, we directly incorporate these visual representations to enrich the word embeddings and fine-tune the parameters of both the PLM and our approach. For large-scale PLMs, we also propose a parameter-efficient prompt-tuning strategy that only tunes very few parameters in our approach, with the PLM frozen. In summary, our approach provides an adaptive, flexible, and efficient way to leverage visual information for enhancing text-based PLMs. To verify the effectiveness of our framework VAWI, we test it on four PLMs (i.e., BERT, BART, RoBERTa, and T5) at different scales (i.e., 110M, 340M, 3B), and conduct extensive experiments on natural language understanding, commonsense reasoning, and text generation tasks.
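The pipeline above can be illustrated with a small sketch, under assumptions: the syntax-based VH-word selector is approximated here by a hand-rolled heuristic over (word, POS) pairs that we supply directly (a real implementation would obtain tags from a parser such as spaCy), and `VisualPromptFusion` shows one simple fusion choice — projecting CLIP-dimensional features into the PLM's embedding space and adding them at VH-word positions. The class name and the additive fusion are our illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

# Heuristic: treat content words (nouns, adjectives, numerals) as
# visually hungry; function words are left untouched.
VH_POS = {"NOUN", "ADJ", "NUM"}

def select_vh_words(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs -> indices of VH-words."""
    return [i for i, (_, pos) in enumerate(tagged_tokens) if pos in VH_POS]

class VisualPromptFusion(nn.Module):
    """Projects visually-aligned features (CLIP dimension) into the PLM's
    embedding space and adds them to the word embeddings at VH-word
    positions -- one simple way to infuse visual semantics."""
    def __init__(self, clip_dim=512, plm_dim=768):
        super().__init__()
        self.proj = nn.Linear(clip_dim, plm_dim)

    def forward(self, word_embeds, vh_indices, vh_visual_feats):
        fused = word_embeds.clone()
        fused[vh_indices] = fused[vh_indices] + self.proj(vh_visual_feats)
        return fused

tagged = [("the", "DET"), ("red", "ADJ"), ("apple", "NOUN"), ("fell", "VERB")]
idx = select_vh_words(tagged)
print(idx)  # [1, 2] -- "red" and "apple" are selected

fusion = VisualPromptFusion()
word_embeds = torch.zeros(4, 768)             # PLM embeddings, one per token
vh_feats = torch.randn(len(idx), 512)         # visually-aligned features
out = fusion(word_embeds, torch.tensor(idx), vh_feats)
print(out.shape)  # torch.Size([4, 768])
```

Only the rows at VH-word positions are modified, mirroring the paper's principle of augmenting through trigger words rather than the whole input; in the parameter-efficient setting, the projection layer would be among the few trainable parameters while the PLM stays frozen.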
Experimental results show that VAWI can significantly boost the performance of these PLMs, e.g., 3.11%, 2.54%, and 2.16% absolute improvements on the CommonsenseQA task using RoBERTa-base, RoBERTa-large, and T5-3B, respectively. Besides, VAWI outperforms (or is on par with) several competitive baselines that adopt complicated visually-augmented methods. Additionally, VAWI further improves the performance of a VL-PTM (i.e., ALBEF (Li et al., 2021)) on a cross-modal reasoning task by enhancing its text encoder.



¹ In this work, we mainly consider text-only NLP tasks that may need visual information as a complement, rather than vision-language tasks with images.

