VISUALLY-AUGMENTED PRETRAINED LANGUAGE MODELS FOR NLP TASKS WITHOUT IMAGES

Abstract

Although pre-trained language models (PLMs) have shown impressive performance through text-only self-supervised training, they have been found to lack visual semantics or commonsense, e.g., the sizes, shapes, and colors of commonplace objects. Existing solutions often rely on explicit images for visual knowledge augmentation (requiring time-consuming retrieval or generation), and they also conduct the augmentation for the whole input text, without considering whether it is actually needed for specific inputs or tasks. To address these issues, we propose VAWI, a novel visually-augmented fine-tuning approach that can be generally applied to various PLMs and NLP tasks without using any retrieved or generated images. Specifically, we first identify the visually-hungry words (VH-words) in the input text via a token selector, for which we propose three different strategies: syntax-, attention-, and learning-based. Then, we adopt a frozen CLIP text encoder to generate the visually-augmented representations of these VH-words. As this encoder has been pre-trained via a vision-language alignment task on a large-scale corpus, it is capable of injecting visual semantics into the aligned text representations. Finally, the visually-augmented features are fused and transformed into pre-designed visual prompts based on the VH-words, which can be inserted into PLMs to enrich the visual semantics of word representations. We conduct extensive experiments on ten NLP tasks, i.e., the GLUE benchmark, CommonsenseQA, CommonGen, and SNLI-VE. Experimental results show that our approach consistently improves the performance of BERT, RoBERTa, BART, and T5 at different scales, and significantly outperforms several competitive baselines. Besides, the visual prompts generated by our framework can also be used for parameter-efficient tuning, which boosts the performance of T5-3B. We will make our code, data, and models publicly available.
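To make the three-step pipeline concrete, the following is a minimal sketch in NumPy. It is illustrative only: the VH-word selector is reduced to a hand-picked vocabulary lookup (a stand-in for the syntax-, attention-, or learning-based strategies), the frozen CLIP text encoder is replaced by a deterministic stub (`StubCLIPTextEncoder` is hypothetical), and the learned fusion/projection is a single matrix. Dimensions (512 for CLIP, 768 for the PLM) are common defaults, not prescribed by the paper.

```python
import numpy as np

def select_vh_words(tokens, vh_vocab):
    # Stand-in for the token selector: treat words in a small,
    # hand-picked "visually hungry" vocabulary as VH-words.
    return [t for t in tokens if t in vh_vocab]

class StubCLIPTextEncoder:
    # Hypothetical stand-in for the frozen CLIP text encoder:
    # returns a fixed random vector per word instead of real
    # vision-language-aligned features.
    def __init__(self, dim=512, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.cache = {}

    def encode(self, word):
        if word not in self.cache:
            self.cache[word] = self.rng.standard_normal(self.dim)
        return self.cache[word]

def build_visual_prompts(vh_words, encoder, projection):
    # Fuse the CLIP features and project them into the PLM's
    # embedding space: one prompt vector per VH-word, which
    # would then be inserted into the PLM's input sequence.
    feats = np.stack([encoder.encode(w) for w in vh_words])
    return feats @ projection  # shape: (num_vh_words, plm_dim)

tokens = "the red apple sits on a wooden table".split()
vh_vocab = {"red", "apple", "wooden", "table"}
vh_words = select_vh_words(tokens, vh_vocab)

encoder = StubCLIPTextEncoder(dim=512)
projection = np.zeros((512, 768))  # learned in practice; zeros here
prompts = build_visual_prompts(vh_words, encoder, projection)
print(len(vh_words), prompts.shape)  # 4 (4, 768)
```

In the actual method, the projection (and, for the learning-based strategy, the selector) would be trained during fine-tuning while the CLIP text encoder stays fixed.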

1. INTRODUCTION

Recent years have witnessed the success of pre-trained language models (PLMs), such as GPT-3 (Brown et al., 2020) and T5 (Raffel et al., 2020), in a variety of natural language processing (NLP) tasks. Since these PLMs are mostly trained on text-only corpora via self-supervised pre-training, they have been shown to lack visual commonsense (Liu et al., 2022) and real-world knowledge (Zhang et al., 2022). As a result, PLMs cannot solve visually related language tasks¹ well, e.g., answering questions about the color and size of common objects, especially those requiring complex commonsense knowledge. To alleviate this problem, existing works mainly enhance PLMs by infusing visual information. Typically, given a text input, these studies first obtain visual information from retrieved or generated images related to the input, and then leverage the corresponding visual representations to improve PLMs on NLP tasks. Such an approach leads to visually-augmented pre-trained language models (VaLMs), which adopt either visually-augmented pre-training (Tan & Bansal, 2020; Wang et al., 2022) or visually-augmented fine-tuning techniques (Lu et al., 2022). Despite their effectiveness, these methods have two major shortcomings. First, it is very costly to retrieve or generate high-quality images that are paired with the input. These methods often rely on pre-learned complementary retrievers or generators, and also require time-consuming inference to obtain the images, which



¹ In this work, we mainly consider text-only NLP tasks that may need visual information as a complement, not vision-language tasks with images.

