WRITE AND PAINT: GENERATIVE VISION-LANGUAGE MODELS ARE UNIFIED MODAL LEARNERS

Abstract

Recent advances in vision-language pre-training have pushed the state of the art on various vision-language tasks, making machines more capable of multi-modal writing (image-to-text generation) and painting (text-to-image generation). However, few studies investigate whether these two essential capabilities can be learned together and boost each other, yielding a versatile and powerful multi-modal foundation model. In this work, we disclose the potential of symmetric generative vision-language pre-training for learning to write and paint concurrently, and propose a new unified modal model, named DAVINCI, trained with prefix language modeling and prefix image modeling, a simple generative self-supervised objective on image-text pairs. Thanks to the proposed prefix multi-modal modeling framework, DAVINCI is simple to train, scalable to huge data, adaptable to both writing and painting tasks, and also strong on other vision, text, and multi-modal understanding tasks. DAVINCI achieves competitive performance on a wide range of 27 generation/understanding tasks and demonstrates the superiority of combining vision/language generative pre-training. Furthermore, we carefully benchmark different vision-language pre-training objectives on pre-training datasets of different scales with heterogeneous and broad distribution coverage. Our results demonstrate the potential of exploiting self-supervision in both language and vision inputs, and establish new, stronger baselines for future comparisons at different data scales.

1. INTRODUCTION

Self-supervised language model pre-training (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020; Raffel et al., 2020; Brown et al., 2020; Fu et al., 2022; Zhou et al., 2021b; Diao et al., 2020; 2021; Zhou et al., 2021a; Xu et al., 2020; Zhou et al., 2020; 2022a; Pan et al., 2022; Diao et al., 2023) has reshaped the landscape of modern natural language processing (NLP) research, pushing the state of the art on a wide range of NLP tasks. Recently, this success has been transferred to the multi-modal context, resulting in a number of vision-language pre-trained models (VLMs) (Lu et al., 2019; Tan & Bansal, 2019a) that achieve state-of-the-art results on various vision-language tasks. Most existing VLMs are BERT-like Transformer (Vaswani et al., 2017) encoders pre-trained with a combination of different vision-language pre-training (VLP) objectives: masked multi-modal modeling (Lu et al., 2019; Tan & Bansal, 2019b; Chen et al., 2020; Li et al., 2020), multi-modal alignment prediction (Lu et al., 2019; Tan & Bansal, 2019b; Chen et al., 2020; Li et al., 2020), region-of-interest feature regression (Tan & Bansal, 2019b), and image-text matching (Li et al., 2021; Zeng et al., 2021), to name a few. However, the roadmap towards large language models reveals two transition patterns: from encoder-only models like BERT (Devlin et al., 2019) / RoBERTa (Liu et al., 2019) to sequence-to-sequence models like T5 (Raffel et al., 2020) / BART (Lewis et al., 2020) and autoregressive models like GPT-3 (Brown et al., 2020) / PaLM (Chowdhery et al., 2022), which tackle more tasks in a unified way; and from complicated objectives like masked language modeling / next sentence prediction / replaced token detection to a simple language modeling objective, which improves the scalability of pre-training. This suggests that the generative pre-training paradigm with simple targets shows great potential for pre-training more scalable and general VLMs.
To this end, several recent studies (Cho et al., 2021; Zhang et al., 2021a; Wang et al., 2021b; 2022) investigated sequence-to-sequence (seq2seq) vision-language pre-training and achieved state-of-the-art results on a range of vision-language understanding and generation tasks. For example, VL-T5 (Cho et al., 2021), OFA (Wang et al., 2022) and PaLI (Chen et al., 2022) formulate various vision-and-language problems as seq2seq tasks and pre-train a seq2seq VLM by multi-tasking on these tasks. In addition, ERNIE-ViLG (Zhang et al., 2021a) and SimVLM (Wang et al., 2021b) pre-train seq2seq VLMs with a simple language modeling or prefix language modeling objective on a large number of image-caption pairs. While achieving promising results, these objectives are not versatile enough, resulting in VLMs that are only capable of a subset of tasks in the image-text modalities. On the other hand, the recent success of generative language pre-training (Brown et al., 2020) and generative vision pre-training (He et al., 2022; Bao et al., 2021) motivates us to explore generative vision-language pre-training to learn more versatile and scalable vision-language models. In this work, we introduce prefix multi-modal modeling, a unified generative pre-training framework that extends prefix language modeling to the multi-modal context and learns a multi-modal foundation model by learning to write and paint simultaneously. As illustrated in Figure 1, given an image-caption pair, we split the image and caption each into two parts, denoted as prefix and suffix. To make prefix image modeling compatible with the seq2seq formulation of conventional prefix language modeling, we follow DALL-E (Ramesh et al., 2021) and convert images into discrete sequences of image tokens (van den Oord et al., 2017). We then train the model to generate the suffix in one modality based on the prefix in the same modality and the complete input in the other modality.
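The prefix/suffix construction above can be sketched as follows. This is a minimal illustration, not the actual DAVINCI implementation: the split-point sampling, special tokens, and exact input layout are assumptions, and `image_tokens` is presumed to come from a discrete image tokenizer as described in the text.

```python
import random

def make_pmm_examples(image_tokens, text_tokens, rng=random):
    """Build the two training examples of prefix multi-modal modeling
    from one image-caption pair (a simplified sketch)."""
    # Sample a split point in each modality's token sequence,
    # dividing it into a prefix and a suffix.
    i = rng.randint(1, len(image_tokens) - 1)
    t = rng.randint(1, len(text_tokens) - 1)

    # Prefix language modeling: condition on the complete image plus
    # the text prefix, and predict the text suffix ("writing").
    plm = {"input": image_tokens + text_tokens[:t],
           "target": text_tokens[t:]}

    # Prefix image modeling: condition on the complete caption plus
    # the image-token prefix, and predict the image-token suffix
    # ("painting").
    pim = {"input": text_tokens + image_tokens[:i],
           "target": image_tokens[i:]}
    return plm, pim
```

Both examples can be fed to the same seq2seq model, so a single generative objective covers image-to-text and text-to-image generation.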
In this way, prefix multi-modal modeling can fully exploit self-supervision from large-scale image-caption pairs by learning to write and paint simultaneously. We pre-train DAVINCI¹, a vision-language foundation model, with the proposed prefix multi-modal modeling framework on large-scale image-text pairs. DAVINCI is the first self-supervised vision-language foundation model that is versatile for all kinds of tasks in the vision-and-language modalities, including image-to-text generation, text-to-image generation, vision-language understanding, and single-modal language / vision tasks. DAVINCI consistently outperforms FLAVA (Singh et al., 2021), an existing vision-language foundation model, on language, vision, and multi-modal tasks, and performs competitively with state-of-the-art models across a wide range of tasks and modalities. Moreover, DAVINCI also shows strong few-shot and zero-shot image/text generation capability. In addition, most existing VLMs are pre-trained with mixed pre-training objectives and data sources varying in size, making it difficult to disentangle the impact of pre-training objectives and data sources on downstream tasks. To this end, we conduct a systematic analysis of the performance of generative vision-language pre-training by carefully ablating different pre-training objectives, such as prefix language / image modeling, and the amount of pre-training data of different qualities, revealing the impact of different objectives and data sources to facilitate future research. To summarize, our contribution is three-fold: (1) We introduce prefix multi-modal modeling, a simple unified generative vision-language pre-training framework that is scalable for large-scale pre-training and versatile for image-to-text generation, text-to-image generation, and various multi-modal / single-modal understanding tasks.
(2) We pre-train DAVINCI, a vision-language foundation model, with the proposed approach, demonstrating competitive performance on a wide range of 27 downstream tasks and the superiority of combining vision/language generative pre-training. (3) We conduct an analysis of the impact of different pre-training data sources and pre-training objectives on the performance of seq2seq VLMs.

2. RELATED WORK

Inspired by the success of language model pre-training, several studies investigated vision-language pre-training on large-scale image-caption pairs. ViLBERT (Lu et al., 2019) and LXMERT (Tan & Bansal, 2019b) first proposed to extract visual object features with an external object detection model like Fast R-CNN (Girshick, 2015), feed the image features together with texts into Transformer



¹ Named after the Italian polymath Leonardo da Vinci, who displayed infinite grace in everything. We noticed that this name is also used in GPT-3 versioning. However, we think there is no conflict because there it is only a suffix for a specific checkpoint of the GPT-3 family.

