WRITE AND PAINT: GENERATIVE VISION-LANGUAGE MODELS ARE UNIFIED MODAL LEARNERS

Abstract

Recent advances in vision-language pre-training have pushed the state-of-the-art on various vision-language tasks, making machines more capable of multi-modal writing (image-to-text generation) and painting (text-to-image generation). However, few studies investigate whether these two essential capabilities can be learned together and boost each other to yield a versatile and powerful multi-modal foundation model. In this work, we reveal the potential of symmetric generative vision-language pre-training for learning to write and paint concurrently, and propose a new unified modal model, named DAVINCI, trained with prefix language modeling and prefix image modeling, two simple generative self-supervised objectives on image-text pairs. Thanks to the proposed prefix multi-modal modeling framework, DAVINCI is simple to train, scalable to huge data, adaptable to both writing and painting tasks, and also strong on other vision, text, and multi-modal understanding tasks. DAVINCI achieves competitive performance across a wide range of 27 generation/understanding tasks and demonstrates the benefit of combining generative pre-training in vision and language. Furthermore, we carefully benchmark different vision-language pre-training objectives on pre-training datasets of different scales with heterogeneous and broad distribution coverage. Our results demonstrate the potential of exploiting self-supervision in both language and vision inputs, and establish new, stronger baselines for future comparisons at different data scales.
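To make the pre-training recipe concrete, the following is a minimal PyTorch sketch of the two prefix objectives, not the released DAVINCI implementation: it assumes the image has already been mapped to discrete visual tokens by an image tokenizer and the caption to subword ids, treats `model` as a generic autoregressive Transformer over a joint image-text vocabulary, and omits details such as the sampled prefix split point and loss weighting.

```python
# Minimal sketch of prefix language modeling (write) and prefix image
# modeling (paint); the interfaces below are assumptions, not DAVINCI's API.
import torch
import torch.nn.functional as F

def prefix_losses(model, image_tokens, text_tokens):
    """image_tokens: LongTensor [B, Li] of discrete visual tokens (Li >= 1).
    text_tokens:  LongTensor [B, Lt] of caption subword ids (Lt >= 1).
    model: autoregressive Transformer mapping a token sequence [B, L]
           to next-token logits [B, L, V] over the joint vocabulary.
    """
    def suffix_nll(prefix, suffix):
        # Condition on the prefix modality and autoregressively predict
        # every token of the suffix modality with a causal decoder.
        seq = torch.cat([prefix, suffix], dim=1)
        logits = model(seq[:, :-1])                      # position t predicts t+1
        suffix_logits = logits[:, prefix.size(1) - 1:]   # logits over suffix tokens
        return F.cross_entropy(
            suffix_logits.reshape(-1, suffix_logits.size(-1)),
            suffix.reshape(-1))

    plm = suffix_nll(image_tokens, text_tokens)   # write: image -> text
    pim = suffix_nll(text_tokens, image_tokens)   # paint: text -> image
    return plm + pim
```

Under this framing, writing and painting are the same next-token prediction problem with the roles of the two modalities swapped, which is what makes the pre-training symmetric; generated image tokens can then be decoded back to pixels by the tokenizer's decoder.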

1. INTRODUCTION

Self-supervised language model pre-training (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020; Raffel et al., 2020; Brown et al., 2020; Fu et al., 2022; Zhou et al., 2021b; Diao et al., 2020; 2021; Zhou et al., 2021a; Xu et al., 2020; Zhou et al., 2020; 2022a; Pan et al., 2022; Diao et al., 2023) has reshaped the landscape of modern natural language processing (NLP) research, pushing the state-of-the-art on a wide range of NLP tasks. Recently, this success has been transferred to the multi-modal context, resulting in a number of vision-language pre-trained models (VLMs) (Lu et al., 2019; Tan & Bansal, 2019a) that achieve state-of-the-art results on various vision-language tasks. Most existing VLMs are BERT-like Transformer (Vaswani et al., 2017) encoders pre-trained with a combination of different vision-language pre-training (VLP) objectives: masked multi-modal modeling (Lu et al., 2019; Tan & Bansal, 2019b; Chen et al., 2020; Li et al., 2020), multi-modal alignment prediction (Lu et al., 2019; Tan & Bansal, 2019b; Chen et al., 2020; Li et al., 2020), region-of-interest feature regression (Tan & Bansal, 2019b), and image-text matching (Li et al., 2021; Zeng et al., 2021), to name a few. However, the roadmap towards large language models reveals a transition pattern from encoder-only models like BERT (Devlin et al., 2019) / RoBERTa (Liu et al., 2019) to sequence-to-sequence models like T5 (Raffel et al., 2020) / BART (Lewis et al., 2020) and autoregressive models like GPT-3 (Brown et al., 2020) / PaLM (Chowdhery et al., 2022) to tackle

