CONNECTING REPRESENTATION AND GENERATION VIA MASKED VISION-LANGUAGE TRANSFORMER

Abstract

Recently, there has been great progress in the self-supervised pre-training of multimodal representation models that understand image and language jointly. One particularly popular application of such models is text-to-image generation, which is typically built via a two-stage process: in the first stage, a representation model is trained via self-supervised objectives; in the second stage, a conditional generative decoder is trained on top of the representation to generate natural images. In this work, we aim to bring representation learning and conditional generation together by unifying the two stages into a single model and training objective. We present UPGen, a unified pre-trained model for both representation learning and generation. UPGen is trained with a simple masked token prediction objective on a flexible mixture of image and language data. We use a pre-trained VQGAN image tokenizer to convert images into discrete tokens, then train a masked token prediction model on both paired image-text datasets and unpaired language datasets, using randomly sampled mask ratios. We show that this masked token prediction model can be directly used to generate images and language by iteratively re-masking and predicting the masked tokens. We demonstrate empirically that UPGen serves as both a good representation learning model and a good generative model for image and language alike.
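The iterative re-masking generation procedure described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the `MASK` sentinel, the cosine masking schedule, and greedy argmax decoding (real systems typically sample with a temperature) are all assumptions.

```python
import numpy as np

MASK = -1  # sentinel id for masked positions (illustrative assumption)

def iterative_decode(predict_fn, seq_len, num_steps=8):
    """Iterative re-masking generation (sketch): start from a fully masked
    sequence; at each step, predict every masked token, commit the most
    confident predictions, and re-mask the rest for the next round."""
    tokens = np.full(seq_len, MASK, dtype=np.int64)
    for step in range(num_steps):
        logits = predict_fn(tokens)                 # (seq_len, vocab_size)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred = probs.argmax(-1)                     # greedy decode (assumption)
        conf = probs.max(-1)
        conf[tokens != MASK] = np.inf               # committed tokens stay fixed
        # cosine schedule: fraction of positions still masked after this step
        keep_masked = int(seq_len * np.cos(np.pi / 2 * (step + 1) / num_steps))
        tokens = np.where(tokens == MASK, pred, tokens)
        order = np.argsort(conf)                    # least confident first
        tokens[order[:keep_masked]] = MASK          # re-mask low-confidence slots
    return tokens
```

At the final step the cosine schedule reaches zero, so every position is committed and the returned sequence contains no masked tokens.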

1. INTRODUCTION

With the rapid improvement of deep learning architectures and accelerator hardware, researchers have made significant progress in self-supervised representation learning from image and text data (Radford et al., 2021; Geng et al., 2022; Mu et al., 2021; Wang et al., 2022). Such models are trained to jointly understand language and image data and learn generalizable representations that transfer across image and language modalities, and thus can be applied to a wide variety of downstream tasks. One interesting task that has gained popularity recently is text-to-image generation, where the model is given a text prompt and generates an image that corresponds to the prompt's description (Saharia et al., 2022; Ramesh et al., 2021; Yu et al., 2022). This task is particularly attractive because it enables a human to directly interact with the model and inspect its understanding of language and image, providing great tools for artistic creation. Driven by the successes of representation learning, text-to-image models have achieved impressive progress (see e.g., Saharia et al., 2022; Ramesh et al., 2021; Yu et al., 2022). However, these text-to-image models share a drawback: they rely on a two-stage pipeline. In the first stage, a representation model is trained with self-supervised pre-training objectives. In the second stage, a diffusion (Saharia et al., 2022; Ramesh et al., 2021) or autoregressive (Yu et al., 2022) generative model is trained conditioned on the (typically frozen) pre-trained representation. Such a two-stage training pipeline requires more hyperparameters to be tuned and introduces extra complexity in developing models. To address the above limitations, we propose UPGen, a simple but effective framework that unifies pre-training for representation and generation.
Using a pre-trained VQGAN model (Esser et al., 2021), we convert an image into a sequence of discrete tokens and concatenate the image tokens and language tokens into a single sequence. We then train an encoder-only transformer model using a simple masked token prediction objective on the concatenated sequence. The masked token prediction objective produces good representations for downstream tasks. Furthermore, we
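One training step of this objective can be sketched as follows. This is a hedged sketch under stated assumptions: `model_fn` stands in for the encoder-only transformer, `MASK_ID` is an illustrative reserved token id, and the uniform range for the random mask ratio is an assumption rather than the paper's exact choice.

```python
import numpy as np

MASK_ID = 0  # id reserved for the [MASK] token (illustrative assumption)

def masked_prediction_loss(model_fn, image_tokens, text_tokens, rng):
    """One training step of the masked token prediction objective (sketch):
    concatenate image and text tokens, mask a randomly sampled fraction of
    positions, and score cross-entropy only on the masked positions."""
    seq = np.concatenate([image_tokens, text_tokens])
    ratio = rng.uniform(0.1, 1.0)            # randomly sampled mask ratio
    mask = rng.random(seq.shape) < ratio
    mask[rng.integers(len(seq))] = True      # ensure at least one masked position
    corrupted = np.where(mask, MASK_ID, seq)
    logits = model_fn(corrupted)             # (seq_len, vocab_size)
    # log-softmax, then pick the log-probability of each ground-truth token
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(seq)), seq][mask].mean()
```

Because the loss is computed only over the masked positions, the same routine applies unchanged to paired image-text sequences and to text-only sequences, which is what allows training on a flexible mixture of both.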

