CONNECTING REPRESENTATION AND GENERATION VIA MASKED VISION-LANGUAGE TRANSFORMER

Abstract

Recently, there has been great progress in the self-supervised pre-training of multimodal representation models that jointly understand image and language. One particularly popular application of such models is text-to-image generation, which is typically implemented via a two-stage process: in the first stage, a representation model is trained via self-supervised objectives; in the second stage, a conditional generative decoder is trained on top of the representation to generate natural images. In this work, we aim to bring representation learning and conditional generation together by unifying the two stages into a single model and training objective. We present UPGen, a unified pre-trained model for both representation learning and generation. UPGen is trained with a simple masked token prediction objective on a flexible mixture of image and language data. We use a pre-trained VQGAN image tokenizer to convert images into discrete tokens, then train a masked token prediction model on both paired image-text datasets and unpaired language datasets, using randomly sampled mask ratios. We show that this masked token prediction model can be directly used to generate images and language by iteratively re-masking and predicting the masked tokens. We demonstrate empirically that UPGen serves as both a good representation learning model and a generative model for both image and language.

1. INTRODUCTION

With the rapid improvement of deep learning architectures and accelerator hardware, researchers have made significant progress in self-supervised representation learning from image and text data (Radford et al., 2021; Geng et al., 2022; Mu et al., 2021; Wang et al., 2022). Such models are trained to jointly understand language and image data and learn generalizable representations that transfer across image and language modalities, and thus can be applied to a wide variety of downstream tasks. One task that has gained popularity recently is text-to-image generation, where the model is given a text prompt and generates an image corresponding to the prompt's description (Saharia et al., 2022; Ramesh et al., 2021; Yu et al., 2022). This task is particularly attractive because it enables a human to directly interact with the model and inspect its understanding of language and images, providing a powerful tool for artistic creation.

Driven by the successes of representation learning, text-to-image models have achieved impressive progress (see, e.g., Saharia et al., 2022; Ramesh et al., 2021; Yu et al., 2022). However, these models share a common drawback: the pipeline consists of two stages. In the first stage, a representation model is trained with self-supervised pre-training objectives. In the second stage, a diffusion (Saharia et al., 2022; Ramesh et al., 2021) or autoregressive (Yu et al., 2022) generative model is trained conditioned on the (typically frozen) pre-trained representation. Such a two-stage training pipeline requires more hyperparameters to be tuned and introduces extra complexity in developing models.

To address the above limitations, we propose UPGen, a simple but effective framework that unifies pre-training for representation and generation. Using a pre-trained VQGAN model (Esser et al., 2021), we convert an image into a sequence of discrete tokens and concatenate the image tokens and language tokens into a single sequence. We then train an encoder-only transformer model with a simple masked token prediction objective on the concatenated sequence. The masked token prediction objective produces good representations for downstream tasks. Furthermore, we show that UPGen can be directly used for conditional and unconditional generation of images and language without training extra components. To achieve this, we apply an iterative refinement strategy that repeatedly re-masks and regenerates the masked tokens, following the approach of MaskGIT (Chang et al., 2022).

Figure 1: UPGen consists of a pre-trained VQGAN tokenizer that converts images into discrete tokens and an encoder-only transformer that processes the image tokens and language tokens jointly. The concatenated image and language sequence is randomly masked according to a uniformly sampled ratio, and the transformer is trained to predict the masked tokens.

In this work, we provide a large-scale empirical study of UPGen on a mixture of paired image-text datasets and unpaired language datasets. We find that UPGen learns generalizable representations that transfer to a wide variety of tasks, including image classification, text-guided image inpainting, and image-to-text generation. We also demonstrate that UPGen can generate high-quality images both with and without conditioning on language prompts.
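To make the training objective concrete, the following is a minimal sketch of one masked token prediction step, assuming a generic encoder-only transformer that maps a token sequence to per-position logits. The names `MASK_ID`, `VOCAB_SIZE`, and `masked_prediction_loss` are illustrative placeholders and not part of the paper's actual implementation.

```python
# Minimal sketch of the masked token prediction objective (names are hypothetical).
import torch
import torch.nn.functional as F

MASK_ID = 8192              # assumed id reserved for the [MASK] token
VOCAB_SIZE = 8192 + 30000   # assumed: VQGAN codebook size + text vocabulary


def masked_prediction_loss(transformer, image_tokens, text_tokens):
    """One training step: concatenate image and text tokens, mask a random
    fraction of positions, and predict the masked tokens."""
    tokens = torch.cat([image_tokens, text_tokens], dim=1)      # (B, L)
    B, L = tokens.shape

    # Uniformly sample a mask ratio per example, then mask positions at that rate.
    mask_ratio = torch.rand(B, 1, device=tokens.device)
    mask = torch.rand(B, L, device=tokens.device) < mask_ratio  # (B, L) bool

    inputs = tokens.masked_fill(mask, MASK_ID)
    logits = transformer(inputs)                                # (B, L, VOCAB_SIZE)

    # Cross-entropy computed only on the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```

Because both image tokens and text tokens live in one vocabulary and one sequence, the same loss covers paired image-text data and unpaired language data.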
While achieving competitive results, UPGen does not perform as well as state-of-the-art methods on each of the downstream tasks, as it is trained on much smaller datasets. However, to the best of our knowledge, UPGen is the first model that combines representation learning, image-to-text generation, text-to-image generation, and unconditional image generation into a single model and training objective. Scaling UPGen to larger datasets and model sizes is left as promising future work.
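The iterative refinement generation procedure described above can be sketched as follows. This is a MaskGIT-style illustration under assumed design choices (a cosine re-masking schedule and greedy per-position selection); the function name and schedule are hypothetical, not the exact procedure used by UPGen.

```python
# Sketch of iterative re-masking generation (MaskGIT-style; names/schedule are assumptions).
import math
import torch


@torch.no_grad()
def iterative_generate(transformer, cond_tokens, num_image_tokens, mask_id, steps=12):
    """Generate image tokens, optionally conditioned on text tokens, by
    repeatedly predicting all masked positions and re-masking the least
    confident predictions on a decaying schedule."""
    B = cond_tokens.shape[0]
    device = cond_tokens.device
    image_tokens = torch.full(
        (B, num_image_tokens), mask_id, dtype=torch.long, device=device
    )

    for step in range(steps):
        logits = transformer(torch.cat([image_tokens, cond_tokens], dim=1))
        logits = logits[:, :num_image_tokens]            # keep image positions only
        confidence, sampled = logits.softmax(dim=-1).max(dim=-1)

        # Fill currently masked positions with the model's predictions.
        still_masked = image_tokens == mask_id
        image_tokens = torch.where(still_masked, sampled, image_tokens)

        # Cosine schedule: decide how many tokens to re-mask for the next step.
        frac_to_mask = math.cos(math.pi / 2 * (step + 1) / steps)
        num_to_mask = int(frac_to_mask * num_image_tokens)
        if num_to_mask == 0:
            break

        # Never re-mask tokens that were fixed in earlier steps.
        confidence = confidence.masked_fill(~still_masked, float("inf"))
        remask_idx = confidence.topk(num_to_mask, dim=-1, largest=False).indices
        image_tokens.scatter_(1, remask_idx, mask_id)

    return image_tokens
```

For unconditional generation, `cond_tokens` can simply be an empty sequence, so the same loop serves both conditional and unconditional sampling.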

2. RELATED WORKS

Self-supervised learning via masked modeling. Ever since the introduction of Transformers (Vaswani et al., 2017), self-supervised pre-training has made significant progress in recent years. A particularly popular style of self-supervised learning is masked modeling, where the input example is partially masked and the model is trained to predict the masked part from the unmasked part. Masked modeling first saw success in natural language processing (NLP), with large language models like BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), T5 (Raffel et al., 2020), and UL2 (Tay et al., 2022) that learn highly generalizable representations transferring well to a wide variety of downstream tasks. Inspired by the effectiveness of these NLP models, researchers have brought masked modeling ideas to computer vision, also with great success. Vision Transformers (Dosovitskiy et al., 2020) introduce the transformer architecture to computer vision and draw a natural analogy between language tokens and patches of natural images. Building on this work, BEiT (Bao et al., 2021) applies masked token prediction to vision transformers, taking image patches as input and predicting the discrete tokens produced by a VQ-VAE (Van Den Oord et al., 2017) model. Recently, MAE (He et al., 2022) further removes the need for a discrete image tokenizer by directly predicting patch pixel values using an encoder-decoder architecture. Our UPGen first tokenizes the image into discrete tokens using a pre-trained VQGAN (Esser et al.,

