THE PLUG AND PLAY OF LANGUAGE MODELS FOR TEXT-TO-IMAGE GENERATION

Anonymous

Abstract

Text-to-image (T2I) models enable controllable image generation through user-provided captions. A text encoder is typically used to map captions to a latent space, and it has been shown to be critical for the model's performance. However, replacing or upgrading the text encoder in a T2I model is challenging due to the tight coupling between the current encoder and the image decoder. It requires training the model from scratch, which can be prohibitively expensive. To address this problem, we introduce a more efficient approach to align a pre-trained language model with the latent space of an existing T2I model. We propose a Model Translation Network (MTN) and a new training objective to align the representation spaces of the two text encoders using only a corpus of unlabeled text. We empirically find that MTN can be trained efficiently and can boost the performance of existing T2I models by upgrading their text encoder. Moreover, we find that MTN can align multilingual language models such as XLM-Roberta, thus allowing existing T2I models to generate high-quality images from captions beyond English.

1. INTRODUCTION

Text-to-image (T2I) generative models have made great progress in the last few years thanks to algorithmic advances and the availability of large-scale paired training datasets (Ramesh et al., 2022; Yu et al., 2022a; Saharia et al., 2022; Rombach et al., 2022). Diffusion-based T2I models in particular have achieved remarkable image quality (Ho & Salimans, 2022; Nichol et al., 2021). Despite these strong results, controllable generation remains challenging: generated images are often not faithful to the captions, compositional capabilities are lacking, and prompt engineering is often required to achieve the desired results (Parsons, 2022). Moreover, most large-scale models have been trained only on English captions, greatly limiting their use across the world.

To improve T2I models, recent research suggests that the ability of their text encoders to understand and represent text is critical and is a bottleneck for image generation performance (Saharia et al., 2022; Croitoru et al., 2022). Text encoders in existing T2I models are often trained only on short image captions, and generation quality on complex prompts is largely constrained by the quality of the extracted text features (Rombach et al., 2022). However, upgrading the text encoder of an existing T2I model is challenging because the representation spaces of the text encoder and the image generator are tightly coupled (Rombach et al., 2022; Ramesh et al., 2022). Training only the text encoder on a more complex and representative text corpus would break this alignment, hindering the final image generation performance, while training the entire T2I model from scratch (perhaps with higher-quality image-caption pairs) would be prohibitively expensive (Edwards, 2022).¹ To solve this problem, we propose a method that can efficiently align off-the-shelf pre-trained language models with the image generators of existing diffusion-based T2I models.
With this method, the existing text encoder of a T2I model can be replaced with a more powerful language model, or even one for a language other than English, as illustrated in Fig. 1. Crucially, the representation alignment between the text and image encoders is maintained without retraining from scratch. We refer to our proposed method as the Model Translation Network (MTN). MTN follows an encoder-decoder structure. Specifically, given a trained T2I model and a pre-trained language model with which we want to replace its text encoder, the encoder of MTN first aligns the representation space of the pre-trained language model with that of the T2I model's image generator by minimizing both element-wise and global discrepancies. The decoder of MTN then takes the aligned text representations as input and maps them back to the original representation space of the pre-trained language model by minimizing a reconstruction loss. The decoder is needed during training because recent research reveals that training to align existing models inevitably decreases feature discriminability (Chen et al., 2019; Cui et al., 2020). We therefore preserve the rich semantics captured by the pre-trained language model during alignment training by ensuring that a decoder can recover the original representation space of the pre-trained language model from the aligned one. The entire training of MTN requires only a corpus of unlabeled text. At inference time, only the encoder of MTN is applied on top of the pre-trained language model for representation alignment.

To verify the effectiveness of the proposed framework, we applied a stronger language model, i.e., T5-3B (Raffel et al., 2020), to upgrade the existing text encoder of the Latent Diffusion Model (Rombach et al., 2022). The improvements in FID score and user-study ranking reveal the benefits of our model over the baselines. Furthermore, our model can bring new functionalities to existing T2I models, such as multilingual text-to-image generation. We empirically find that MTN can align a multilingual language model such as XLM-Roberta-L (Conneau et al., 2019) with existing T2I models, enabling the existing image generator to understand text beyond English, such as French and Chinese, and to generate high-quality images accordingly.

Our contributions can be summarized as follows:
• To the best of our knowledge, this is the first work to consider the problem of efficiently aligning a pre-trained language model with a pre-trained T2I diffusion model.
• Extensive experiments on text-to-image generation benchmarks demonstrate the superiority of our model over the baseline LDM method in both image quality and language controllability.
• Our framework also enables text-to-image generation beyond English prompts without the need for multilingual image-text pairs for retraining.

¹The cost of training a Stable Diffusion model is around 600K USD.

Figure 1: Illustration of our desired modularized T2I generation. With the proposed Model Translation Network (MTN), existing image generators can be bridged to off-the-shelf language models to expand their functionalities, e.g., multilingual generation, within a limited budget.
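The encoder-decoder alignment described above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the layer sizes, the use of two-layer MLPs, plain MSE for all three terms, and mean-pooling for the global discrepancy are all assumptions made for the sketch, since the exact discrepancy measures are specified later in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MTN(nn.Module):
    """Sketch of a Model Translation Network (hypothetical architecture).

    The encoder maps token features of the new language model (dim d_lm)
    into the space expected by the frozen T2I image generator (dim d_t2i);
    the decoder maps them back so the original semantics are preserved.
    """

    def __init__(self, d_lm=1024, d_t2i=768, d_hidden=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_lm, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_t2i))
        self.decoder = nn.Sequential(
            nn.Linear(d_t2i, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_lm))

    def forward(self, h_lm, h_t2i):
        # h_lm:  (B, T, d_lm)  features from the new pre-trained language model
        # h_t2i: (B, T, d_t2i) features from the T2I model's original text encoder
        aligned = self.encoder(h_lm)            # (B, T, d_t2i)
        recon = self.decoder(aligned)           # (B, T, d_lm)
        # Element-wise (per-token) discrepancy against the original encoder
        loss_elem = F.mse_loss(aligned, h_t2i)
        # Global discrepancy between pooled sentence-level representations
        loss_glob = F.mse_loss(aligned.mean(dim=1), h_t2i.mean(dim=1))
        # Reconstruction loss: the aligned features must still encode
        # enough to recover the language model's original representations
        loss_rec = F.mse_loss(recon, h_lm)
        return aligned, loss_elem + loss_glob + loss_rec
```

At inference time only the trained encoder would be kept: the new language model's features are passed through `mtn.encoder` and fed to the frozen image generator in place of the original text encoder's output.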

GAN-based T2I models, such as Stack-GAN (Zhang et al., 2017), Attn-GAN (Xu et al., 2018), and SD-GAN (Yin et al., 2019), have obtained promising results, and recent works have further improved generation quality. DM-GAN (Zhu et al., 2019) improved text-to-image performance by introducing a dynamic memory component. DF-GAN (Tao et al., 2022) designed a fusion module to fuse text and image features. LAFITE (Zhou et al., 2021) took advantage of the CLIP model to construct pseudo image-text pairs and proposed a GAN model to learn from them.

