THE RIGHT LOSSES FOR THE RIGHT GAINS: IMPROVING THE SEMANTIC CONSISTENCY OF DEEP TEXT-TO-IMAGE GENERATION WITH DISTRIBUTION-SENSITIVE LOSSES

Anonymous

Abstract

One of the major challenges in training deep neural networks for text-to-image generation is the significant linguistic discrepancy among the ground-truth captions of each image in most popular datasets. The large differences in word choice across such captions result in synthesized images that are semantically dissimilar to each other and to their ground-truth counterparts. Moreover, existing models either fail to generate the fine-grained details of the image or require a huge number of parameters, which renders them inefficient for text-to-image synthesis. To fill this gap in the literature, we propose a contrastive learning approach with a novel combination of two loss functions: a fake-to-fake loss that increases the semantic consistency between images generated from the same caption, and a fake-to-real loss that reduces the gap between the distributions of real and generated images. We test this approach on two baseline models: SSAGAN and AttnGAN (augmented with style blocks to enhance the fine-grained details of the images). Results show that our approach improves the qualitative results of AttnGAN with style blocks on the CUB dataset. Additionally, on the challenging COCO dataset, our approach achieves competitive results against the state-of-the-art Lafite model and outperforms the FID scores of SSAGAN and DALL-E by 44% and 66.83%, respectively, with only around 1% of the model size and training data of the huge DALL-E model.

1. INTRODUCTION

The main aim of the text-to-image generation (T2I) problem is to synthesize high-quality, photo-realistic images that semantically reflect input textual descriptions. It is a challenging computer vision problem with many applications, including computer-aided design, image editing, and art generation. Most recent attempts at this problem utilize Generative Adversarial Networks (GANs) as the backbone model. Text-conditioned GANs have proven to be a powerful method for generating high-quality images that are semantically consistent with input captions. In practice, however, such models generate images that differ significantly from the ground truth. This is because, in most datasets, each image has several human-written captions that are highly diverse in both content and structure. Models must also learn to understand two domains: the textual description and the visual one. Most models in the literature either lack detail in the generated images or generate fine details that match the textual description less accurately. The former problem arises from employing a loss function that ensures sentence- and word-level matching between the image and the text, such as the DAMSM loss (Xu et al., 2018; Ye et al., 2021; Zhu et al., 2019), without designing a generator capable of producing details in the image at both fine- and coarse-grained levels. This leads to mostly washed-out images that have the general structure matching the text (e.g., a bird with red wings) but lack the details that make the image look real, such as the feathers or the color of the eyes. Moreover, these models mostly use multiple generators (Xu et al., 2018; Zhu et al., 2019; Qiao et al., 2019b), which can exacerbate the aforementioned problem if the progressive growing of details does not ensure that image features are learned well at each step.
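To make the sentence- and word-level matching idea concrete, the core of a DAMSM-style score can be sketched as follows. This is a simplified, illustrative numpy sketch, not the exact AttnGAN formulation: the function names, the single-scale region grid, and the sharpening factor `gamma1` are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    m = x.max(axis=axis, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=axis, keepdims=True)

def word_region_score(words, regions, gamma1=5.0):
    """Attention-weighted word-to-region matching score (DAMSM-like, simplified).

    words:   (T, D) word embeddings of one caption.
    regions: (R, D) visual features of one image's sub-regions.
    Each word attends over the image regions; the score sums the
    cosine similarity between every word and its attended visual context.
    """
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    attn = softmax(gamma1 * (w @ r.T), axis=1)  # (T, R): word -> region attention
    context = attn @ r                          # (T, D): visual context per word
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    return float((w * c).sum())                 # sum of per-word cosine matches
```

In DAMSM proper, such scores are turned into a posterior over image-caption pairs in a batch and trained with cross-entropy; the point here is only that a higher score indicates a tighter word-level match, with no constraint on how the generator renders fine detail.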
This happens when the early generated image is poor, affecting the later stages of the generator network. Another problem with T2I is controllability and forcing the distribution of the generated images to be similar to that of the real ones. In general, we want similar textual descriptions to yield similar image features, and slight changes in the text to produce corresponding changes in the image without altering irrelevant features. Moreover, we want to push the generated images to look more like the real ones for all captions of the same real image. Two approaches can be applied simultaneously to enforce these constraints: adding proper loss functions and changing the architecture of the generator, as in Hu et al. (2021). This work tackles these two problems by studying two directions. The first is using a style-based generator with the blocks of StyleGAN (Karras et al., 2020), either as a traditional style generator or by fusing the style blocks with another architecture such as AttnGAN, to show how they improve an existing architecture. The aim of the style blocks is to produce good fine- and coarse-grained features and to give controllability over generation, as they do in their traditional use for unconditional image generation. The second direction is to introduce contrastive learning in two flavors: fake-to-real contrastive learning and fake-to-fake contrastive learning. The purpose of these loss components is to force the fake image distribution to be close to the real one and to maximize the similarity between fake images generated for similar or identical captions. Experimentation with both directions leads to very promising results.
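The two contrastive flavors above can be sketched as a symmetric InfoNCE loss applied to image embeddings. This is a hedged numpy sketch under assumed names: `nt_xent`, `t2i_contrastive_loss`, the temperature `tau`, and the weights `lam_ff`/`lam_fr` are illustrative, not the paper's exact formulation.

```python
import numpy as np

def _logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def nt_xent(a, b, tau=0.1):
    """Symmetric InfoNCE over two batches of embeddings.

    a, b: (N, D) arrays; row i of `a` is the positive pair of row i of `b`,
    and all other rows in the batch act as negatives.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / tau                       # (N, N) scaled cosine similarities
    log_p_ab = logits - _logsumexp(logits, axis=1)     # classify b_i among all b
    log_p_ba = logits.T - _logsumexp(logits.T, axis=1)  # and a_i among all a
    n = a.shape[0]
    return -(np.trace(log_p_ab) + np.trace(log_p_ba)) / (2 * n)

def t2i_contrastive_loss(fake_a, fake_b, real, lam_ff=1.0, lam_fr=1.0):
    """fake-to-fake pulls two generations of the same caption together;
    fake-to-real pulls each generation toward its ground-truth image."""
    return lam_ff * nt_xent(fake_a, fake_b) + lam_fr * nt_xent(fake_a, real)
```

The loss is small when paired embeddings are close and unpaired ones are far, so minimizing it simultaneously enforces fake-to-fake consistency and fake-to-real distribution matching.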
Adding the style component evidently increases the quality of the generated images but produces low variability, while adding the two flavors of contrastive losses gives a significant increase in the visual quality of the generated images as well as in the quality metrics (e.g., FID) when applied to the SSAGAN network (Hu et al., 2021), making it better than the reported state-of-the-art models on the CUB birds dataset.

2. RELATED WORK

2.1 TEXT-TO-IMAGE GENERATION

Great progress has recently been achieved in text-to-image generation by a large number of promising studies (Reed et al., 2016; Zhang et al., 2017; 2018a; Xu et al., 2018; Hong et al., 2018; Zhang et al., 2018b; Qiao et al., 2019b; Zhu et al., 2019; Yin et al., 2019; Li et al., 2019b; Tao et al., 2020; Qiao et al., 2019a; Li et al., 2019a; Cha et al., 2019; Hinz et al., 2019; El-Nouby et al., 2019; Liang et al., 2020; Cheng et al., 2020; Ramesh et al., 2021; Zhang et al., 2021), most of which employ GANs as the backbone model. In this section, we provide a summary of some of the most prominent and relevant models. AttnGAN (Xu et al., 2018) utilizes attention to compute the similarity between the synthesized images and their corresponding captions using the Deep Attentional Multimodal Similarity Model (DAMSM) loss. In this method, both sentence- and word-level information are used to compute the DAMSM loss. The stacked GAN architecture, proposed by Zhang et al. (2017), generates images incrementally from low resolution to high resolution. DM-GAN (Zhu et al., 2019) generates high-quality images using a dynamic memory GAN that refines the initially generated images. It employs a memory writing gate to give more weight to relevant words and a response gate to enhance image representations accordingly. SD-GAN (Yin et al., 2019) uses a Siamese structure that takes a pair of captions as input and employs a contrastive loss to train the model.
For fine-grained image generation, SD-GAN adopts conditional batch normalization. Contrastive loss is also utilized in XMC-GAN (Zhang et al., 2021), but, unlike SD-GAN, it reduces training complexity by not requiring mining for informative negatives. CP-GAN (Liang et al., 2019) adopts an object-aware image encoder together with a fine-grained discriminator for higher-quality image generation. While it achieves a promising Inception Score (Salimans et al., 2016), it has been shown to perform poorly when evaluated with the stronger FID metric (Heusel et al., 2017). These approaches utilize several generators and discriminators to synthesize images at high resolutions. Other approaches have proposed inferring semantic layouts and explicitly generating different objects in hierarchical models (Hong et al., 2018; Hinz et al., 2019; Koh et al., 2021). In such models, generation is a multi-step process that requires more detailed labels (e.g., segmentation maps and bounding boxes), which represents a significant drawback. To tackle these issues, we employ a single end-to-end trainable generator and discriminator architecture in our model, which generates much higher-quality images.

