THE RIGHT LOSSES FOR THE RIGHT GAINS: IMPROVING THE SEMANTIC CONSISTENCY OF DEEP TEXT-TO-IMAGE GENERATION WITH DISTRIBUTION-SENSITIVE LOSSES

Anonymous

Abstract

One of the major challenges in training deep neural networks for text-to-image generation is the significant linguistic discrepancy among the ground-truth captions of each image in most popular datasets. The large differences in word choice across such captions result in synthesized images that are semantically dissimilar to each other and to their ground-truth counterparts. Moreover, existing models either fail to generate the fine-grained details of the image or require a huge number of parameters, which renders them inefficient for text-to-image synthesis. To fill this gap in the literature, we propose a contrastive learning approach with a novel combination of two loss functions: a fake-to-fake loss to increase the semantic consistency between images generated from the same caption, and a fake-to-real loss to reduce the gap between the distributions of real and generated images. We test this approach on two baseline models: SSAGAN and AttnGAN (with style blocks to enhance the fine-grained details of the images). Results show that our approach improves the qualitative results of AttnGAN with style blocks on the CUB dataset. Additionally, on the challenging COCO dataset, our approach achieves competitive results against the state-of-the-art Lafite model and outperforms the FID scores of the SSAGAN and DALL-E models by 44% and 66.83% respectively, with only around 1% of the model size and training data of the huge DALL-E model.

1. INTRODUCTION

The main aim of the text-to-image generation (T2I) problem is to synthesize high-quality, photo-realistic images that semantically reflect input textual descriptions. It is a challenging computer vision problem with many applications, including computer-aided design, image editing, and art generation. Most recent attempts at this problem use Generative Adversarial Networks (GANs) as the backbone model. Text-conditioned GANs have proven to be a powerful method for generating high-quality images that are semantically consistent with input captions. In practice, however, such models generate images that differ significantly from the ground truth. This is because, in most datasets, each image has several human-written captions that are highly diverse in content and structure. Models must also learn to understand two domains: textual descriptions and visual content. Most models in the literature either lack detail in the generated images or generate fine details but match the textual description less accurately. The former problem arises from employing a loss function that enforces sentence- and word-level matching between the image and the text, such as the DAMSM loss (Xu et al., 2018; Ye et al., 2021; Zhu et al., 2019), without designing a generator capable of producing details in the image at both fine- and coarse-grained levels. This leads to mostly washed-out images that have the general structure matching the text (e.g., a bird with red wings) but lack the details that make the image look real, such as the feathers and the color of the eyes. Moreover, these models mostly use multiple generators, as in Xu et al. (2018); Zhu et al. (2019); Qiao et al. (2019b), which can worsen the aforementioned problem if the progressive growing of the details is not done in a way that ensures image features are learned well
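The contrastive idea behind the fake-to-fake and fake-to-real losses mentioned above can be illustrated with a minimal InfoNCE-style sketch. This is not the paper's exact formulation: the function name `info_nce`, the temperature value, and the toy embeddings are all illustrative assumptions, and in practice the embeddings would come from an image encoder rather than raw vectors.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """Simplified InfoNCE loss for a single anchor embedding:
    pulls the positive embedding close and pushes negatives away.
    Inputs are 1-D vectors; they are L2-normalised internally."""
    def norm(v):
        return v / np.linalg.norm(v)
    a = norm(anchor)
    # cosine similarity to the positive, then to each negative
    sims = [a @ norm(positive)] + [a @ norm(n) for n in negatives]
    logits = np.array(sims) / tau
    # softmax cross-entropy with the positive at index 0
    return -logits[0] + np.log(np.exp(logits).sum())

# Fake-to-fake: two images generated from the SAME caption form the
# positive pair; images generated from other captions are negatives.
# Fake-to-real: a generated image and its ground-truth real image form
# the positive pair, pulling the fake distribution toward the real one.
```

The loss is small when the anchor is most similar to its positive and grows as a negative becomes more similar than the positive, which is what drives generated images of the same caption (or a fake/real pair) toward a shared semantic embedding.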

