PROBING INTO THE FINE-GRAINED MANIFESTATION IN MULTI-MODAL IMAGE SYNTHESIS

Anonymous authors
Paper under double-blind review

Abstract

The ever-growing development of multi-modal image synthesis brings unprecedented realism to generation tasks. In practice, it is straightforward to judge the visual quality and realism of an image. However, it is labor-intensive to verify the semantic consistency of an automatic generation, which requires a comprehensive understanding of, and mapping between, different modalities. The results of existing models are sorted and displayed largely based on global visual-text similarity. However, this coarse-grained approach does not capture the fine-grained semantic alignment between image regions and text spans. To address this issue, we first present a new method to evaluate cross-modal consistency by inspecting decomposed semantic concepts. We then introduce a new metric, called MIS-Score, which is designed to quantitatively measure the fine-grained semantic alignment between a prompt and its generation. Moreover, we have also developed an automated robustness testing technique with referential transforms to test and measure the robustness of multi-modal synthesis models. We have conducted comprehensive experiments to evaluate the performance of recent popular models for text-to-image generation. Our study demonstrates that the proposed metric MIS-Score provides better evaluation criteria than existing coarse-grained ones (e.g., CLIP score) for understanding the semantic consistency of synthesized results. Our robustness testing method also reveals biases embedded in the models, hence uncovering their limitations in real applications.

1. INTRODUCTION

Multi-modal image synthesis (Esser et al., 2021; Ramesh et al., 2021; Rombach et al., 2022) aims to generate images given input prompts such as natural language descriptions or a set of keywords. This multi-modal synthesis has a variety of potential applications, e.g., computer-aided design, text-guided photo editing, etc. In recent years, the field has witnessed unprecedented development driven by advances in deep learning techniques such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2020; Zhang et al., 2017; Tao et al., 2020) and cross-modal pre-training models, e.g., CLIP (Radford et al., 2021; Ramesh et al., 2021). Compared to the visual clues (segmentation maps, regional edges, etc.) adopted for single-modal image synthesis, cross-modal guidance, e.g., language descriptions, provides an alternative but more intuitive and natural way to express semantic concepts. The flexibility of text-to-image generation greatly lowers the barrier for a wide range of public users to unleash their creativity in image generation and editing. However, due to the substantial domain gap, effective transfer and fusion of heterogeneous information from different modalities remains a major challenge in multi-modal synthesis tasks. Moreover, many-to-many mapping relationships usually exist in multi-modal image synthesis, i.e., one image may correspond to multiple textual descriptions and vice versa. In practice, the synthesized images are not always consistent with the text prompts given by users. Additional effort is usually required to manually validate and select the satisfactory synthesized results. Various aspects of the synthesized images need to be assessed, e.g., visual quality and the accuracy of objects, attributes, and contextual information. Thus, an automatic evaluation process and a comprehensive, objective evaluation metric are of vital importance in assessing the effectiveness of multi-modal image synthesis models.
To achieve these goals, previous works usually report multiple metrics to cover different aspects. The primary factors considered in the evaluation process are image quality and text-image similarity. Popular metrics include Inception Score (IS) (Salimans et al., 2016) and Frechet Inception Distance (FID) (Heusel et al., 2017) for assessing image fidelity, and R-precision (RP) (Xu et al., 2018) and CLIP score (Radford et al., 2021) for measuring cross-modal alignment. These metrics work well for generation from simple prompts, e.g., the description of a single object. However, for prompts with multiple objects and additional context information, simply adopting these metrics is insufficient and may lead to inaccurate or inconsistent results. As shown in the first row of Fig. 1, the ranking of the synthesized images based on one of the most commonly used metrics (i.e., CLIP score) is not strongly correlated with the language descriptions. Additionally, the existing metrics lack insight into the assessment of fine-grained semantic concepts, such as object existence, attribute accuracy, spatial location, etc. These factors are critical in evaluating the performance of multi-modal image synthesis methods, especially when input prompts are composed of multiple semantic concepts. To address the above issues, we introduce MIS-Score, a new metric for Multi-modal Image Synthesis to measure cross-modal semantic consistency by inspecting and capturing the fine-grained mapping between the input language and output vision modalities. As shown in Fig. 1, given a prompt "A cat lying in a bucket", we first parse the semantic concepts in the language description as ("a cat", "a bucket"). We then perform visual grounding by locating each semantic concept among the visual components of the generated image to calculate the fine-grained text-image alignment. These semantic concepts include, but are not limited to: subject, object, location, attribute, and relationship.
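The concept-parsing step described above can be sketched as follows. This is a deliberately minimal, rule-based illustration: the paper's actual parser presumably relies on a full linguistic analysis (e.g., dependency parsing), whereas this toy version only splits a prompt on a small hypothetical list of connective phrases to recover candidate noun phrases.

```python
import re

# Hypothetical, hand-picked connectives for illustration only; a real
# semantic parser would not use a fixed list like this.
CONNECTIVES = r"\b(?:lying in|besides|beside|next to|under|with|on|in)\b"

def extract_concepts(prompt: str) -> list[str]:
    """Split a prompt into candidate semantic concepts (noun phrases)."""
    parts = re.split(CONNECTIVES, prompt.lower())
    return [p.strip() for p in parts if p.strip()]

print(extract_concepts("A cat lying in a bucket"))    # ['a cat', 'a bucket']
print(extract_concepts("A car besides a blue cube"))  # ['a car', 'a blue cube']
```

Each extracted concept would then be passed to a visual grounding model to locate its region in the generated image before the per-concept alignment is computed.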
The cross-modal consistency is measured with MIS-Score by aggregating the alignment score over each semantic concept. Based on the proposed metric, we develop an automatic testing technique, referential transform, to evaluate the robustness of models under different combinations of visual concepts. The key idea is that an accurate multi-modal synthesis model should produce consistent generations given prompts with similar meanings. Correspondingly, models should produce different results given prompts with different semantic concepts. Moreover, when any concept in the prompt is mutated, the synthesized result should remain consistent with the mutated input. In this way, the robustness of models can be evaluated and measured by MIS-Score. Our major contributions in this work are summarized as follows:

• To the best of our knowledge, we are the first to propose an approach for measuring fine-grained semantic consistency in multi-modal image synthesis tasks.
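The aggregation and robustness-testing logic described above can be sketched as follows. This is a minimal illustration under stated assumptions: the aggregation rule (a plain mean over concepts) and the pass/fail margin are hypothetical choices made here for clarity, and the paper's MIS-Score may weight or combine concept scores differently.

```python
# Sketch, not the paper's implementation: aggregation rule and margin
# threshold are illustrative assumptions.

def mis_score(concept_scores: dict[str, float]) -> float:
    """Aggregate per-concept grounding/alignment scores into one value
    (assumed here to be a simple mean over all parsed concepts)."""
    if not concept_scores:
        return 0.0
    return sum(concept_scores.values()) / len(concept_scores)

def referential_transform_check(score_original: float,
                                score_mutated: float,
                                margin: float = 0.1) -> bool:
    """If a concept in the prompt is mutated, the original generation
    should align noticeably worse with the mutated prompt; `margin`
    is a hypothetical sensitivity threshold."""
    return score_original - score_mutated >= margin

# Example: per-concept scores for "A car besides a blue cube"
scores = {"a car": 0.92, "a blue cube": 0.78}
print(round(mis_score(scores), 2))  # 0.85
```

A model that scores nearly the same against the original and a concept-mutated prompt would fail this check, signaling that it ignores the mutated concept, which is the kind of embedded bias the robustness test is designed to surface.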



Figure 1: The generated images of state-of-the-art multi-modal image synthesis models (DALL-E (Cho et al., 2022), Composable Diffusion (Liu et al., 2022), Stable Diffusion (Rombach et al., 2022)) selected by CLIP score and MIS-Score for two prompts: (i) A cat lying in a bucket. (ii) A car besides a blue cube. The green and orange bounding boxes in the prompts and the images indicate the fine-grained alignment of semantic concepts.

