PROBING INTO THE FINE-GRAINED MANIFESTATION IN MULTI-MODAL IMAGE SYNTHESIS

Anonymous authors
Paper under double-blind review

Abstract

Rapid progress in multi-modal image synthesis has brought unprecedented realism to generation tasks. In practice, judging the visual quality and realism of an image is straightforward. Verifying the semantic consistency of an automatically generated image is, however, labor-intensive, because it requires a comprehensive understanding of the different modalities and of the mapping between them. The results of existing models are largely ranked and displayed according to global visual-text similarity. This coarse-grained approach does not capture the fine-grained semantic alignment between image regions and text spans. To address this issue, we first present a new method that evaluates cross-modal consistency by inspecting decomposed semantic concepts. We then introduce a new metric, MIS-Score, designed to quantitatively measure the fine-grained semantic alignment between a prompt and its generated image. We also develop an automated robustness testing technique that uses referential transforms to test and measure the robustness of multi-modal synthesis models. We conduct comprehensive experiments to evaluate recent popular text-to-image generation models. Our study demonstrates that MIS-Score is a better evaluation criterion than existing coarse-grained ones (e.g., CLIP) for understanding the semantic consistency of synthesized results. Our robustness testing method also reveals biases embedded in the models, uncovering their limitations in real applications.



Figure 1: Images generated by state-of-the-art multi-modal image synthesis models (DALL-E (Cho et al., 2022), Composable Diffusion (Liu et al., 2022), Stable Diffusion (Rombach et al., 2022)), as selected by CLIP score and by MIS-Score for two prompts: (i) A cat lying in a bucket. (ii) A car besides a blue cube. The green and orange bounding boxes in the prompts and the images indicate the fine-grained alignment of semantic concepts.

