TRAINING-FREE STRUCTURED DIFFUSION GUIDANCE FOR COMPOSITIONAL TEXT-TO-IMAGE SYNTHESIS

Abstract

Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality and creative images, we observe that attribute binding and compositional capabilities remain major challenges, especially when multiple objects are involved. Attribute binding requires the model to associate objects with the correct attribute descriptions, and compositional skills require the model to combine multiple concepts into a single image. In this work, we improve these two aspects of T2I models to achieve more accurate image compositions. To do so, we incorporate linguistic structures into the diffusion guidance process, building on the controllable properties of cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers carry strong semantic meanings associated with object layouts and content. Therefore, by manipulating the cross-attention representations based on linguistic insights, we can better preserve the compositional semantics in the generated image. Built upon Stable Diffusion, a SOTA T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional skills in qualitative and quantitative results, leading to a significant 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image compositions and to justify the properties of cross-attention layers in the generation process.

1. INTRODUCTION

Text-to-image synthesis (T2I) aims to generate natural and faithful images given a text prompt as input. Recently, extremely large-scale vision-language models such as DALL-E 2 (Ramesh et al., 2022), Imagen (Saharia et al., 2022), and Parti (Yu et al., 2022) have significantly advanced the quality of generated images. In particular, Stable Diffusion (Rombach et al., 2022) is the state-of-the-art open-source implementation, showing superior evaluation metrics after training on billions of text-image pairs. In addition to generating high-fidelity images, the ability to compose multiple objects into a coherent scene is also essential. Given a text prompt from the user, T2I models need to generate an image that contains all the visual concepts mentioned in the text. Achieving this ability requires the model to understand both the full prompt and the individual linguistic concepts within it. As a result, the model should be able to combine multiple concepts and generate novel objects that never appeared in the training data. In this work, we focus on improving the compositionality of the generation process, as it is essential for controllable and generalized text-to-image synthesis with multiple objects in a complex scene.

Attribute binding is a critical compositionality challenge (Ramesh et al., 2022; Saharia et al., 2022) for existing large-scale diffusion-based models. Despite improvements in generating multiple objects in the same scene, existing models still fail on a prompt such as "a brown bench in front of a white building" (see Fig. 1). The output images instead contain "a white bench" and "a brown building", potentially due to strong training-set bias or imprecise language understanding. From a practical perspective, explaining and solving this two-object binding challenge is a primary step toward understanding more complex prompts with multiple objects. Therefore, how to bind attributes to the correct objects is the central question we study. Even though state-of-the-art (SOTA) T2I models are trained on large-scale text-image datasets, they can still produce inaccurate results for simple prompts similar to the example above. Hence, we are motivated to seek an alternative, data-efficient method to improve compositionality.

We observe that attribute-object relation pairs can be obtained for free as text spans from the parse tree of the sentence. We therefore propose to combine structured representations of prompts, such as a constituency tree or a scene graph, with the diffusion guidance process. Text spans only depict limited regions of the whole image. Conventionally, spatial information such as coordinates (Yang et al., 2022) is required to map their semantics into the corresponding image regions; however, coordinate inputs cannot be interpreted by T2I models. Instead, we exploit the observation that attention maps provide free token-region associations in trained T2I models (Hertz et al., 2022). By modifying the key-value pairs in cross-attention layers, we map the encoding of each text span into its attended regions in 2D image space. In this work, we find similar behavior in Stable Diffusion (Rombach et al., 2022) and utilize this property to build structured cross-attention guidance.
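To make this property concrete, the following is a minimal, single-head PyTorch sketch of the kind of cross-attention block used in diffusion U-Nets. The module and argument names are ours, and the block is heavily simplified relative to Stable Diffusion's actual layers; it is meant only to show where the keys and values enter the computation. The keys determine the token-region attention map, while the values determine what content is written into the attended regions.

```python
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Simplified single-head cross-attention block (illustrative sketch).

    Queries come from image latents; keys/values come from a text encoding,
    so the attention map softmax(QK^T / sqrt(d)) associates each text token
    with spatial regions of the image.
    """

    def __init__(self, dim_latent, dim_text, dim_head=64):
        super().__init__()
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(dim_latent, dim_head, bias=False)
        self.to_k = nn.Linear(dim_text, dim_head, bias=False)
        self.to_v = nn.Linear(dim_text, dim_head, bias=False)

    def forward(self, latents, text_emb, value_emb=None):
        # latents:   (batch, num_pixels, dim_latent), flattened spatial features
        # text_emb:  (batch, num_tokens, dim_text), e.g. a CLIP text encoding
        # value_emb: optional alternative text encoding used only for the values,
        #            e.g. the encoding of a single noun-phrase span (assumption
        #            for illustration, not Stable Diffusion's stock interface)
        q = self.to_q(latents)
        k = self.to_k(text_emb)
        v = self.to_v(value_emb if value_emb is not None else text_emb)
        attn = (q @ k.transpose(-1, -2) * self.scale).softmax(dim=-1)
        # attn[b, p, t]: how strongly pixel p attends to token t
        return attn @ v
```

In this sketch, passing a span-specific encoding as `value_emb` while keeping the full-prompt encoding for the keys preserves the token-region layout implied by the prompt but injects the span's content into its attended regions, which is the behavior our guidance relies on.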
Specifically, we use language parsers to obtain hierarchical structures from the prompts. We extract text spans at all levels, including visual concepts or entities, and encode them separately to disentangle the attribute-object pairs from each other. Compared to guiding with a single sequence of text embeddings, we improve compositionality by using multiple sequences, each emphasizing an entity or a union of entities from different hierarchies in the structured language representation (a minimal code sketch follows our contribution list below). We refer to our method as Structured Diffusion Guidance (StructureDiffusion).

Our contributions can be summarized as three-fold:
• We propose an intuitive and effective method to improve compositional text-to-image synthesis by utilizing structured representations of language inputs. Our method is efficient and training-free, requiring no additional training samples.
• Experimental results show that our method achieves more accurate attribute binding and compositionality in the generated images. We also propose a benchmark named Attribute Binding Contrast set (ABC-6K) to measure the compositional skills of T2I models.
• We conduct extensive experiments and analysis to identify the causes of incorrect attribute binding, which points to future directions for improving the faithfulness and compositionality of text-to-image synthesis.
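As a concrete illustration of the span extraction and separate encoding described above, the sketch below hard-codes a constituency parse of the running example (in practice an off-the-shelf parser would produce it), collects all noun-phrase spans, and encodes each span separately with the frozen CLIP text encoder used by Stable Diffusion v1. The helper names and the hard-coded parse are illustrative assumptions, not a reference implementation.

```python
import torch
from nltk import Tree
from transformers import CLIPTokenizer, CLIPTextModel

# A constituency parse of the prompt, hard-coded here for readability;
# any off-the-shelf constituency parser would normally supply it.
parse = Tree.fromstring(
    "(S (NP (DT a) (JJ brown) (NN bench)) (PP (IN in) (NP (NP (NN front)) "
    "(PP (IN of) (NP (DT a) (JJ white) (NN building))))))"
)

# Collect noun-phrase spans at every level of the tree (pre-order):
# ['a brown bench', 'front of a white building', 'front', 'a white building']
spans = [" ".join(t.leaves()) for t in parse.subtrees(lambda t: t.label() == "NP")]

# Frozen CLIP text encoder used by Stable Diffusion v1 (loaded via HuggingFace).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a brown bench in front of a white building"


def encode(text):
    # Pad to the fixed 77-token context so all sequences share one length.
    tokens = tokenizer(text, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        return text_encoder(tokens.input_ids).last_hidden_state  # (1, 77, 768)


full_emb = encode(prompt)
span_embs = [encode(s) for s in spans]
# Each span embedding can then serve as an alternative value input to the
# cross-attention sketch above, so that e.g. "brown" stays bound to "bench".
```

Padding every sequence to the same 77-token length keeps the span encodings shape-aligned with the full-prompt encoding, so in this sketch they can be swapped in at the value projection without changing the attention map computed from the keys.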



Figure 1: Three challenging phenomena in compositional generation. Attribute leakage: the attribute of one object is (partially) observable on another object. Interchanged attributes: the attributes of two or more objects are swapped. Missing objects: one or more objects are missing. With a slight abuse of the attribute-binding definition, we aim to address all three problems in this work.

