CONTROLLABLE IMAGE GENERATION VIA COLLAGE REPRESENTATIONS

Abstract

Recent advances in conditional generative image models have enabled impressive results. On the one hand, text-based conditional models have achieved remarkable generation quality by leveraging large-scale datasets of image-text pairs. To enable fine-grained controllability, however, text-based models require long prompts, whose details may be ignored by the model. On the other hand, layout-based conditional models have also witnessed significant advances. These models rely on bounding boxes or segmentation maps for precise spatial conditioning, in combination with coarse semantic labels. The semantic labels, however, cannot express detailed appearance characteristics. In this paper, we approach fine-grained scene controllability through image collages, which allow a rich visual description of the desired scene as well as of the appearance and location of the objects therein, without the need for class or attribute labels. We introduce "mixing and matching scenes" (M&Ms), an approach consisting of an adversarially trained generative image model that is conditioned on the appearance features and spatial positions of the different elements in a collage, and integrates them into a coherent image. We train our model on the OpenImages (OI) dataset and evaluate it on collages derived from the OI and MS-COCO datasets. Our experiments on the OI dataset show that M&Ms outperforms baselines in terms of fine-grained scene controllability while being very competitive in terms of image quality and sample diversity. On the MS-COCO dataset, we highlight the generalization ability of our model by outperforming DALL-E in terms of zero-shot FID, despite using two orders of magnitude fewer parameters and training data. Collage-based generative models have the potential to advance content creation efficiently and effectively, as they are intuitive to use and yield high-quality generations.

1. INTRODUCTION

Controllable image generation leverages user inputs, e.g. textual descriptions, scene graphs, bounding box layouts, or segmentation masks, to guide the creative process of composing novel scenes. Text-based conditionings offer an intuitive mechanism to control content creation, and short, potentially high-level descriptions can result in high-quality generations (Ding et al., 2021; Gafni et al., 2022; Nichol et al., 2022; Ramesh et al., 2021; Reed et al., 2016; Rombach et al., 2022). However, describing complex scenes in detail requires long text prompts, which are challenging for current models; see e.g. the person's position in the second row of Figure 1. Moreover, current text-based models require very large training datasets, composed of tens of millions of data points, to reach satisfactory performance levels. Bounding box (BB) (Sun & Wu, 2019; 2020; Sylvain et al., 2021; Zhao et al., 2019), scene graph (Ashual & Wolf, 2019), and segmentation mask (Chen & Koltun, 2017; Liu et al., 2019; Park et al., 2019; Qi et al., 2018; Schönfeld et al., 2021; Tang et al., 2020b; Wang et al., 2018; 2021) conditionings offer strong spatial and class-level semantic control, but no control over the appearance of scene elements beyond the class level. Although user interaction remains rather intuitive, the diversity of the generated scenes is often limited, see e.g. the third and fourth rows of Figure 1, and the annotations required to train these models are laborious to obtain. Moreover, the generalization ability of these approaches is restricted to the classes and scene compositions appearing in the training set (Casanova et al., 2020).

In this paper, we explore fine-grained scene generation controllability by leveraging image collages to condition the model. As the saying goes, a picture is worth a thousand words, and therefore,
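For concreteness, a collage conditioning of the kind discussed above can be thought of as a set of (image crop, bounding box) pairs that jointly specify the appearance and location of each scene element. The sketch below illustrates this representation and how the raw crops could be pasted onto a canvas; all names (`CollageElement`, `rasterize_collage`) are hypothetical and do not correspond to the paper's actual implementation, which conditions on learned appearance features rather than raw pixels.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CollageElement:
    """One scene element: a pixel crop and its target placement."""
    crop: np.ndarray  # (h, w, 3) uint8 patch giving the element's appearance
    bbox: tuple       # (x0, y0, x1, y1) in [0, 1], the element's position

def rasterize_collage(elements, size=256):
    """Paste each crop into its bounding box to form the raw collage
    that could serve as input to a conditional generator."""
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    for el in elements:
        x0, y0, x1, y1 = (int(round(c * size)) for c in el.bbox)
        h, w = y1 - y0, x1 - x0
        if h <= 0 or w <= 0:
            continue  # degenerate box, nothing to paste
        # Nearest-neighbour resize of the crop to the box
        # (a stand-in for a proper resampling routine).
        ys = np.arange(h) * el.crop.shape[0] // h
        xs = np.arange(w) * el.crop.shape[1] // w
        canvas[y0:y1, x0:x1] = el.crop[ys][:, xs]
    return canvas
```

Note that such a naive paste produces hard seams between elements; the role of the generative model is precisely to integrate the elements into a coherent image rather than reproduce the collage verbatim.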

