CONTROLLABLE IMAGE GENERATION VIA COLLAGE REPRESENTATIONS

Abstract

Recent advances in conditional generative image models have enabled impressive results. On the one hand, text-based conditional models have achieved remarkable generation quality by leveraging large-scale datasets of image-text pairs. To enable fine-grained controllability, however, text-based models require long prompts, whose details may be ignored by the model. On the other hand, layout-based conditional models have also witnessed significant advances. These models rely on bounding boxes or segmentation maps for precise spatial conditioning in combination with coarse semantic labels. The semantic labels, however, cannot be used to express detailed appearance characteristics. In this paper, we approach fine-grained scene controllability through image collages, which allow a rich visual description of the desired scene as well as the appearance and location of the objects therein, without the need for class or attribute labels. We introduce "mixing and matching scenes" (M&Ms), an approach that consists of an adversarially trained generative image model which is conditioned on appearance features and spatial positions of the different elements in a collage, and integrates these into a coherent image. We train our model on the OpenImages (OI) dataset and evaluate it on collages derived from the OI and MS-COCO datasets. Our experiments on the OI dataset show that M&Ms outperforms baselines in terms of fine-grained scene controllability while being very competitive in terms of image quality and sample diversity. On the MS-COCO dataset, we highlight the generalization ability of our model by outperforming DALL-E in terms of the zero-shot FID metric, despite using two orders of magnitude fewer parameters and training data. Collage-based generative models have the potential to advance content creation in an efficient and effective way, as they are intuitive to use and yield high-quality generations.

1. INTRODUCTION

Controllable image generation leverages user inputs, e.g. textual descriptions, scene graphs, bounding box layouts, or segmentation masks, to guide the creative process of composing novel scenes. Text-based conditionings offer an intuitive mechanism to control content creation, and short and potentially high-level descriptions can result in high-quality generations (Ding et al., 2021; Gafni et al., 2022; Nichol et al., 2022; Ramesh et al., 2021; Reed et al., 2016; Rombach et al., 2022). However, describing complex scenes in detail requires long text prompts, which are challenging for current models; see e.g. the person's position in the second row of Figure 1. Moreover, current text-based models require very large training datasets composed of tens of millions of data points to obtain satisfactory performance levels. Bounding box (BB) (Sun & Wu, 2019; 2020; Sylvain et al., 2021; Zhao et al., 2019), scene graph (Ashual & Wolf, 2019) and segmentation mask (Chen & Koltun, 2017; Liu et al., 2019; Park et al., 2019; Qi et al., 2018; Schönfeld et al., 2021; Tang et al., 2020b; Wang et al., 2018; 2021) conditionings offer strong spatial and class-level semantic control, but no control over the appearance of scene elements beyond the class level. Although user interaction is still rather intuitive, the diversity of the generated scenes is often limited, see e.g. the third and fourth rows of Figure 1, and the annotations required to train the models are laborious to obtain. Moreover, the generalization ability of these approaches is restricted by the classes and scene compositions appearing in the training set (Casanova et al., 2020). In this paper, we explore fine-grained scene generation controllability by leveraging image collages to condition the model. As the saying goes, a picture is worth a thousand words, and therefore,

the rich information contained in image collages has the potential to effectively guide the scene generation process, see the first row of Figure 1. Collage-based conditionings can be easily created from a set of images with minimal user interaction, and provide a detailed visual description of the scene appearance and composition. Moreover, as image collages by construction do not require any semantic labels, leveraging them holds the promise of benefiting from very large and easy-to-obtain datasets to improve the resulting image quality. To enable collage-based scene controllability, we introduce Mixing & Matching scenes (M&Ms), an approach that extends the instance-conditioned GAN (IC-GAN; Casanova et al., 2021) by leveraging image collages and treating each element of the collage as a separate instance. In particular, M&Ms takes as input an image collage, extracts representations of each of its elements, and spatially arranges these representations to generate high-quality images that are similar to the input collage. M&Ms is composed of a pre-trained feature extractor, a generator, and two discriminators, operating at the image and object level respectively. Similar to IC-GAN, M&Ms leverages the neighbors of the instances in the feature space to model local densities, but extends the framework to blend multiple localized distributions, one per collage element, into coherent images. We train M&Ms on the OpenImages dataset (Kuznetsova et al., 2020) and validate it using collages derived from OpenImages and MS-COCO.
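To make the conditioning pipeline concrete, the following is a minimal numpy sketch of how collage elements could be mapped to a spatial conditioning map for the generator. All names are hypothetical and the feature extractor is replaced by a fixed random projection; in the actual model, embeddings come from a pre-trained encoder and the conditioning is consumed by a GAN generator trained with image- and object-level discriminators.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained feature extractor (hypothetical): a fixed random
# projection mapping a flattened 8x8 RGB crop to a 64-d instance embedding.
W = rng.standard_normal((8 * 8 * 3, 64)) * 0.01

def extract_features(crop):
    """Embed one collage element (an 8x8 RGB crop) into a 64-d vector."""
    return crop.reshape(-1) @ W

def build_conditioning(elements, grid=16, dim=64):
    """Scatter each element's embedding into the spatial cells covered by its
    bounding box (x0, y0, x1, y1 in [0, 1]), producing a (grid, grid, dim)
    map that a generator could consume alongside a noise vector."""
    cond = np.zeros((grid, grid, dim))
    for crop, (x0, y0, x1, y1) in elements:
        h = extract_features(crop)
        r0 = int(y0 * grid)
        r1 = max(r0 + 1, int(y1 * grid))  # cover at least one cell
        c0 = int(x0 * grid)
        c1 = max(c0 + 1, int(x1 * grid))
        cond[r0:r1, c0:c1] += h  # overlapping elements blend additively
    return cond

# Two collage elements: a background covering the frame and a smaller patch.
background = rng.random((8, 8, 3))
person = rng.random((8, 8, 3))
elements = [
    (background, (0.0, 0.0, 1.0, 1.0)),
    (person, (0.1, 0.6, 0.3, 0.95)),
]

cond = build_conditioning(elements)
print(cond.shape)  # (16, 16, 64)
```

The key design point this sketch illustrates is that each collage element is treated as a separate instance: its appearance is captured by an embedding, while its bounding box determines where that embedding is placed in the spatial conditioning.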



Text prompt used in Figure 1 (beginning truncated in the extracted text): "… the mountains with a forest in the background, with a person in sport clothes on the bottom-left corner of the image and a wooden house on the top-right corner of the image."

Figure 1: Approaches to controllable scene generation (from top): our approach based on collages (M&Ms), a text-to-image model (Make-a-scene (Gafni et al., 2022); samples courtesy of the paper authors), a BB-to-image model (LostGANv2 (Sun & Wu, 2020)), and a Mask-to-image model (GauGAN2 (Park et al., 2019)). Note that the models have been trained on different datasets; as such, GauGAN2 cannot generate people, as its training dataset does not contain this class. For simplicity, the input collage to M&Ms is visualized as a collaged RGB image.

