MULTIMODAL ATTENTION FOR LAYOUT SYNTHESIS IN DIVERSE DOMAINS

Anonymous

Abstract

We address the problem of scene layout generation for diverse domains such as images, mobile applications, documents, and 3D objects. Most complex scenes, natural or human-designed, can be expressed as a meaningful arrangement of simpler compositional graphical primitives. Generating a new layout or extending an existing layout requires understanding the relationships between these primitives. To this end, we propose a multimodal attention framework, MMA, that leverages self-attention to learn contextual relationships between layout elements and generate novel layouts in a given domain. Our framework can generate a new layout either from an empty set or from an initial seed set of primitives, and easily scales to support an arbitrary number of primitives per layout. Further, our analyses show that the model automatically captures the semantic properties of the primitives. We propose simple improvements to both the representation of layout primitives and the training methods, and demonstrate competitive performance on very diverse data domains: object bounding boxes in natural images (COCO bounding boxes), documents (PubLayNet), mobile applications (RICO dataset), and 3D shapes (PartNet).

1. INTRODUCTION

In the real world, there exists a strong relationship between different objects that are found in the same environment (Torralba & Sinha, 2001; Shrivastava & Gupta, 2016). For example, a dining table usually has chairs around it, a surfboard is found near the sea, horses do not ride cars, etc. Biederman (2017) provided strong evidence in cognitive neuroscience that perceiving and understanding a scene involves two related processes: perception and comprehension. Perception deals with processing the visual signal or the appearance of a scene. Comprehension deals with understanding the schema of a scene, where this schema (or layout) can be characterized by contextual relationships (e.g., support, occlusion, and relative likelihood, position, and size) between objects. For generative models that synthesize scenes, this evidence underpins the importance of two factors that contribute to the realism or plausibility of a generated scene: layout, i.e., the arrangement of different objects, and their appearance (in terms of pixels). Generating a realistic scene therefore requires both factors to be plausible. Advances in generative models for image synthesis have primarily targeted the plausibility of the appearance signal, generating incredibly realistic images, often of a single entity such as faces (Karras et al., 2019; 2017) or animals (Brock et al., 2018; Zhang et al., 2018). For large and complex scenes, with many strong non-local relationships between different elements, most methods require proxy representations of the layout to be provided as input (e.g., a scene graph, segmentation mask, or sentence). We argue that to plausibly generate large and complex scenes without such proxies, it is necessary to understand and generate the layout of a scene, in terms of the contextual relationships between the various objects present in it.
The layout of a scene, capturing which primitives occupy which parts of the scene, is an incredibly rich representation. Learning to generate layouts is itself a challenging problem due to the variability of real-world and human-designed layouts. Each layout is composed of a small fraction of the possible objects, objects can be present in a wide range of locations, and the number of objects varies per scene, as do the contextual relationships between them. Formally, a scene layout can be represented as an unordered set of graphical primitives. The primitive itself can be discrete or continuous depending on the data domain. For example, in document layouts, primitives can be bounding boxes from discrete classes such as 'text', 'image', or 'caption'; for 3D objects, primitives can be 3D occupancy grids of object parts such as the 'arm', 'leg', or 'back' of a chair. Additionally, to make the primitives compositional, we represent each primitive by a location vector with respect to the origin and a scale vector that defines the bounding box enclosing the primitive. Depending on the domain, these location and scale vectors can be 2D or 3D. A generative model for layouts should be able to look at all existing primitives and propose the placement and attributes of a new one. We propose a novel Multimodal Attention framework (MMA) that first maps the different parameters of a primitive independently to a fixed-length continuous latent vector, followed by a masked Transformer decoder that attends to the representations of existing primitives in the layout and predicts the next parameter. Our generative framework can start from an empty set or a seed set of primitives, and iteratively generates a new primitive one parameter at a time. Moreover, by predicting either to stop or to generate the next primitive, our sequential approach can generate variable-length layouts.
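The sequential, parameter-at-a-time factorization above can be sketched as flattening each layout into a single token sequence that a masked decoder models autoregressively. The sketch below is illustrative only: the category names, the 8-bit coordinate grid, and the tuple-token encoding are our assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of serializing a 2D layout for autoregressive
# modeling: each primitive contributes one category token followed by
# discretized location (x, y) and scale (w, h) tokens.

BOS, EOS = "<bos>", "<eos>"
CATEGORIES = ["text", "image", "caption"]  # assumed label set


def quantize(v, bins=256):
    """Map a normalized coordinate in [0, 1] to one of `bins` discrete tokens."""
    return min(int(v * bins), bins - 1)


def flatten_layout(elements, bins=256):
    """Turn (category, x, y, w, h) primitives into one flat sequence:
    [BOS, c1, x1, y1, w1, h1, c2, ..., EOS].
    A masked decoder then predicts each token given all previous ones,
    and emitting EOS corresponds to the 'stop' decision in the text."""
    seq = [BOS]
    for cat, x, y, w, h in elements:
        seq.append(("cat", CATEGORIES.index(cat)))
        for v in (x, y, w, h):
            seq.append(("coord", quantize(v, bins)))
    seq.append(EOS)
    return seq


layout = [("text", 0.1, 0.1, 0.8, 0.2), ("image", 0.1, 0.4, 0.8, 0.5)]
tokens = flatten_layout(layout)
# 1 BOS token + 2 primitives x 5 parameter tokens + 1 EOS token = 12 tokens
```

Seeding generation with an existing partial layout simply means conditioning the decoder on a non-empty prefix of such a sequence.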
Our approach can be readily plugged into scene generation frameworks (e.g., Layout2Image (Zhao et al., 2019), GauGAN (Park et al., 2019b)) or stand-alone applications that require generating layouts or templates with or without user interaction. For instance, in the UI design of mobile apps and websites, an automated model for generating plausible layouts can significantly decrease the manual effort and cost of building such apps and websites. Finally, a model that creates layouts can potentially help generate synthetic data for various tasks (Yang et al., 2017; Capobianco & Marinai, 2017; Chang et al., 2015; Wu et al., 2017b; a). To the best of our knowledge, MMA is the first framework to perform competitively with state-of-the-art approaches across 4 diverse data domains. We evaluate our model using existing metrics proposed for the different domains: Jensen-Shannon Divergence, Minimum Matching Distance, and Coverage for 3D objects; Inception Score and Fréchet Inception Distance for COCO; and Negative Log-Likelihood of the test set for app wireframes and documents. Qualitative analysis also demonstrates that our model captures the semantic relationships between objects automatically (without explicitly using semantic embeddings such as word2vec (Mikolov et al., 2013)).
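For the likelihood-based domains (documents and app wireframes), the evaluation reduces to averaging the model's negative log-probability of each held-out token. A minimal sketch, assuming the model exposes a probability for every ground-truth token:

```python
import math


def layout_nll(token_probs):
    """Average negative log-likelihood of a held-out layout, given the
    model's predicted probability for each ground-truth token.
    Lower is better."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# A confident model vs. a uniform baseline over 256 coordinate bins:
confident = layout_nll([0.9, 0.8, 0.85, 0.9])
uniform = layout_nll([1 / 256] * 4)  # equals log(256), the uniform entropy
```

The uniform baseline gives a useful reference point: any model whose per-token NLL falls below log(vocabulary size) has learned structure beyond chance.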

2. RELATED WORK

Generative models. Deep generative models based on CNNs, such as variational auto-encoders (VAEs) (Kingma & Welling, 2013) and generative adversarial networks (GANs) (Goodfellow et al., 2014), have recently shown great promise in faithfully learning a given data distribution and sampling from it. There has also been research on generating data sequentially (Oord et al., 2016; Chen et al., 2020), even when the data has no natural order (Vinyals et al., 2015). Many of these approaches rely on low-level information (Gupta et al., 2020b) such as pixels while generating images (Brock et al., 2018; Karras et al., 2019), videos (Vondrick et al., 2016), or 3D objects (Wu et al., 2016; Yang et al., 2019; Park et al., 2019a; Gupta et al., 2020a), rather than on the semantic and geometric structure in the data.

Scene generation. Generating 2D or 3D scenes conditioned on a sentence (Li et al., 2019d; Zhang et al., 2017; Reed et al., 2016), a scene graph (Johnson et al., 2018; Li et al., 2019a; Ashual & Wolf, 2019), a layout (Dong et al., 2017; Hinz et al., 2019; Isola et al., 2016; Wang et al., 2018b), or an existing image (Lee et al., 2018) has drawn great interest in the vision community. Given the input, some works generate a fixed layout and diverse scenes (Zhao et al., 2019), while others generate diverse layouts and scenes (Johnson et al., 2018; Li et al., 2019d). These methods involve pipelines often trained and evaluated end-to-end, and surprisingly little work has been done to evaluate the layout generation component itself. Layout generation serves as a complementary task and can be combined with these methods. In this work, we evaluate the layout modeling



Figure 1: Our framework can synthesize layouts in diverse natural as well as human designed data domains such as natural scenes or 3D objects in a sequential manner.

