COFS: CONTROLLABLE FURNITURE LAYOUT SYNTHESIS

Abstract

Realistic, scalable, and controllable generation of furniture layouts is essential for many applications in virtual reality, augmented reality, game development, and synthetic data generation. The most successful current methods tackle this problem as a sequence generation problem, which imposes a specific ordering on the elements of the layout and makes it hard to exert fine-grained control over the attributes of a generated scene. Existing methods provide control through object-level conditioning, or scene completion, where generation can be conditioned on an arbitrary subset of furniture objects. However, attribute-level conditioning, where generation can be conditioned on an arbitrary subset of object attributes, is not supported. We propose COFS, a method to generate furniture layouts that enables fine-grained control through attribute-level conditioning. For example, COFS allows specifying only the scale and type of the objects that should be placed in the scene, and the generator chooses their positions and orientations; or the positions that should be occupied by objects can be specified, and the generator chooses their type, scale, orientation, etc. Our results show both qualitatively and quantitatively that we significantly outperform existing methods on attribute-level conditioning.

1. INTRODUCTION

Automatic generation of realistic assets enables content creation at a scale that is not possible with traditional manual workflows. It is driven by the growing demand for virtual assets in the creative industries, in virtual worlds, and for increasingly data-hungry deep model training. In the context of automatic asset generation, 3D scene and layout generation plays a central role, as much of the demand is for the types of real-world scenes we see and interact with every day, such as building interiors. Deep generative models for assets like images, videos, 3D shapes, and 3D scenes have come a long way to meet this demand. In 3D scene and layout modeling, autoregressive models based on transformers in particular enjoy great success. Inspired by language modeling, these architectures treat layouts as sequences of tokens that are generated one after the other; the tokens typically represent attributes of furniture objects, such as the type, position, or scale of an object. These architectures are particularly well suited to modeling spatial relationships between the elements of a layout. For example, Para et al. (2021) generate two-dimensional interior layouts with two transformers, one for furniture objects and one for spatial constraints between these objects, while SceneFormer (Wang et al., 2021) and ATISS (Paschalidou et al., 2021) extend interior layout generation to 3D.

A key limitation of a basic autoregressive approach is that it provides only limited control over the generated scene: it enforces a sequential generation order, where new tokens can only be conditioned on previously generated tokens, and in addition it requires a consistent ordering of the token sequence.
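The token-sequence view described above can be made concrete with a small sketch. This is not the authors' code; it is a hypothetical illustration that assumes each furniture object is described by a fixed attribute tuple (type, position, orientation, scale), flattened into one flat sequence of (attribute-name, value) tokens that an autoregressive model would consume one at a time.

```python
# Hypothetical sketch (not from the paper): flattening a furniture layout
# into a flat token sequence, the representation autoregressive layout
# models generate one token at a time.

def flatten_layout(objects):
    """Emit tokens in a fixed per-object attribute order:
    type, position, orientation, scale."""
    tokens = []
    for obj in objects:
        tokens.extend([
            ("type", obj["type"]),
            ("position", obj["position"]),
            ("orientation", obj["orientation"]),
            ("scale", obj["scale"]),
        ])
    return tokens

# A toy two-object scene; names and attribute choices are illustrative.
scene = [
    {"type": "bed",   "position": (1.0, 2.0), "orientation": 90,  "scale": 1.0},
    {"type": "chair", "position": (0.5, 0.5), "orientation": 180, "scale": 0.8},
]
seq = flatten_layout(scene)
# 2 objects x 4 attributes = 8 tokens; a basic autoregressive model
# predicts token i conditioned only on tokens 0..i-1.
```

Because each token sees only its predecessors, any conditioning information must appear as a prefix of this sequence, which is exactly the limitation discussed above.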
This precludes both object-level conditioning, where generation is conditioned on a partial scene, e.g., an arbitrary subset of furniture objects, and attribute-level conditioning, where generation is conditioned on an arbitrary subset of attributes of the furniture objects, e.g., the class or position of target objects. Most recently, ATISS (Paschalidou et al., 2021) partially alleviates this problem by randomly permuting furniture objects during training, effectively enabling object-level conditioning. However, attribute-level conditioning remains elusive.

We aim to improve on these results by enabling attribute-level conditioning in addition to object-level conditioning. For example, a user might want to ask for a room with a table and two chairs, without specifying exactly where these objects should be located. Another example is to perform object queries for given geometry attributes: the user could specify the location of an object and query the most likely class, orientation, and size of an object at that location. Our model thereby extends the ATISS baseline with new functionality while retaining all of its existing properties and performance.

The main technical difficulty in achieving attribute-level conditioning is due to the autoregressive nature of the generative model. Tokens in the sequence that defines a scene are generated iteratively, and each step only has information about the previously generated tokens. Thus, the condition can only be given at the start of the sequence; otherwise, some generation steps will miss part of the conditioning information. The main idea of our work is to enable attribute-level conditioning using two mechanisms: (i) Like ATISS, we train our generator to be approximately invariant to object permutations by randomly permuting furniture objects at training time. This enables object-level conditioning, since an arbitrary subset of objects can be given as the start of the sequence.
To condition on a partial set of object attributes, however, the condition is not restricted to the start of the sequence: attributes given as condition may follow unconstrained attributes that still need to be generated. (ii) To give our autoregressive model knowledge of the entire condition at each step, we additionally use a transformer encoder that provides cross-attention over the complete conditioning information at every generation step. Together, these two mechanisms allow us to accurately condition on arbitrary subsets of the token sequence, for example, only on tokens corresponding to specific object attributes.

In our experiments, we demonstrate four applications: (i) attribute-level conditioning, (ii) attribute-level outlier detection, (iii) object-level conditioning, and (iv) unconditional generation. We compare to three current state-of-the-art layout generation methods (Ritchie et al., 2019; Wang et al., 2021; Paschalidou et al., 2021) and show performance that is on par with or superior to theirs on unconditional generation and object-level conditioning, while also enabling attribute-level conditioning, which, to the best of our knowledge, is not supported by any existing layout generation method.
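The data-side halves of the two mechanisms can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: it shows (i) object-block permutation applied to a flat token sequence at training time, and (ii) a boolean condition mask marking which attribute tokens are given, over which an encoder would cross-attend at every decoding step. All function names and the toy token format are assumptions.

```python
import random

def permute_objects(token_seq, attrs_per_object=4):
    """Mechanism (i), sketched: randomly permute whole objects
    (contiguous blocks of attribute tokens) so that training makes the
    model approximately invariant to object order."""
    blocks = [token_seq[i:i + attrs_per_object]
              for i in range(0, len(token_seq), attrs_per_object)]
    random.shuffle(blocks)
    return [tok for block in blocks for tok in block]

def build_condition_mask(token_seq, given_names):
    """Mechanism (ii), sketched: mark which tokens are given as
    condition. Because the encoder attends over all marked tokens at
    every step, the condition need not form a prefix of the sequence."""
    return [name in given_names for (name, _value) in token_seq]

# One toy object: (attribute-name, value) tokens, illustrative only.
seq = [("type", "bed"), ("position", (1, 2)),
       ("orientation", 90), ("scale", 1.0)]

# Condition only on type and scale; position and orientation are left
# for the generator, mirroring the attribute-level conditioning example.
mask = build_condition_mask(seq, {"type", "scale"})
# mask -> [True, False, False, True]
```

Note that the given tokens (`type`, `scale`) sit at positions 0 and 3, so a purely causal decoder generating position and orientation at steps 1 and 2 would never see the `scale` condition; the encoder's bidirectional attention is what closes that gap.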

2. RELATED WORK

We discuss recent work that we draw inspiration from. In particular, we build on previous work in indoor scene synthesis, masked language models, and set transformers.

Indoor Scene Synthesis: Before the rise of deep-learning methods, indoor scene synthesis relied on layout guidelines developed by skilled interior designers, together with an optimization strategy that maximizes adherence to those guidelines (Yu et al., 2011; Fisher et al., 2012; Weiss et al., 2019). Such optimization is usually based on sampling methods like simulated annealing, MCMC, or rjMCMC. Deep-learning-based methods, e.g., (Wang et al., 2019; Ritchie et al., 2019; Wang et al., 2021; Paschalidou et al., 2021), are substantially faster and can better capture the variability of the design space. The state-of-the-art methods among them are autoregressive in nature, and all of them operate on a top-down view of a partially generated scene. PlanIT and FastSynth then autoregressively generate the rest of the scene. FastSynth uses separate CNNs+MLPs to create probability distributions over location, size, orientation, and category. PlanIT, on the other hand, generates graphs where

Figure 1: Motivation. Current autoregressive layout generators (A) provide limited control over the generated result, since any generated value (denoted by black triangles) can only be conditioned on values that occur earlier in the sequence (values given as condition are denoted with c). Our proposed encoder-decoder architecture (B) adds bidirectional attention through an encoder, allowing the model to look ahead, so that all values in the sequence can be given as condition. This enables conditioning on an arbitrary subset of objects or object attributes in a layout. In C1 and C2, only the position of an object, shown as a pink cuboid, is given as condition, and COFS performs context-aware generation of the remaining attributes. In D1, only object types are provided as condition, and D2 adds the bed orientation to the condition. Note how the layout adapts to fit the updated condition.

