COFS: CONTROLLABLE FURNITURE LAYOUT SYNTHESIS

Abstract

Realistic, scalable, and controllable generation of furniture layouts is essential for many applications in virtual reality, augmented reality, game development, and synthetic data generation. The most successful current methods cast layout generation as a sequence generation problem, which imposes a specific ordering on the elements of the layout, making it hard to exert fine-grained control over the attributes of a generated scene. Existing methods provide control through object-level conditioning, or scene completion, where generation can be conditioned on an arbitrary subset of furniture objects. However, attribute-level conditioning, where generation can be conditioned on an arbitrary subset of object attributes, is not supported. We propose COFS, a method to generate furniture layouts that enables fine-grained control through attribute-level conditioning. For example, COFS allows specifying only the scale and type of objects that should be placed in the scene, while the generator chooses their positions and orientations; alternatively, the positions that should be occupied by objects can be specified, and the generator chooses their type, scale, orientation, etc. Our results show both qualitatively and quantitatively that we significantly outperform existing methods on attribute-level conditioning.

1. INTRODUCTION

Automatic generation of realistic assets enables content creation at a scale that is not possible with traditional manual workflows. It is driven by the growing demand for virtual assets in the creative industries, in virtual worlds, and in increasingly data-hungry deep model training. In the context of automatic asset generation, 3D scene and layout generation plays a central role, as much of the demand is for the types of real-world scenes we see and interact with every day, such as building interiors. Deep generative models for assets like images, videos, 3D shapes, and 3D scenes have come a long way toward meeting this demand. In 3D scene and layout modeling in particular, auto-regressive models based on transformers enjoy great success. Inspired by language modeling, these architectures treat layouts as sequences of tokens that are generated one after the other; the tokens typically represent attributes of furniture objects, such as the type, position, or scale of an object. These architectures are particularly well suited for modeling spatial relationships between elements of a layout. For example, Para et al. (2021) generate two-dimensional interior layouts with two transformers, one for furniture objects and one for spatial constraints between these objects, while SceneFormer (Wang et al., 2021) and ATISS (Paschalidou et al., 2021) extend interior layout generation to 3D. A key limitation of a basic autoregressive approach is that it provides only limited control over the generated scene: it enforces a sequential generation order, where new tokens can only be conditioned on previously generated tokens, and it additionally requires a consistent ordering of the token sequence.
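The token-sequence view described above can be made concrete with a minimal sketch. This is a hypothetical illustration, not the COFS or ATISS implementation; the attribute names and the fixed attribute order are assumptions chosen for clarity:

```python
# Hypothetical illustration: a furniture layout flattened into a
# fixed-order token sequence, as consumed by an autoregressive
# sequence model such as a transformer.

# Each object contributes the same attribute tuple in a fixed order
# (an assumed ordering; real systems differ in their exact schema).
ATTRIBUTE_ORDER = ["type", "x", "y", "angle", "scale"]

def flatten_layout(objects):
    """Serialize a list of furniture objects into one token sequence.

    An autoregressive model predicts token t conditioned only on
    tokens 0..t-1, so conditioning on an arbitrary subset of
    attributes (e.g. only the scale of the third object) does not
    fit this factorization without special handling.
    """
    tokens = []
    for obj in objects:
        for attr in ATTRIBUTE_ORDER:
            tokens.append((attr, obj[attr]))
    return tokens

scene = [
    {"type": "table", "x": 1.0, "y": 2.0, "angle": 0.0, "scale": 1.2},
    {"type": "chair", "x": 0.5, "y": 2.5, "angle": 90.0, "scale": 1.0},
]
seq = flatten_layout(scene)  # 2 objects x 5 attributes = 10 tokens
```

The fixed serialization order is exactly what makes fine-grained conditioning awkward: any attribute a user wants to pin down must appear at its predetermined position in the sequence.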
This precludes both object-level conditioning, where generation is conditioned on a partial scene, e.g., an arbitrary subset of furniture objects, and attribute-level conditioning, where generation is conditioned on an arbitrary subset of attributes of the furniture objects, e.g., the class or position of target objects. Most recently, ATISS (Paschalidou et al., 2021) partially alleviates this problem by randomly permuting furniture objects during training, effectively enabling object-level conditioning. However, attribute-level conditioning remains elusive. We aim to improve on these results by enabling attribute-level conditioning in addition to object-level conditioning. For example, a user might want to ask for a room with a table and two chairs without specifying exactly where these objects should be located. Another example is to perform

