SPATIAL REASONING NETWORK FOR ZERO-SHOT CONSTRAINED SCENE GENERATION

Anonymous

Abstract

Constrained scene generation (CSG) generates images satisfying a given set of constraints. Zero-shot CSG generates images satisfying constraints not present in the training set, without retraining. Recent neural models generate images with excellent detail but largely cannot satisfy constraints, especially in complex scenes involving multiple objects. We attribute this difficulty to the lack of effective approaches combining low-level visual element generation with high-level spatial reasoning. We introduce a Spatial Reasoning Network for constrained scene generation (SPREN). SPREN augments state-of-the-art image generation networks (for low-level visual element generation) with a spatial reasoning module (for high-level spatial reasoning). The spatial reasoning module decides objects' positions following the output of a Recursive Neural Network (RNN), which is trained to learn implicit spatial knowledge, such as the fact that trees grow from the ground. During inference, explicit constraints can be enforced by a forward-checking algorithm, which blocks invalid decisions from the RNN in a zero-shot manner. In experiments, we demonstrate that SPREN generates images with excellent detail while satisfying complex spatial constraints, and that it transfers high-quality scene generation to unseen constraints without retraining.

1. INTRODUCTION

Constrained content generation has long been an important task in artificial intelligence, with applications across many domains (Nauata et al., 2020; Jiang et al., 2021; Ma et al., 2021). This paper focuses on constrained scene generation: generating a realistic scene image containing multiple objects that satisfy a given set of constraints. While there has been exciting progress in scene generation, especially using deep generative models (Deng et al., 2021; Liu et al., 2021; Arad Hudson & Zitnick, 2021), generating scenes involving multiple objects that satisfy complex spatial relationships remains challenging. Existing approaches often cannot generate scenes containing the right number of objects or the correct spatial relationships between objects (according to user-defined constraints). We hypothesize that this difficulty is due to the lack of effective approaches combining low-level visual element generation with high-level spatial reasoning. Psychophysiological studies (Kahneman, 2011; Sowden et al., 2015; Lin & Lien, 2013) suggest that human beings require multiple systems of reasoning and memory, including Systems 1 and 2, to complete complex content generation tasks. Procedural (P) cognition retains the skill to generate the texture and shape of standalone objects. System 1 (S1) cognition captures common-sense knowledge and patterns; for example, trees are on the ground, but birds are in the sky. System 2 (S2) cognition embodies reasoning about the high-level task, planning the content of the image and enforcing explicit constraints at an abstract level. Over the years, neural generative models have been very successful at learning tasks associated with P- and S1-cognition, but consistently fail at S2-cognition, especially enforcing complex constraints. Traditional constraint reasoning methods can provide the S2-cognition necessary for our task, but they are too rigid to handle P- and S1-cognition.
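To make the constraint-reasoning side of this picture concrete, the forward-checking idea mentioned above can be sketched as pruning candidate object placements that cannot satisfy any remaining constraint. The function names, constraint encoding, and coordinates below are illustrative assumptions for exposition, not SPREN's actual implementation:

```python
def forward_check(domains, constraints):
    """Prune candidate positions inconsistent with binary spatial constraints.

    domains: dict mapping object name -> set of candidate (x, y) positions.
    constraints: list of (obj_a, obj_b, predicate), where predicate(pa, pb)
                 returns True iff the pair of positions is consistent.
    Repeats until a fixed point, since pruning one domain may enable
    further pruning in another.
    """
    changed = True
    while changed:
        changed = False
        for a, b, pred in constraints:
            # Keep only positions of `a` supported by at least one
            # remaining position of `b`, and vice versa.
            new_a = {pa for pa in domains[a]
                     if any(pred(pa, pb) for pb in domains[b])}
            new_b = {pb for pb in domains[b]
                     if any(pred(pa, pb) for pa in domains[a])}
            if new_a != domains[a] or new_b != domains[b]:
                domains[a], domains[b] = new_a, new_b
                changed = True
    return domains


# Example constraint: "bird above tree" (smaller y = higher in image coords).
above = lambda pa, pb: pa[1] < pb[1]

domains = {
    "bird": {(10, 5), (10, 60)},   # one candidate in the sky, one low
    "tree": {(40, 50)},
}
pruned = forward_check(domains, [("bird", "tree", above)])
# The low bird candidate (10, 60) is pruned: it is not above the tree.
```

In SPREN's setting, the surviving candidates would act as the set of valid decisions from which the RNN's output is allowed to be selected, which is how explicit constraints can be enforced at inference time without retraining.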
We introduce a Spatial Reasoning Network for constrained scene generation (SPREN). The key idea is to augment state-of-the-art neural generative models, which handle low-level visual element generation (P/S1 cognition), with a spatial reasoning module that handles high-level spatial reasoning (S2 cognition). The input of constrained scene generation is a background image and

