SPATIAL REASONING NETWORK FOR ZERO-SHOT CONSTRAINED SCENE GENERATION

Anonymous

Abstract

Constrained scene generation (CSG) produces images satisfying a given set of constraints. Zero-shot CSG produces images satisfying constraints not present in the training set, without retraining. Recent neural models generate images with excellent detail, but largely cannot satisfy constraints, especially in complex scenes involving multiple objects. We attribute this difficulty to the lack of effective approaches combining low-level visual element generation with high-level spatial reasoning. We introduce a Spatial Reasoning Network for constrained scene generation (SPREN). SPREN augments state-of-the-art image generation networks (for low-level visual element generation) with a spatial reasoning module (for high-level spatial reasoning). The spatial reasoning module decides objects' positions following the output of a Recursive Neural Network (RNN), which is trained to learn implicit spatial knowledge, such as that trees grow from the ground. During inference, explicit constraints can be enforced by a forward-checking algorithm, which blocks invalid decisions from the RNN in a zero-shot manner. In experiments, we demonstrate that SPREN is able to generate images with excellent details while satisfying complex spatial constraints, and that it transfers good-quality scene generation to unseen constraints without retraining.

1. INTRODUCTION

Constrained content generation has long been an important task in artificial intelligence and has many applications across domains Nauata et al. (2020); Jiang et al. (2021); Ma et al. (2021). This paper focuses on constrained scene generation: generating a realistic scene image containing multiple objects that satisfy a given set of constraints. While there has been exciting progress in scene generation, especially using deep generative models Deng et al. (2021); Liu et al. (2021); Arad Hudson & Zitnick (2021), generating scenes involving multiple objects satisfying complex spatial relationships remains a challenging task. Existing approaches often cannot generate scenes that contain the right number of objects or the correct spatial relationships between the objects (according to user-defined constraints). We hypothesize this difficulty is due to the lack of effective approaches combining low-level visual element generation with high-level spatial reasoning. Psychophysiological studies Kahneman (2011); Sowden et al. (2015); Lin & Lien (2013) suggest that human beings rely on multiple systems of reasoning and memory (including Systems 1 and 2) to complete complex content generation tasks. Procedural (P) cognition retains the skill to generate the texture and shape of standalone objects. System 1 (S1) cognition captures "common-sense" knowledge and patterns; for example, trees are on the ground, but birds are in the sky. System 2 (S2) cognition embodies reasoning about the high-level task, planning the content of the image and enforcing explicit constraints at an abstract level. Over the years, neural generative models have been very successful in learning tasks associated with P- and S1-cognition, but fail consistently on S2-cognition, especially enforcing complex constraints. Traditional constraint reasoning methods can provide the S2-cognition necessary for our task, but they are too rigid to handle P- and S1-cognition.
We introduce a Spatial Reasoning Network for constrained scene generation (SPREN). The key idea is to augment state-of-the-art neural generative models responsible for low-level visual element generation, or P/S1-cognition, with a spatial reasoning module, which handles high-level spatial reasoning, or S2-cognition. The input to constrained scene generation is a background image and a set of spatial constraints represented in propositional logic, and the output is a generated image containing objects that satisfy the constraints. The spatial reasoning module decides the objects' positions following the output of a Recursive Neural Network (RNN) in a process of iterative refinement. The RNN outputs the bounding boxes (we call them blueprints) for each object to be generated. When determining one coordinate of a bounding box, the RNN iteratively halves the range of the coordinate until it is sufficiently small. During learning, the RNN is trained to capture implicit spatial knowledge, such as that trees grow from the ground and birds fly in the sky. This is done by a teacher-forcing procedure which matches the bounding boxes predicted by the RNN against the ones containing the objects in the original image. During inference, explicit constraints can be enforced by a forward-checking algorithm, which blocks decisions leading to constraint violations. The forward-checking procedure also allows us to handle constraints in a zero-shot manner: at test time, when novel constraints are present, it blocks the output of the RNN following the same procedure used for familiar constraints, without any retraining or fine-tuning. In experiments, we demonstrate that SPREN is able to generate images with excellent details while satisfying complex spatial constraints.
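The interval-halving decoding with forward checking described above can be sketched as follows. This is a minimal illustration under our own naming assumptions (`decode_coordinate`, `choose_half`, and `is_feasible` are hypothetical names, not from any released implementation), with dummy stand-ins for the trained RNN and the constraint checker:

```python
def decode_coordinate(choose_half, lo, hi, is_feasible, eps=1.0):
    """Localize one coordinate by repeatedly halving its feasible interval.

    choose_half(lo, hi) -> 0 or 1 stands in for the trained RNN's decision;
    is_feasible(lo, hi) -> bool stands in for forward checking: it returns
    False if no point in [lo, hi) can satisfy the explicit constraints.
    """
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        halves = [(lo, mid), (mid, hi)]
        feasible = [h for h in halves if is_feasible(*h)]
        if not feasible:        # dead end: no half can satisfy the constraints
            raise ValueError("constraints unsatisfiable in [%g, %g)" % (lo, hi))
        if len(feasible) == 1:  # forward checking blocks the invalid half
            lo, hi = feasible[0]
        else:                   # both halves feasible: follow the RNN's choice
            lo, hi = halves[choose_half(lo, hi)]
    return (lo + hi) / 2.0

# Example: place an object's x-coordinate in [0, 256) subject to the explicit
# constraint "x must be left of 100"; the dummy RNN always prefers the right
# half, but forward checking overrides it whenever that half is infeasible.
x = decode_coordinate(choose_half=lambda lo, hi: 1,
                      lo=0.0, hi=256.0,
                      is_feasible=lambda lo, hi: lo < 100.0)
assert 0.0 <= x < 100.0
```

The blocking step never consults the RNN when only one half is feasible, which is why novel constraints can be enforced at test time without retraining: they only change `is_feasible`, not the learned decision function.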
We also show that SPREN works well for object-aware scene generation, an inpainting task in which additional objects are added to an image that already contains others, subject to constraints involving both the existing and the newly added objects. SPREN also works well in zero-shot transfer learning: it generates good-quality scenes under constraints unseen in the training set, without retraining or fine-tuning. Overall, our contributions are:
• We introduce the SPREN framework for constrained scene generation, combining low-level visual element generation with high-level spatial reasoning.
• Thanks to the spatial reasoning module, objects are positioned to satisfy explicit constraints while fitting well into the visual context of the image.
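The explicit spatial constraints above can be viewed as propositional formulas over blueprint bounding boxes. The sketch below shows one plausible encoding; `Box`, `left_of`, `above`, and `satisfied` are illustrative names of our own, not part of SPREN:

```python
from typing import NamedTuple

class Box(NamedTuple):
    """Axis-aligned bounding box; (x0, y0) is the top-left corner."""
    x0: float
    y0: float
    x1: float
    y1: float

def left_of(a: Box, b: Box) -> bool:
    return a.x1 <= b.x0

def above(a: Box, b: Box) -> bool:
    return a.y1 <= b.y0  # image y grows downward

def satisfied(constraints, boxes) -> bool:
    """Check a conjunction of atomic constraints over a scene blueprint."""
    return all(pred(boxes[i], boxes[j]) for pred, i, j in constraints)

# "bird above deer, and deer left of tree"
boxes = {"bird": Box(40, 10, 80, 40),
         "deer": Box(30, 120, 90, 180),
         "tree": Box(150, 60, 220, 200)}
constraints = [(above, "bird", "deer"), (left_of, "deer", "tree")]
assert satisfied(constraints, boxes)
```

Because each atom is a simple predicate over interval endpoints, feasibility of a partially decided coordinate range can be checked cheaply, which is what the forward-checking step exploits.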



Figure 1: A scene generated by SPREN with a given background and six animals subject to complex positional constraints. The animals and the spatial constraints governing their locations are shown in the upper left panel. The upper center panel shows the background image on which the scene must be placed. The upper right panel shows the high-level pipeline of our approach. The lower left panel shows the locations of these animals generated by SPREN (blueprints). The lower center panel shows a realistic image produced by SPREN that satisfies the constraints. The lower right panel shows a baseline scene for comparison, in which the constraints cannot be properly enforced.

Figure 1 illustrates the generative procedure of SPREN. The colored arrows in the upper left panel represent the spatial constraints restricting the animals to be generated. The upper center panel shows the input background image. The lower left panel shows the blueprint (bounding boxes) output by the spatial reasoning module. The lower center panel shows the final output of SPREN, and the lower right panel shows a comparison with the previous state-of-the-art GLIDE model.

