SPATIAL REASONING AS OBJECT GRAPH ENERGY MINIMIZATION

Anonymous

Abstract

We propose a model that maps spatial rearrangement instructions to goal scene configurations via gradient descent on a set of relational energy functions over object 2D overhead locations, one per spatial predicate in the instruction. Energy-based models (EBMs) over object locations are trained from a handful of examples of object arrangements annotated with the corresponding spatial predicates. Predicates can be binary (e.g., left, right) or multi-ary (e.g., circles, lines). A language parser maps language instructions to the corresponding set of EBMs, and a visual-language model grounds their arguments on relevant objects in the visual scene. Energy minimization on the joint set of energies iteratively updates the object locations to generate the goal configuration. Low-level policies then relocate objects to the inferred goal locations. Our framework shows several forms of strong generalization: (i) joint energy minimization handles complex predicate compositions zero-shot, even though each EBM is trained only on single-predicate instructions, (ii) the model can execute instructions zero-shot, without the need for paired instruction-action training data, and (iii) instructions can mention novel objects and attributes at test time, thanks to pre-training of the visual-language grounding model on large-scale passive datasets. We test the model on established instruction-guided manipulation benchmarks, as well as on a benchmark of compositional instructions we introduce in this work. We show large improvements over state-of-the-art end-to-end language-to-action policies and over planning with large language models, especially for long instructions and multi-ary spatial concepts.
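To make the core mechanism concrete, the joint energy minimization described above can be sketched as gradient descent on a sum of per-predicate energies over object 2D locations. This is an illustrative toy, not the paper's trained EBMs: the hinge-style `left` energy, the margin, and the step size below are all assumptions chosen for clarity, whereas the paper learns these energy functions from annotated examples.

```python
import numpy as np

def left_energy(pos, a, b, margin=0.1):
    """Low when object a lies left of object b by at least `margin` (x-axis)."""
    return max(0.0, pos[a, 0] - pos[b, 0] + margin) ** 2

def left_grad(pos, a, b, margin=0.1):
    """Analytic gradient of left_energy w.r.t. all object positions."""
    g = np.zeros_like(pos)
    d = pos[a, 0] - pos[b, 0] + margin
    if d > 0.0:
        g[a, 0] += 2.0 * d
        g[b, 0] -= 2.0 * d
    return g

# Three objects with 2D overhead locations; a composed instruction grounds to
# two binary predicates: left(0, 1) and left(1, 2).
pos = np.array([[0.5, 0.5], [0.2, 0.5], [0.8, 0.5]])
predicates = [(0, 1), (1, 2)]

# Joint minimization: gradient descent on the sum of per-predicate energies.
for _ in range(300):
    grad = sum(left_grad(pos, a, b) for a, b in predicates)
    pos = pos - 0.05 * grad

total = sum(left_energy(pos, a, b) for a, b in predicates)
# At convergence both predicates hold jointly: x0 < x1 < x2.
```

Each EBM only scores its own predicate, yet descending on the summed energy satisfies the composition without ever training on composed instructions, which is the zero-shot composition property claimed in (i). The resulting `pos` would then be handed to low-level policies as goal locations.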

1. INTRODUCTION

Rearranging objects into semantically meaningful configurations has many practical applications for domestic and warehouse robots (Cakmak & Takayama, 2013). In this work, we focus on the problem of semantic rearrangement depicted in Figure 1. The input is a visual scene and a language instruction; the robot is tasked with rearranging the objects into the instructed configuration.

End-to-end language to action mapping

Many works in semantic scene rearrangement have made progress by mapping language instructions directly to actions (Shridhar et al., 2021; Liu et al., 2021c; Janner et al., 2018; Bisk et al., 2017) or to object locations (Mees et al., 2020; Gubbi et al., 2020; Stengel-Eskin et al., 2022). Many recent end-to-end approaches of this kind use transformer architectures to fuse visual, language, and action streams (Shridhar et al., 2022; Liu et al., 2021c; Pashevich et al., 2021). Despite their generality and impressive results within the training distribution, these methods typically do not generalize to different task configurations, such as longer instructions, new objects in the scene, novel backgrounds, or combinations of the above (Liu et al., 2021c; Shridhar et al., 2021). Furthermore, because they do not explicitly model the goal scene configuration to be achieved, these methods cannot easily determine when the task has been completed and they should terminate (Shridhar et al., 2021; Liang et al., 2022).

Symbolic planners and learning to plan

To address the challenges of reactively mapping language to actions, some methods use look-ahead search to infer a sequence of actions or object rearrangements that eventually achieves the goal implied by the instruction (Prabhudesai et al., 2019). Look-ahead search for scene rearrangement requires dynamics models that can handle complex

