SPATIAL REASONING AS OBJECT GRAPH ENERGY MINIMIZATION

Anonymous

Abstract

We propose a model that maps spatial rearrangement instructions to goal scene configurations via gradient descent on a set of relational energy functions over objects' 2D overhead locations, one per spatial predicate in the instruction. Energy-based models (EBMs) over object locations are trained from a handful of examples of object arrangements annotated with the corresponding spatial predicates. Predicates can be binary (e.g., left, right) or multi-ary (e.g., circles, lines). A language parser maps language instructions to the corresponding set of EBMs, and a visual-language model grounds their arguments to relevant objects in the visual scene. Energy minimization over the joint set of energies iteratively updates the object locations to generate the goal configuration. Low-level policies then relocate objects to the inferred goal locations. Our framework shows many forms of strong generalization: (i) joint energy minimization handles zero-shot complex predicate compositions even though each EBM is trained only on single-predicate instructions, (ii) the model can execute instructions zero-shot, without any paired instruction-action training, (iii) instructions can mention novel objects and attributes at test time, thanks to the pre-training of the visual-language grounding model on large-scale passive datasets. We test the model on established instruction-guided manipulation benchmarks, as well as a benchmark of compositional instructions we introduce in this work. We show large improvements over state-of-the-art end-to-end language-to-action policies and planning with large language models, especially for long instructions and multi-ary spatial concepts.
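The core mechanism described above, summing per-predicate energy functions over objects' 2D overhead locations and running gradient descent on the total energy, can be illustrated with a minimal sketch. The analytic energies below (a hinge for "left of", a variance penalty for "in a line") and the finite-difference minimizer are illustrative stand-ins, not the learned EBMs or the optimizer used in the paper.

```python
import numpy as np

def energy_left(locs, a, b):
    # Low energy when object a lies left of object b (x_a < x_b - margin).
    return max(0.0, locs[a][0] - locs[b][0] + 0.1) ** 2

def energy_line(locs, idxs):
    # Low energy when the listed objects share the same y coordinate.
    ys = np.array([locs[i][1] for i in idxs])
    return float(np.sum((ys - ys.mean()) ** 2))

def total_energy(locs, constraints):
    # Joint energy: sum of one energy term per predicate in the instruction.
    return sum(e(locs, *args) for e, args in constraints)

def minimize(locs, constraints, steps=500, lr=0.1, eps=1e-4):
    locs = locs.copy()
    for _ in range(steps):
        grad = np.zeros_like(locs)
        # Central finite-difference gradient of the summed energies
        # with respect to every object's (x, y) location.
        for i in range(locs.shape[0]):
            for d in range(2):
                bump = np.zeros_like(locs)
                bump[i, d] = eps
                grad[i, d] = (total_energy(locs + bump, constraints)
                              - total_energy(locs - bump, constraints)) / (2 * eps)
        locs -= lr * grad
    return locs

# Instruction: "put object 0 left of object 1, and line objects 0-2 up".
locs = np.array([[0.5, 0.2], [0.1, 0.8], [0.9, 0.5]])
constraints = [(energy_left, (0, 1)), (energy_line, ([0, 1, 2],))]
goal = minimize(locs, constraints)
```

Because the predicate energies are simply summed, compositions never seen during training (e.g., "left of" combined with "in a line") are handled by the same minimization loop, which is the zero-shot compositionality property claimed in the abstract.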

1. INTRODUCTION

Rearranging objects into semantically meaningful configurations has many practical applications for domestic and warehouse robots (Cakmak & Takayama, 2013). In this work, we focus on the problem of semantic rearrangement depicted in Figure 1. The input is a visual scene and a language instruction. The robot is tasked with rearranging the objects into their instructed configurations.

End-to-end language to action mapping

Many works in semantic scene rearrangement have made progress by mapping language instructions directly to actions (Shridhar et al., 2021; Liu et al., 2021c; Janner et al., 2018; Bisk et al., 2017) or to object locations (Mees et al., 2020; Gubbi et al., 2020; Stengel-Eskin et al., 2022). Many recent end-to-end language-to-action or language-to-object-location approaches use transformer architectures to fuse visual, language, and action streams (Shridhar et al., 2022; Liu et al., 2021c; Pashevich et al., 2021). Despite their generality and their impressive results within the training distribution, these methods typically do not generalize to different task configurations, for example, longer instructions, new objects present in the scene, novel backgrounds, or combinations of the above (Liu et al., 2021c; Shridhar et al., 2021). Furthermore, these methods cannot easily determine when the task has been completed and they should terminate (Shridhar et al., 2021; Liang et al., 2022), since they do not explicitly model the goal scene configuration to be achieved.

Symbolic planners and learning to plan

To handle the challenges of reactive mapping of language to actions, some methods use look-ahead search to infer a sequence of actions or object rearrangements that eventually achieves the goal implied by the instruction (Prabhudesai et al., 2019). Look-ahead search for scene rearrangement requires dynamics models that can handle complex scenes (et al., 2021; Hamrick et al., 2020). As a result, planning for scene rearrangement is dominated by symbolic planners, such as Planning Domain Definition Language (PDDL) planners (Migimatsu & Bohg, 2019; Kaelbling & Lozano-Pérez, 2011; Toussaint, 2015; Lyu et al., 2018). Symbolic planners assume that each state of the world, the final goal states, and intermediate subgoal states can be sufficiently represented in logical form, using language predicates that describe object spatial relations. These methods rely on manually specified symbolic transition rules and planning domains. They use object detectors for state estimation (Kase et al., 2020) and interface with low-level controllers for object manipulation. Many recent methods use learning to guide the combinatorial symbolic search by directly predicting semantic subgoals, conditioned on the instruction and the visual scene (Xu et al., 2019; Zhu et al., 2020; Drieß et al., 2020). The symbolic state bottleneck used in these methods usually limits them to modelling pairwise spatial predicates. Spatial configurations that involve multi-ary relations between objects, such as circles, lines, and squares, are not easy to achieve, because the long sequence of actions needed to put together such a shape configuration is outsourced to the low-level policy, since intermediate states cannot be easily described in symbolic form (Figure 1). Over such a long action horizon, the low-level policy often fails.
Large language models for spatial planning

Recent methods aim to capitalize on the knowledge of large language models (LLMs) (Brown et al., 2020; Liu et al., 2021b) for spatial planning (Huang et al., 2022a) and instruction following (Huang et al., 2022b; Ahn et al., 2022). Huang et al. (2022a) showed that LLMs can predict relevant language subgoals directly from the instruction and a symbolic scene description when conditioned on appropriate prompts, without any additional training (Huang et al., 2022b; Ahn et al., 2022; Huang et al., 2022a). The scene description is provided by an oracle (Huang et al., 2022b) or by object detectors (Huang et al., 2022b; Chen et al., 2022) as a list of objects. Predicted language subgoals, e.g., "pick up the can", are directly fed to low-level policies or engineered controllers (Huang et al., 2022b). Liang et al. (2022) build upon LLMs for code generation and predict actions in the form of policy programs, capitalizing on the LLMs' knowledge of code functions necessary for semantic parsing of diverse language instructions. As the input programmatic scene description, they use lists of objects alongside their spatial box coordinates, which previous works omit. The language model predicts actions in terms of low-level program functions that interface with low-level manipulation skills and object detection routines. The hope is that, with appropriate prompting, the LLM will draw from its vast implicit knowledge of code functions associated with natural language, provided in the comments of programs it has seen, to generate programmatic code that solves novel task instances of scene rearrangement. As highlighted in both (Liang et al., 2022; Huang et al., 2022b), prompting dramatically affects the model's performance and its ability to predict reasonable subgoals and programmatic code, respectively.
In this paper, we question whether the language space is the most efficient means to reason about objects and their spatial arrangements.

Spatial reasoning as graph energy minimization

In this paper, we introduce Spatial Planning as multi Graph Energy Minimization (SPGEM), a framework for spatial reasoning for instruction following via compositional energy minimization over sub-graphs of object entities, one per spatial predicate in the instruction, as shown in Figure 1. We represent each spatial "predicate" as a binary or



Figure 1: Scene rearrangement as object graph energy minimization.

