HIERARCHICAL ABSTRACTION FOR COMBINATORIAL GENERALIZATION IN OBJECT REARRANGEMENT

Abstract

Object rearrangement is a challenge for embodied agents because solving these tasks requires generalizing across a combinatorially large set of configurations of entities and their locations. Worse, the representations of these entities are unknown and must be inferred from sensory percepts. We present a hierarchical abstraction approach to uncover these underlying entities and achieve combinatorial generalization from unstructured visual inputs. By constructing a factorized transition graph over clusters of entity representations inferred from pixels, we show how to learn a correspondence between intervening on states of entities in the agent's model and acting on objects in the environment. We use this correspondence to develop a method for control that generalizes to different numbers and configurations of objects, which outperforms current offline deep RL methods when evaluated on simulated rearrangement tasks.

1. INTRODUCTION

The power of an abstraction depends on its usefulness for solving new problems. Object rearrangement (Batra et al., 2020) offers an intuitive setting for studying the problem of learning reusable abstractions. Solving novel rearrangement problems requires an agent not only to infer object representations without supervision, but also to recognize that the same action for moving an object between two locations can be reused for different objects in different contexts. We study the simplest setting in simulation, with pick-and-move action primitives that move one object at a time. Even this simple setting is challenging because the space of object configurations is combinatorially large, resulting in long-horizon combinatorial task spaces. We formulate rearrangement as an offline goal-conditioned reinforcement learning (RL) problem: the agent is pretrained on an experience buffer of sensorimotor interactions and is evaluated on producing actions that rearrange the objects in the input image to satisfy constraints depicted in a goal image. Offline RL methods (Levine et al., 2020) that do not infer factorized representations of entities struggle to generalize to problems with more objects. Planning with object-centric methods that do infer entities (Veerapaneni et al., 2020) is also difficult, because the difficulties of long-horizon planning with learned parametric models (Janner et al., 2019) are exacerbated in combinatorial spaces. Instead of planning with parametric models, our work takes inspiration from non-parametric planning methods that have shown success in combining neural networks with graph search to generate long-horizon plans.
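To make the non-parametric planning idea concrete, the following is a minimal sketch (not the paper's implementation) of building a transition graph from an experience buffer and planning by graph search. It uses plain breadth-first search where the cited methods would use a learned distance metric to guide expansion; the buffer format `(obs, action, next_obs)` and all function names are illustrative assumptions.

```python
from collections import deque

def build_transition_graph(buffer):
    """Build a graph whose nodes are (discretized) observations and whose
    edges store the action recorded between consecutive observations."""
    graph = {}  # node -> list of (next_node, action)
    for obs, action, next_obs in buffer:
        graph.setdefault(obs, []).append((next_obs, action))
        graph.setdefault(next_obs, [])
    return graph

def plan(graph, start, goal):
    """Breadth-first search over recorded transitions. A learned distance
    metric would prioritize expansion instead; BFS illustrates how path
    segments from offline data are stitched into a new plan."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        node, actions = frontier.popleft()
        if node == goal:
            return actions
        for nxt, act in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, actions + [act]))
    return None  # goal never recorded as reachable in the buffer
```

Note that `plan` can only return action sequences whose every edge was recorded in the buffer; this is exactly the limitation that makes monolithic non-parametric methods ill-suited for combinatorial task spaces.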
These methods (Yang et al., 2020; Zhang et al., 2018; Lippi et al., 2020; Emmons et al., 2020) explicitly construct a transition graph from the experience buffer and plan by searching through the actions recorded in this graph with a learned distance metric. The advantage of such approaches is the ability to stitch together different path segments from offline data to solve new problems. The disadvantage is that their non-parametric nature requires the transitions used to solve new problems to have already been recorded in the buffer, making conventional methods, which store entire observations monolithically, ill-suited for combinatorial generalization. Fig. 2b shows that the same state transition can manifest for different objects and in different contexts, but monolithic non-parametric methods are not constrained to recognize that all these scenarios exhibit the same state transition at an abstract level. This induces a blowup in the number of nodes of the search graph. To overcome this problem, we devise a method that explicitly exploits the similarity among state transitions in different contexts. Our method, Neural Constraint Satisfaction (NCS), marries the strengths of non-parametric planning with those of object-centric representations. Our main contribution is to show that factorizing the traditionally monolithic entity representation into action-invariant features (its type) and action-dependent features (its state) makes it possible during planning and control to reuse action representations for different objects in different contexts, thereby addressing the core combinatorial challenge in object rearrangement. To implement this factorization, NCS constructs a two-level hierarchy (Fig. 1) to abstract the experience buffer into a graph over state transitions of individual entities, separated from other contextual entities (Fig. 3).
To solve new rearrangement problems, NCS infers which state transitions can be taken given the current and goal image observations, re-composes sequences of state transitions from the graph, and translates these transitions into actions. In §3 we introduce a problem formulation that exposes the combinatorial structure of object rearrangement tasks by explicitly modeling the independence, symmetry, and factorization of latent entities. This reveals two challenges in object rearrangement, which we call the correspondence problem and the combinatorial problem. In §4 we present NCS, a method for controlling an agent that plans over and acts with learned entity representations, as a unified approach to both challenges. We show in §5 that NCS outperforms both state-of-the-art offline RL methods and object-centric shooting-based planning methods on simulated rearrangement problems.
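The infer-recompose-translate loop can be sketched as below. Again this is a simplified stand-in for the paper's method, assuming entity states have already been inferred from the current and goal images, that entities can be planned one at a time, and that a `pick_and_move` primitive (a hypothetical callable here) realizes each abstract state transition.

```python
from collections import deque

def plan_states(edges, current_states, goal_states):
    """For each entity whose inferred state differs from its goal state,
    search the entity-agnostic transition graph for a path of abstract
    state transitions. `edges` maps (state, state') to recorded actions."""
    graph = {}
    for (s, t) in edges:
        graph.setdefault(s, []).append(t)
    plans = []
    for cur, goal in zip(current_states, goal_states):
        if cur == goal:
            continue  # entity already satisfies the goal constraint
        frontier, visited = deque([(cur, [])]), {cur}
        while frontier:
            s, path = frontier.popleft()
            if s == goal:
                plans.append(path)
                break
            for t in graph.get(s, []):
                if t not in visited:
                    visited.add(t)
                    frontier.append((t, path + [(s, t)]))
    return plans

def to_actions(plans, pick_and_move):
    """Translate each abstract (state, state') edge into a concrete
    pick-and-move primitive for the environment."""
    return [pick_and_move(s, t) for path in plans for (s, t) in path]
```

Because the search runs over entity-agnostic state transitions, a path recorded while moving one object can be re-composed to move a different object at test time.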

2. RELATED WORK

The problem of discovering re-composable representations is generally motivated by combinatorial task spaces. The traditional approach to enforcing this compositional inductive bias is to compactly represent the task space with MDPs built on human-defined abstractions of entities, such as factored MDPs (Boutilier et al., 1995; 2000; Guestrin et al., 2003a), relational MDPs (Wang et al., 2008; Guestrin et al., 2003b; Gardiol & Kaelbling, 2003), and object-oriented MDPs (Diuk et al., 2008; Abel et al., 2015). Approaches building off of such symbolic abstractions (Chang et al., 2016; Battaglia et al., 2018; Zadaianchuk et al., 2022; Bapst et al., 2019; Zhang et al., 2018) do not address the problem of how such entity abstractions arise from raw data. Our work is one of the first to learn compact representations of combinatorial task spaces directly from raw sensorimotor data. Recent object-centric methods (Greff et al., 2017; Van Steenkiste et al., 2018; Greff et al., 2019; 2020; Locatello et al., 2020a; Kipf et al., 2021; Zoran et al., 2021; Singh et al., 2021) do learn entity representations, as well as their transformations (Goyal et al., 2021; 2020), from sensorimotor data, but only do so for modeling images and video, rather than for taking actions. Instead, we study how well entity representations can be reused for solving tasks. Kulkarni et al. (2019) consider how object representations improve exploration, but we consider the offline setting, which requires zero-shot generalization. Veerapaneni et al. (2020) also consider control tasks, but their shooting-based planning method suffers from compounding errors, as other learned single-step models do (Janner et al., 2019), while our hierarchical non-parametric approach enables us to plan for longer horizons.
Non-parametric approaches have recently become popular for long-horizon planning (Yang et al., 2020; Zhang et al., 2018; Lippi et al., 2020; Emmons et al., 2020; Zhang et al., 2021), but the drawback of these approaches is that they represent entire scenes monolithically, which causes a blowup in the number of nodes in combinatorial task spaces, making it infeasible to apply these methods to rearrangement tasks that require generalizing to novel object configurations with different numbers of objects. Similar to our work, Huang et al. (2019) also tackle rearrangement problems by searching over a constructed latent task graph, but they require a demonstration at deployment time, whereas NCS does not because it reuses context-agnostic state transitions constructed during training. Zhang et al. (2021) conduct non-parametric planning directly on abstract subgoals rather than object-centric states; while similar in spirit, the downside of using subgoals rather than abstract states is that subgoals do not represent equivalence classes of states and therefore cannot generalize to new states at test time. Our method, NCS, captures both reachability between known states and new, unseen states that map to the same abstract state.



Figure 1: NCS uses a two-level hierarchy to abstract sensorimotor interactions into a graph of learned state transitions. The first level groups visual features from sensorimotor interaction to produce transitions between sets of entities; the second level groups entities that share the same state transition to produce a graph over entity-agnostic state transitions. The affected entity is in black.


