POLICY ARCHITECTURES FOR COMPOSITIONAL GENERALIZATION IN CONTROL

Abstract

Several tasks in control, robotics, and planning can be specified through desired goal configurations for entities in the environment. Learning goal-conditioned policies is a natural paradigm to solve such tasks. However, learning and generalizing on complex tasks can be challenging due to variations in the number of entities or compositions of goals. To address this challenge, we introduce the Entity-Factored Markov Decision Process (EFMDP), a formal framework for modeling the entity-based compositional structure in control tasks. Geometric properties of the EFMDP framework provide theoretical motivation for policy architecture design, particularly Deep Sets and popular relational mechanisms such as graphs and self-attention. These structured policy architectures are flexible and can be trained end-to-end with standard reinforcement and imitation learning algorithms. We study and compare the learning and generalization properties of these architectures on a suite of simulated robot manipulation tasks, finding that they achieve significantly higher success rates with less data compared to standard multilayer perceptrons. Structured policies also enable broader and more compositional generalization, producing policies that extrapolate to different numbers of entities than seen in training and stitch together (i.e., compose) learned skills in novel ways. Video results can be found at https://sites.google.

1. INTRODUCTION

Goal specification is a powerful abstraction for training and deploying AI agents (Kaelbling, 1993). For instance, object reconfiguration tasks (Batra et al., 2020), like loading plates in a dishwasher or arranging pieces on a chess board, can be described through spatial and semantic goals for various objects. In addition, the goal for a scene can be described through compositions of goals for individual entities in it. Through this work, we introduce a new framework for modeling tasks with such entity-centric compositional structure, applicable to domains like robotic manipulation, multi-agent systems, and strategic game-playing. Subsequently, we study policy architectures that can utilize structural properties unique to our framework for goal-conditioned reinforcement and imitation learning. Through experiments in simulated robot manipulation tasks, we find that our policy architectures exhibit significantly improved learning efficiency and generalization performance compared to standard multi-layer perceptrons (MLPs), as previewed in Figure 1. More importantly, our architectures are capable of learning near-optimal policies in complex tabletop manipulation tasks where MLP baselines completely fail.

Consider the motivating task of arranging pieces on a chess board using a robot arm. A naive specification would provide goal locations for all 32 pieces simultaneously. However, we can immediately recognize that the task is a composition of 32 sub-goals involving the rearrangement of individual pieces. This understanding of compositional structure can allow us to focus on one object at a time, dramatically reducing the size of the effective state space and helping combat the curse of dimensionality that plagues RL (Sutton and Barto, 1998; Bertsekas and Tsitsiklis, 1996). Moreover, such a compositional understanding would make an agent invariant to the number of objects, enabling generalization to fewer or more objects.
Most importantly, it can enable reusing shared skills like pick-and-place, enhancing learning efficiency. We finally note that a successful policy cannot completely decouple the sub-tasks. For example, if a piece must be moved to a square currently occupied by another piece, the occupying piece must be moved first.

The generic Markov Decision Process (MDP) framework, as well as policy architectures based on MLPs, lacks the aforementioned compositional properties. To overcome this limitation, we turn to the general field of "geometric deep learning" (Bronstein et al., 2021), which is concerned with the study of structures, symmetries, and invariances exhibited by function classes. We first introduce the Entity-Factored MDP (EFMDP), a subclass of the generic MDP, as a formal model for decision making in environments with multiple entities (e.g. objects). We then characterize the geometric properties of the EFMDP relative to the generic MDP. We subsequently show how set-based invariant architectures like Deep Sets (Zaheer et al., 2017) and relational architectures like Self-Attention (Vaswani et al., 2017) and Graph Convolution (Kipf and Welling, 2016) are well suited to leverage the geometric properties of the EFMDP. Through experiments, we demonstrate that policies and critics parameterized by these architectures can be trained to solve complex tasks using standard RL and IL algorithms, without assuming access to any options or action primitives.

Our Contributions. This paper is organized into sections that present our three main contributions:

1. We develop the Entity-Factored MDP (EFMDP) framework, a formal model for decision making in tasks comprising multiple entities (e.g. objects), and characterize its geometric properties.

2. We show how policies and critics parameterized by set-based invariance models (e.g. Deep Sets) or relational models (e.g. Self-Attention and Graph Convolution) can leverage the geometric properties of the EFMDP.

3. We empirically evaluate these structured architectures on a suite of simulated robot manipulation tasks (Figure 4), and find that they generalize more broadly while also learning more efficiently. Compared to MLPs, structured policies improve success rates by more than 50× on extrapolation tests, which vary the number of entities in the environment, and by 10× on stitching tests, which require recombining learned skills in novel ways to solve new unseen tasks.
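To make the set-based architecture concrete, the following is a minimal sketch (not the paper's exact implementation) of a Deep Sets policy in PyTorch: each (agent, entity, subgoal) triple is encoded by a shared network φ, the encodings are sum-pooled into a permutation-invariant representation, and a decoder ρ maps that representation to an action. All layer sizes and the input layout are assumptions for illustration.

```python
import torch
import torch.nn as nn


class DeepSetPolicy(nn.Module):
    """Permutation-invariant policy: pi(u, {(e_i, g_i)}) = rho(sum_i phi(u, e_i, g_i))."""

    def __init__(self, agent_dim, entity_dim, goal_dim, action_dim, hidden=64):
        super().__init__()
        # phi: shared per-entity encoder applied to each (agent, entity, subgoal) triple
        self.phi = nn.Sequential(
            nn.Linear(agent_dim + entity_dim + goal_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        # rho: decoder applied to the pooled set representation
        self.rho = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, u, entities, goals):
        # u: (B, agent_dim); entities: (B, N, entity_dim); goals: (B, N, goal_dim)
        N = entities.shape[1]
        u_rep = u.unsqueeze(1).expand(-1, N, -1)
        per_entity = self.phi(torch.cat([u_rep, entities, goals], dim=-1))
        pooled = per_entity.sum(dim=1)  # sum pooling makes the output order-invariant
        return self.rho(pooled)
```

Because pooling is a sum over entity encodings, the same weights can be applied to any number of entities N, which is what enables the extrapolation tests described above.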

2. PROBLEM FORMULATION AND ARCHITECTURES

In this section, we first formalize our problem setup by introducing the entity-factored MDP (EFMDP). This setting is capable of modeling many applications, including tabletop manipulation, scene reconfiguration, and multi-agent learning. Subsequently, we also introduce policy architectures that can enable efficient learning and generalization by utilizing the EFMDP's unique structural properties.

2.1. PROBLEM SETUP

We study a learning paradigm where the agent can interact with many entities in an environment. The task for the agent is specified in the form of goals for some subset of entities (including the agent). We formalize this learning setup with the Entity-Factored Markov Decision Process (EFMDP).

Definition 1 (Entity-Factored MDP). An EFMDP with $N$ entities is described through the tuple $\mathcal{M}_E := \langle \mathcal{U}, \mathcal{E}, \mathcal{G}, \mathcal{A}, P, R, \gamma \rangle$. Here $\mathcal{U}$ and $\mathcal{E}$ are the agent and entity state spaces, $\mathcal{G}$ is the subgoal space, and $\mathcal{A}$ is the agent's action space. The overall state space $\mathcal{S} := \mathcal{U} \times \mathcal{E}^N$ has elements $s = (u, e_1, \dots, e_N)$, and the overall goal space $\mathcal{G}^N$ has elements $g = (g_1, \dots, g_N)$. The reward and dynamics are described by:

$$R(s, g) := \bar{R}\left(\{r(e_i, g_i, u)\}_{i=1}^{N}\right) \quad (1)$$
$$P(s' \mid s, a) := P\left(u', \{e'_i\}_{i=1}^{N} \,\middle|\, u, \{e_i\}_{i=1}^{N}, a\right)$$

for $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$, and $g \in \mathcal{G}^N$, where $r$ is a per-entity subgoal reward and $\bar{R}$ aggregates the per-entity rewards.
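The factored reward in Equation (1) can be illustrated with a small sketch. Here the per-entity reward r and the aggregator (a mean over entities) are illustrative assumptions, not the paper's specification; the point is only that the overall reward depends on the per-entity rewards as a set, so it is invariant to reordering the entities together with their subgoals.

```python
import numpy as np


def entity_reward(e_i, g_i, u, tol=0.05):
    # Hypothetical per-entity reward r(e_i, g_i, u):
    # 1.0 if entity i is within `tol` of its subgoal, else 0.0.
    return float(np.linalg.norm(e_i - g_i) < tol)


def efmdp_reward(u, entities, goals, tol=0.05):
    # R(s, g) = R_bar({r(e_i, g_i, u)}): here R_bar is the mean,
    # a permutation-invariant aggregator over per-entity rewards.
    per_entity = [entity_reward(e, g, u, tol) for e, g in zip(entities, goals)]
    return sum(per_entity) / len(per_entity)
```

Permuting `entities` and `goals` with the same permutation leaves `efmdp_reward` unchanged, which is the symmetry the structured architectures in the next section are designed to exploit.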



Figure 1. A family of tasks where the agent is trained to re-arrange three cubes (top-left), but tested zero-shot to re-arrange more cubes (bottom-left). RL with standard MLPs fails to even learn the 3-cube task. Our self-attention policy, on the other hand, successfully learns and extrapolates.
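A self-attention policy of the kind referenced in the caption can be sketched as follows. This is an illustrative minimal version, not the paper's exact architecture: each (agent, entity, subgoal) triple is embedded as a token, the tokens exchange information through self-attention (without positional encodings, so the layer is permutation-equivariant), and mean pooling produces a permutation-invariant action. Embedding size and head count are assumptions.

```python
import torch
import torch.nn as nn


class SelfAttentionPolicy(nn.Module):
    """Relational policy: entity tokens attend to each other, then pool to an action."""

    def __init__(self, agent_dim, entity_dim, goal_dim, action_dim, embed=64, heads=4):
        super().__init__()
        self.token = nn.Linear(agent_dim + entity_dim + goal_dim, embed)
        self.attn = nn.MultiheadAttention(embed, heads, batch_first=True)
        self.out = nn.Linear(embed, action_dim)

    def forward(self, u, entities, goals):
        # u: (B, agent_dim); entities: (B, N, entity_dim); goals: (B, N, goal_dim)
        N = entities.shape[1]
        u_rep = u.unsqueeze(1).expand(-1, N, -1)
        tokens = self.token(torch.cat([u_rep, entities, goals], dim=-1))  # (B, N, embed)
        # No positional encodings: attention treats the tokens as a set.
        attended, _ = self.attn(tokens, tokens, tokens)
        pooled = attended.mean(dim=1)  # permutation-invariant pooling
        return self.out(pooled)
```

Unlike the Deep Sets variant, attention lets each entity's representation depend on the other entities, which is useful when sub-tasks interact (e.g. a destination square that is currently occupied).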

