POLICY ARCHITECTURES FOR COMPOSITIONAL GENERALIZATION IN CONTROL

Abstract

Several tasks in control, robotics, and planning can be specified through desired goal configurations for entities in the environment. Learning goal-conditioned policies is a natural paradigm for solving such tasks. However, learning and generalizing on complex tasks can be challenging due to variations in the number of entities or compositions of goals. To address this challenge, we introduce the Entity-Factored Markov Decision Process (EFMDP), a formal framework for modeling the entity-based compositional structure in control tasks. Geometric properties of the EFMDP framework provide theoretical motivation for policy architecture design, particularly Deep Sets and popular relational mechanisms such as graphs and self-attention. These structured policy architectures are flexible and can be trained end-to-end with standard reinforcement and imitation learning algorithms. We study and compare the learning and generalization properties of these architectures on a suite of simulated robot manipulation tasks, finding that they achieve significantly higher success rates with less data than standard multilayer perceptrons. Structured policies also enable broader and more compositional generalization, yielding policies that extrapolate to different numbers of entities than seen in training and stitch together (i.e., compose) learned skills in novel ways. Video results can be found at https://sites.google.

1. INTRODUCTION

Goal specification is a powerful abstraction for training and deploying AI agents (Kaelbling, 1993). For instance, object reconfiguration (Batra et al., 2020) tasks, like loading plates in a dishwasher or arranging pieces on a chess board, can be described through spatial and semantic goals for various objects. In addition, the goal for a scene can be described through compositions of goals for the individual entities in it. In this work, we introduce a new framework for modeling tasks with such entity-centric compositional structure, applicable to domains like robotic manipulation, multi-agent systems, and strategic game-playing. Subsequently, we study policy architectures that can exploit structural properties unique to our framework for goal-conditioned reinforcement and imitation learning. Through experiments in simulated robot manipulation tasks, we find that our policy architectures exhibit significantly improved learning efficiency and generalization compared to standard multi-layer perceptrons (MLPs), as previewed in Figure 1. More importantly, our architectures are capable of learning near-optimal policies in complex tabletop manipulation tasks where MLP baselines completely fail.

Consider the motivating task of arranging pieces on a chess board using a robot arm. A naive specification would provide goal locations for all 32 pieces simultaneously. However, we can



Figure 1. A family of tasks where the agent is trained to re-arrange three cubes (top-left), but tested zero-shot to re-arrange more cubes (bottom-left). RL with standard MLPs fails to even learn the 3-cube task. Our self-attention policy, on the other hand, successfully learns and extrapolates.
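To make the entity-factored policy idea concrete, the following is a minimal NumPy sketch of a Deep Sets policy (not the paper's implementation): each entity is encoded independently by a shared network, the encodings are sum-pooled into an order-invariant summary, and a decoder maps the summary to an action. All class, function, and dimension names here are illustrative assumptions.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    # Two-layer MLP with ReLU, applied row-wise if x is a matrix.
    h = np.maximum(0, x @ w1 + b1)
    return h @ w2 + b2

class DeepSetsPolicy:
    """Illustrative permutation-invariant policy: a shared encoder maps each
    entity feature vector to a latent code, sum pooling aggregates the codes,
    and a decoder produces the action. Because the sum ignores entity order
    and accepts any number of rows, the same weights handle permuted inputs
    and varying entity counts."""

    def __init__(self, entity_dim, hidden, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Shared per-entity encoder: entity_dim -> hidden -> hidden.
        self.enc = (rng.normal(size=(entity_dim, hidden)) * 0.1, np.zeros(hidden),
                    rng.normal(size=(hidden, hidden)) * 0.1, np.zeros(hidden))
        # Decoder on the pooled summary: hidden -> hidden -> action_dim.
        self.dec = (rng.normal(size=(hidden, hidden)) * 0.1, np.zeros(hidden),
                    rng.normal(size=(hidden, action_dim)) * 0.1, np.zeros(action_dim))

    def act(self, entities):
        # entities: (num_entities, entity_dim); pooling over axis 0 makes the
        # output independent of entity ordering and count.
        pooled = mlp(entities, *self.enc).sum(axis=0)
        return mlp(pooled, *self.dec)
```

Feeding the same entities in a different order yields an identical action, and the policy evaluated on five entities after being built for three still returns a valid action vector, which is the property the extrapolation experiments in Figure 1 rely on. A self-attention policy would replace the sum pooling with attention layers over the entity encodings.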

