AUTONOMOUS LEARNING OF OBJECT-CENTRIC ABSTRACTIONS FOR HIGH-LEVEL PLANNING

Anonymous

Abstract

We propose a method for autonomously learning an object-centric representation of a continuous and high-dimensional environment that is suitable for planning. Such representations can immediately be transferred between tasks that share the same types of objects, resulting in agents that require fewer samples to learn a model of a new task. We first demonstrate our approach on a simple domain where the agent learns a compact, lifted representation that generalises across objects. We then apply it to a series of Minecraft tasks to learn object-centric representations, including object types, learned directly from pixel data, that can be leveraged to solve new tasks quickly. The resulting learned representations enable the use of a task-level planner, resulting in an agent capable of forming complex, long-term plans with considerably fewer environment interactions.¹

1. INTRODUCTION

Model-based methods are a promising approach to improving sample efficiency in reinforcement learning. However, they require the agent either to learn a highly detailed model, which is infeasible for sufficiently complex problems (Ho et al., 2019), or to build a compact, high-level model that abstracts away unimportant details while retaining only the information required to plan. This raises the question of how best to build such an abstract model. Fortunately, recent work has shown how to learn an abstraction of a task that is provably suitable for planning with a given set of skills (Konidaris et al., 2018). However, these representations are highly task-specific and must be relearned for any new task, or even for any small change to an existing task. This makes them impractical, especially for agents that must solve multiple complex tasks. We extend these methods by incorporating additional structure: the world consists of objects, and similar objects are common across tasks. This structure can substantially improve learning efficiency, because an object-centric model can be reused wherever the same object appears (within the same task, or across different tasks), and can also be generalised across objects that behave similarly, which we refer to as object types. We assume that the agent is able to individuate the objects in its environment, and propose a framework for building portable object-centric abstractions given only the data collected by executing high-level skills. These abstractions specify both the abstract object attributes that support high-level planning and an object-relative, lifted transition model that can be instantiated in a new task. This reduces the number of samples required to learn a new task by allowing the agent to avoid relearning the dynamics of previously seen object types.
We make the following contributions: under the assumption that the agent can individuate objects in its environment, we develop a framework for building portable, object-centric abstractions, and for estimating object types, given only the data collected by executing high-level skills. We also show how to integrate problem-specific information to instantiate these representations in a new task. This reduces the samples required to learn a new task by allowing the agent to avoid relearning the dynamics of previously seen objects. We demonstrate our approach on a Blocks World domain, and then apply it to a series of Minecraft tasks in which an agent autonomously learns an abstract representation of a high-dimensional task from raw pixel input. In particular, we use the Probabilistic Planning Domain Definition Language (PPDDL) (Younes & Littman, 2004) to represent our learned abstraction, which allows for the use of existing task-level planners. Our results show that an agent can leverage these portable abstractions to learn a representation of new Minecraft tasks using a diminishing number of samples, allowing it to quickly construct plans consisting of hundreds of low-level actions.

2. BACKGROUND

We assume that tasks are modelled as semi-Markov decision processes M = ⟨S, O, T, R⟩, where (i) S is the state space; (ii) O(s) is the set of temporally-extended actions, known as options, available at state s; (iii) T describes the transition dynamics, specifying the probability of arriving in state s′ after option o is executed from state s; and (iv) R specifies the reward for reaching state s′ after executing option o in state s. An option o is defined by the tuple ⟨I_o, π_o, β_o⟩, where I_o is the initiation set specifying the states in which the option can be executed, π_o is the option policy specifying the action to execute, and β_o specifies the probability of the option terminating execution in each state (Sutton et al., 1999). We adopt the object-centric formulation of Ugur & Piater (2015): in a task with n objects, the state is represented by the set {f_a, f_1, f_2, ..., f_n}, where f_a is a vector of the agent's features and f_i is a vector of features particular to object i. Note that the feature vector describing each object can itself be arbitrarily complex, such as an image or a voxel grid; in this work we use pixels. Our state space representation assumes that individual objects have already been factored into their constituent low-level attributes. Practically, this means that the agent is aware that the world consists of objects, but is unaware of what the objects are, or whether multiple instantiations of the same object are present. Different tasks will likely have differing numbers of objects in potentially arbitrary order; any learned abstract representation should be agnostic to both.
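The object-centric state formulation above can be sketched as a simple container. This is a minimal illustration under our own naming conventions (the `ObjectCentricState` class is hypothetical, not part of the paper's implementation); each feature vector could equally be a flattened image:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectCentricState:
    """A state {f_a, f_1, ..., f_n}: the agent's own features plus one
    feature vector per object. Illustrative sketch only."""
    agent: List[float]                                        # f_a
    objects: List[List[float]] = field(default_factory=list)  # f_1 .. f_n

    def num_objects(self) -> int:
        return len(self.objects)

# Different tasks may contain different numbers of objects in arbitrary
# order; a learned abstraction must not depend on either.
s = ObjectCentricState(agent=[0.0, 1.0],
                       objects=[[0.2, 0.5], [0.9, 0.1], [0.3, 0.3]])
print(s.num_objects())  # 3
```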

2.1. STATE ABSTRACTIONS FOR PLANNING

We intend to learn an abstract representation suitable for planning. Prior work has shown that a sound and complete abstract representation must necessarily be able to estimate the set of initiating and terminating states for each option (Konidaris et al., 2018). In classical planning, this corresponds to the precondition and effect of each high-level action operator (McDermott et al., 1998). The precondition is defined as Pre(o) = Pr(s ∈ I_o), a probabilistic classifier that expresses the probability that option o can be executed at state s. Similarly, the effect or image represents the distribution over states an agent may find itself in after executing o from states drawn from a distribution Z (Konidaris et al., 2018):

Im(Z, o) = (1/G) ∫_S Pr(s′ | s, o) Z(s) Pr(s ∈ I_o) ds,  where G = ∫_S Z(s) Pr(s ∈ I_o) ds.

Since the precondition is a probabilistic classifier and the effect is a probability density, they can be learned directly from option execution data. We can use preconditions and effects to evaluate the probability of a sequence of options (a plan) executing successfully. Given an initial state distribution, the precondition is used to evaluate the probability that the first option can execute, and the effect is used to determine the resulting state distribution. Applying the same logic to each subsequent option yields the probability of the entire plan executing successfully. It follows that these representations are sufficient for evaluating the probability of successfully executing any plan (Konidaris et al., 2018).

Partitioned options. For large or continuous state spaces, estimating Pr(s′ | s, o) is difficult, because the worst case requires learning a distribution conditioned on every state. However, if we assume that an option's terminating states are independent of its starting states, we can make the simplification Pr(s′ | s, o) = Pr(s′ | o).
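The plan-evaluation procedure above can be sketched with a short Monte-Carlo routine. This is a simplified illustration under the subgoal assumption (effects independent of start states); `precondition` and `effect_samples` stand in for hypothetical learned models, and the toy 1-D options at the end are invented for the example:

```python
def plan_success_probability(plan, start_samples, precondition, effect_samples):
    """Estimate the probability that a plan (a list of option labels)
    executes successfully. `precondition(o, s)` approximates Pr(s in I_o);
    `effect_samples(o)` returns samples from the option's effect
    distribution, which is state-independent for subgoal options."""
    samples = start_samples
    prob = 1.0
    for o in plan:
        # Expected probability that o can execute under the current
        # state distribution, approximated by samples.
        p_exec = sum(precondition(o, s) for s in samples) / len(samples)
        if p_exec == 0.0:
            return 0.0
        prob *= p_exec
        # Subgoal assumption: the next state distribution is simply the
        # option's effect distribution, regardless of where it started.
        samples = effect_samples(o)
    return prob

# Toy 1-D example: option "a" executes when s < 1 and lands at 1.5;
# option "b" executes when s >= 1 and lands at 3.0.
pre = lambda o, s: 1.0 if ((s < 1.0) if o == "a" else (s >= 1.0)) else 0.0
eff = lambda o: [1.5] if o == "a" else [3.0]
print(plan_success_probability(["a", "b"], [0.0, 0.5], pre, eff))  # 1.0
```

Chaining preconditions and effects in this way is exactly why the two components suffice to evaluate any plan: no other information about the low-level dynamics is needed.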
Such subgoal options (Precup, 2000) are not overly restrictive, since they are options that drive an agent to some set of states with high reliability. Nonetheless, many options are not subgoal options. It is often possible, however, to partition an option's initiation set into a finite number of subsets such that the option is approximately subgoal when executed from any individual subset. That is, we partition option o's start states into a finite set of regions C such that Pr(s′ | s, o, c) ≈ Pr(s′ | o, c) for each c ∈ C (Konidaris et al., 2018). As in prior work (Andersen & Konidaris, 2017; Konidaris et al., 2018; Ames et al., 2018), we achieve this in practice by clustering an option's execution data based on terminating states.
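The partitioning step might be sketched as a greedy single-linkage grouping over terminating states. This is a toy stand-in for the clustering used in prior work (states are scalars and the distance threshold is arbitrary, purely for illustration):

```python
def partition_option(transitions, threshold=0.5):
    """Partition an option's execution data by terminating state.
    `transitions` is a list of (start, end) scalar pairs; each returned
    cluster approximates one subgoal partition c, where
    Pr(s'|s,o,c) ~= Pr(s'|o,c). Illustrative sketch only."""
    clusters = []  # each cluster is a list of (start, end) pairs
    for start, end in transitions:
        for cluster in clusters:
            # Join an existing cluster if this terminating state is
            # close to any terminating state already in it.
            if any(abs(end - e) <= threshold for _, e in cluster):
                cluster.append((start, end))
                break
        else:
            clusters.append([(start, end)])
    return clusters

# Two executions terminate near 5 and one near 9, giving two partitions.
print(len(partition_option([(0.0, 5.0), (0.1, 5.2), (2.0, 9.0)])))  # 2
```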



¹ More results and videos can be found at: https://sites.google.com/view/mine-pddl

