SKILL MACHINES: TEMPORAL LOGIC COMPOSITION IN REINFORCEMENT LEARNING

Abstract

A major challenge in reinforcement learning is specifying tasks in a manner that is both interpretable and verifiable. One common approach is to specify tasks through reward machines: finite state machines that encode the task to be solved. We introduce skill machines, a representation that can be learned directly from these reward machines and that encodes the solution to such tasks. We propose a framework in which an agent first learns a set of base skills in a reward-free setting, and then combines these skills with the learned skill machine to produce composite behaviours specified by any regular language, and even by linear temporal logic. This gives the agent the ability to map complex logical task specifications to near-optimal behaviours zero-shot. We demonstrate our approach in both a tabular and a high-dimensional video game environment, where an agent is faced with several complex, long-horizon tasks. Our results indicate that the agent is capable of satisfying extremely complex task specifications, producing near-optimal performance with no further learning. Finally, we demonstrate that the performance of skill machines can be improved with standard off-policy reinforcement learning algorithms when optimal behaviours are desired.

1. INTRODUCTION

Reinforcement learning (RL) is a promising framework for developing truly general agents capable of acting autonomously in the real world. Despite recent successes in the field, ranging from video games (Badia et al., 2020) to robotics (Levine et al., 2016), several shortcomings of existing approaches hinder RL's real-world applicability. One issue is sample efficiency: while it is possible to collect millions of data points in a simulated environment, it is simply not feasible to do so in the real world. This inefficiency is exacerbated when a single agent is required to solve multiple tasks (as we would expect of a generally intelligent agent). One way to overcome this challenge is to reuse learned behaviours to solve new tasks (Taylor & Stone, 2009), preferably without further learning. That is, to rely on composition, where an agent first learns individual skills and then combines them to produce novel behaviours. There are several notions of compositionality in the literature, such as temporal composition, where skills are invoked one after the other ("pick up a blue object, then a box") (Sutton et al., 1999; Barreto et al., 2019), and spatial composition, where skills are combined to produce a new behaviour to be executed ("pick up a blue box") (Todorov, 2009; Saxe et al., 2017; Van Niekerk et al., 2019; Alver & Precup, 2022). Notably, Nangue Tasse et al. (2020) demonstrated how an agent can learn skills that can be combined using Boolean operators, such as negation and conjunction, to produce semantically meaningful behaviours without further learning. An important additional benefit of this compositional approach is that it provides a way to address another key issue with RL: tasks, as defined by reward functions, can be notoriously difficult to specify, which may lead to undesired behaviours that are neither easily interpretable nor verifiable.
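The Boolean skill composition described above can be sketched with simple elementwise operations over tabular value functions. The operators below (min for conjunction, max for disjunction, reflection about the value bounds for negation) follow the spirit of Nangue Tasse et al. (2020), but the Q-tables, bounds, and names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical Q-tables for two base skills over 5 states x 2 actions.
Q_MAX, Q_MIN = 1.0, 0.0  # assumed bounds of the goal-reaching values

rng = np.random.default_rng(0)
q_blue = rng.uniform(Q_MIN, Q_MAX, size=(5, 2))  # skill: "pick up a blue object"
q_box = rng.uniform(Q_MIN, Q_MAX, size=(5, 2))   # skill: "pick up a box"

def q_and(q1, q2):
    """Conjunction: intersection of goals, approximated by elementwise minimum."""
    return np.minimum(q1, q2)

def q_or(q1, q2):
    """Disjunction: union of goals, approximated by elementwise maximum."""
    return np.maximum(q1, q2)

def q_not(q):
    """Negation: reflect values about the [Q_MIN, Q_MAX] envelope."""
    return (Q_MAX + Q_MIN) - q

# "pick up a blue box" without ever having trained on that task:
q_blue_box = q_and(q_blue, q_box)
greedy_actions = q_blue_box.argmax(axis=1)  # zero-shot greedy policy
```

Note that negation is an involution (applying it twice recovers the original skill), which is what makes arbitrary Boolean expressions over skills well defined.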
Composition that enables simpler task specifications and produces reliable behaviours thus represents a major step towards safe AI (Cohen et al., 2021). Unfortunately, these compositions are strictly spatial. A further issue therefore arises when an agent is required to solve a long-horizon task. In this case, it is often near impossible for the agent to solve the task, regardless of how much data it collects, since the sequence of actions that must be executed before a learning signal is received is too long (Arjona-Medina et al., 2019). This can be mitigated by leveraging higher-order skills, which shorten the planning horizon (Sutton et al., 1999). One specific implementation of this idea is reward machines: finite state machines that encode the tasks to solve (Icarte et al., 2018). While reward machines obviate the sparse-reward problem, used in isolation they still require the agent to learn how to solve a given task through environment interaction, and the resulting solution is monolithic, with the aforementioned problems of limited transfer to new tasks and reliability. In this work, we combine these two approaches to develop an agent capable of zero-shot spatial and temporal composition. We focus in particular on temporal logic composition, such as linear temporal logic (LTL) (Pnueli, 1977), allowing agents to sequentially chain and order their skills while ensuring certain conditions are always or never met. We make the following contributions: (a) we propose skill machines, finite state machines that can be autonomously learned by a compositional agent and used to solve any task expressible as a finite state machine without further learning; (b) we prove that skill machines are satisficing: given a task specification, an agent can successfully solve it while adhering to any constraints; and (c) we demonstrate our approach in several environments, including a high-dimensional video game domain.
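A reward machine can be sketched as a finite state machine whose transitions fire on propositions that hold in the current environment state, emitting a reward on each transition. The toy machine below, for a task like "get coffee, then go to the office", is an illustrative assumption rather than the paper's exact construction; the proposition names are hypothetical.

```python
# A minimal reward machine with states u0 (need coffee), u1 (have coffee),
# and a terminal state "t". step() plays the role of the transition
# function delta_u together with the reward function delta_r.

class RewardMachine:
    def __init__(self):
        self.state = "u0"

    def step(self, props):
        """Advance on the set of propositions true in the current env state."""
        if self.state == "u0" and "coffee" in props:
            self.state = "u1"
            return 0.0
        if self.state == "u1" and "office" in props:
            self.state = "t"
            return 1.0  # task satisfied
        return 0.0

    @property
    def done(self):
        return self.state == "t"

rm = RewardMachine()
rewards = [rm.step(p) for p in [set(), {"coffee"}, set(), {"office"}]]
# rewards == [0.0, 0.0, 0.0, 1.0]; rm.done is True
```

The key point is that the sparse task reward is replaced by a structured signal: the machine's state tells the agent which sub-goal it is currently pursuing.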
Having learned a set of base skills in a reward-free setting (that is, in the absence of task rewards from a reward machine), our results indicate that our method is capable of producing near-optimal behaviour for a variety of long-horizon tasks without further learning. To describe our approach to temporal composition, we use the Office Gridworld (Icarte et al., 2018), illustrated in Figure 1a, as a running example.

2. BACKGROUND

We model the agent's interaction with the world as a Markov Decision Process (MDP), given by (S, A, P, R, γ), where (i) S ⊆ R^n is the n-dimensional state space; (ii) A is the set of (possibly continuous) actions available to the agent; (iii) P(s′|s, a) is the dynamics of the world, representing the probability of the agent reaching state s′ after executing action a in state s; (iv) R is a reward function bounded by [R_MIN, R_MAX] that represents the task the agent needs to solve; and (v) γ ∈ [0, 1] is a discount factor. The aim of the agent is to compute a Markov policy π from S to A that optimally solves a given task. Instead of directly learning a policy, an agent will often learn a value function that represents the expected return obtained by following π from a given state and action, and then act greedily with respect to it.
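The standard value-based approach to such an MDP can be sketched with tabular Q-learning and a greedy policy. The toy chain MDP, its reward, and the hyperparameters below are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Tabular Q-learning on a toy 4-state chain MDP.
n_states, n_actions = 4, 2   # actions: 0 = left, 1 = right
gamma, alpha = 0.9, 0.5
Q = np.zeros((n_states, n_actions))

def step(s, a):
    """Deterministic chain: moving right from the last state yields reward 1."""
    if a == 1:  # right
        if s == n_states - 1:
            return s, 1.0
        return s + 1, 0.0
    return max(s - 1, 0), 0.0

rng = np.random.default_rng(1)
for _ in range(500):
    s = int(rng.integers(n_states))
    a = int(rng.integers(n_actions))
    s2, r = step(s, a)
    # One-step temporal-difference update towards the bootstrapped target.
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

policy = Q.argmax(axis=1)  # greedy policy derived from the value function
```

Acting greedily with respect to the learned Q recovers the optimal "always move right" behaviour on this chain.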



Figure 1: Illustration of (a) the Office Gridworld, where the blue circle represents the agent, and (b) the finite state machine representing both the reward machine and the skill machine for the task "deliver coffee and mail to the office without breaking any decoration", where the black dots labelled t represent terminal states. In the gridworld, the agent can move to adjacent cells in any of the cardinal directions; it can also pick up coffee or mail at locations ☕ or ✉ respectively, and deliver them to the office at location o. Cells marked ✽ indicate decorations that are broken if the agent collides with them, and cells marked A–D indicate the centres of the corner rooms. The reward machines that specify tasks in this environment are defined over 10 propositions: P = {A, B, C, D, ✽, ☕, ✉, o, ✉+, o+}, where the first 8 propositions are true when the agent is at the respective locations, ✉+ is true when the agent is at ✉ and there is mail to be collected, and o+ is true when the agent is at o and there is someone in the office. The reward machine gives rewards δ_r to the agent at each FSM state, and the skill machine gives the composed skills δ_Q that maximise those rewards. For example, at u0, δ_r(u0) = 0.5(R_{☕∧¬✽}) + 0.5(R_{✉∧¬✽}) and δ_Q(u0) = 0.5(Q_☕ ∧ ¬Q_✽) + 0.5(Q_✉ ∧ ¬Q_✽).
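The composed skill emitted at u0 can be sketched as a weighted combination of Boolean-composed primitives: head for coffee or mail while avoiding decorations. The Q-tables, value bounds, and min-based conjunction below are illustrative assumptions in the style of Nangue Tasse et al. (2020), not the paper's learned skills.

```python
import numpy as np

# Hypothetical learned primitives over 3 env states x 4 actions:
# values for reaching coffee, reaching mail, or hitting a decoration.
Q_MAX, Q_MIN = 1.0, 0.0  # assumed value bounds, needed for negation
rng = np.random.default_rng(2)
Q_coffee = rng.uniform(Q_MIN, Q_MAX, (3, 4))
Q_mail = rng.uniform(Q_MIN, Q_MAX, (3, 4))
Q_decor = rng.uniform(Q_MIN, Q_MAX, (3, 4))

def q_not(q):
    return (Q_MAX + Q_MIN) - q

def q_and(q1, q2):
    return np.minimum(q1, q2)

# delta_Q(u0) = 0.5(Q_coffee AND NOT Q_decor) + 0.5(Q_mail AND NOT Q_decor)
q_u0 = 0.5 * q_and(Q_coffee, q_not(Q_decor)) \
     + 0.5 * q_and(Q_mail, q_not(Q_decor))

action = int(q_u0[0].argmax())  # greedy action in environment state 0
```

Once the machine transitions (say, after coffee is collected), the agent simply swaps in the composed skill for the next FSM state, so no new learning is needed per task.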

