SKILL MACHINES: TEMPORAL LOGIC COMPOSITION IN REINFORCEMENT LEARNING

Abstract

A major challenge in reinforcement learning is specifying tasks in a manner that is both interpretable and verifiable. One common approach is to specify tasks through reward machines: finite state machines that encode the task to be solved. We introduce skill machines, a representation that can be learned directly from these reward machines and that encodes the solution to such tasks. We propose a framework where an agent first learns a set of base skills in a reward-free setting, and then combines these skills with the learned skill machine to produce composite behaviours specified by any regular language, and even by linear temporal logics. This provides the agent with the ability to map from complex logical task specifications to near-optimal behaviours zero-shot. We demonstrate our approach in both a tabular and a high-dimensional video game environment, where an agent is faced with several of these complex, long-horizon tasks. Our results indicate that the agent is capable of satisfying extremely complex task specifications, producing near-optimal performance with no further learning. Finally, we demonstrate that the performance of skill machines can be improved with regular off-policy reinforcement learning algorithms when optimal behaviours are desired.

1. INTRODUCTION

Reinforcement learning (RL) is a promising framework for developing truly general agents capable of acting autonomously in the real world. Despite recent successes in the field, ranging from video games (Badia et al., 2020) to robotics (Levine et al., 2016), there are several shortcomings to existing approaches that hinder RL's real-world applicability. One issue is that of sample efficiency: while it is possible to collect millions of data points in a simulated environment, it is simply not feasible to do so in the real world. This inefficiency is exacerbated when a single agent is required to solve multiple tasks (as we would expect of a generally intelligent agent). One way in which generally intelligent agents overcome this challenge is by reusing learned behaviours to solve new tasks (Taylor & Stone, 2009), preferably without further learning. That is, they rely on composition, where an agent first learns individual skills and then combines them to produce novel behaviours. There are several notions of compositionality in the literature, such as temporal composition, where skills are invoked one after the other ("pick up a blue object then a box") (Sutton et al., 1999; Barreto et al., 2019), and spatial composition, where skills are combined to produce a new behaviour to be executed ("pick up a blue box") (Todorov, 2009; Saxe et al., 2017; Van Niekerk et al., 2019; Alver & Precup, 2022). Notably, work by Nangue Tasse et al. (2020) has demonstrated how an agent can learn skills that can be combined using Boolean operators, such as negation and conjunction, to produce semantically meaningful behaviours without further learning. An important additional benefit of this compositional approach is that it provides a way to address another key issue with RL: tasks, as defined by reward functions, can be notoriously difficult to specify. This may lead to undesired behaviours that are not easily interpretable and verifiable.
Composition that enables simpler task specifications and produces reliable behaviours thus represents a major step towards safe AI (Cohen et al., 2021). Unfortunately, these compositions are strictly spatial. Thus, another issue arises when an agent is required to solve a long-horizon task. In this case, it is often near impossible for the agent to solve the task, regardless of how much data it collects, since the sequence of actions to execute before a learning signal is received is too long (Arjona-Medina et al., 2019). This can be mitigated by leveraging higher-order skills, which shorten the planning horizon (Sutton et al., 1999). One specific implementation of this is reward machines, which are finite state machines that encode the tasks to solve (Icarte et al., 2018). While reward machines obviate the sparse reward problem, used in isolation they still require the agent to learn how to solve a given task through environment interaction, and the resulting solution is monolithic, which brings back the aforementioned problems with applicability to new tasks and reliability. In this work, we combine these two approaches to develop an agent capable of zero-shot spatial and temporal composition. We particularly focus on temporal logic composition, such as linear temporal logic (LTL) (Pnueli, 1977), allowing agents to sequentially chain and order their skills while ensuring certain conditions are always or never met. We make the following contributions: (a) we propose skill machines, a finite state machine that can be autonomously learned by a compositional agent, and which can be used to solve any task expressible as a finite state machine without further learning; (b) we prove that these skill machines are satisficing: given a task specification, an agent can successfully solve it while adhering to any constraints; and (c) we demonstrate our approach in several environments, including a high-dimensional video game domain.
Having learned a set of base skills in a reward-free setting (in the absence of task rewards from a reward machine), our results indicate that our method is capable of producing near-optimal behaviour for a variety of long-horizon tasks without further learning. To describe our approach to temporal composition, we use the Office Gridworld (Icarte et al., 2018) as a running example. The environment is illustrated in Figure 1a.

Figure 1: (a) The Office Gridworld. The location propositions are true when the agent is at their respective locations; the mail+ proposition is true when the agent is at the mail location and there is mail to be collected; and the office+ proposition is true when the agent is at the office and there is someone in the office. (b) The finite state machine representing both the reward and skill machine for the task "deliver coffee and mail to the office without breaking any decoration", where the black dots labelled t represent terminal states. The reward machine gives rewards ($\delta_r$) to the agent for each FSM state and the skill machine gives the composed skills $\delta_Q$ that maximise those rewards. For example, at $u_0$, $\delta_r(u_0) = 0.5\,(R_{\text{coffee} \wedge \neg ✽}) + 0.5\,(R_{\text{mail} \wedge \neg ✽})$ and $\delta_Q(u_0) = 0.5\,(Q_{\text{coffee}} \wedge \neg Q_{✽}) + 0.5\,(Q_{\text{mail}} \wedge \neg Q_{✽})$.

2. BACKGROUND

We model the agent's interaction with the world as a Markov Decision Process (MDP), given by $(S, A, P, R, \gamma)$, where (i) $S \subseteq \mathbb{R}^n$ is the $n$-dimensional state space; (ii) $A$ is the set of (possibly continuous) actions available to the agent; (iii) $P(s' \mid s, a)$ is the dynamics of the world, representing the probability of the agent reaching state $s'$ after executing action $a$ in state $s$; (iv) $R$ is a reward function bounded by $[R_{MIN}, R_{MAX}]$ that represents the task the agent needs to solve; and (v) $\gamma \in [0, 1]$ is a discount factor. The aim of the agent is to compute a Markov policy $\pi$ from $S$ to $A$ that optimally solves a given task. Instead of directly learning a policy, an agent will often learn a value function that represents the expected return following policy $\pi$ from state $s$: $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$. A more useful form of value function is the action-value function $Q^\pi(s, a)$, which represents the expected return obtained by executing $a$ from $s$, and then following $\pi$. The optimal action-value function is given by $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ for all states $s$ and actions $a$, and the optimal policy follows by acting greedily with respect to $Q^*$ at each state.
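To make the relation between $Q^*$ and the greedy policy concrete, the following is a minimal sketch of tabular value iteration. The three-state chain, its rewards, and the helper names `value_iteration` and `greedy_policy` are illustrative assumptions and do not come from the paper.

```python
from typing import Dict, List, Tuple

State, Action = int, str
# transitions[(s, a)] = list of (probability, next_state, reward)
Transitions = Dict[Tuple[State, Action], List[Tuple[float, State, float]]]


def value_iteration(transitions: Transitions, states, actions, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality backup until Q changes by less than tol."""
    q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                new = sum(
                    p * (r + gamma * max(q[(s2, a2)] for a2 in actions))
                    for p, s2, r in transitions[(s, a)]
                )
                delta = max(delta, abs(new - q[(s, a)]))
                q[(s, a)] = new
        if delta < tol:
            return q


def greedy_policy(q, states, actions):
    """Act greedily with respect to Q at every state."""
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}


# Hypothetical chain 0 -> 1 -> 2; state 2 is absorbing with zero reward,
# and the only non-zero reward is for moving right from state 1.
STATES, ACTIONS = [0, 1, 2], ["left", "right"]
T = {
    (0, "left"): [(1.0, 0, 0.0)], (0, "right"): [(1.0, 1, 0.0)],
    (1, "left"): [(1.0, 0, 0.0)], (1, "right"): [(1.0, 2, 1.0)],
    (2, "left"): [(1.0, 2, 0.0)], (2, "right"): [(1.0, 2, 0.0)],
}
Q_STAR = value_iteration(T, STATES, ACTIONS)
PI_STAR = greedy_policy(Q_STAR, STATES, ACTIONS)
```

In this toy chain the greedy policy moves right from states 0 and 1, since the discounted reward for reaching state 2 dominates staying put.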

2.1. LOGICAL COMPOSITION IN THE MULTITASK SETTING

We are interested in the multitask setting, where an agent is required to reach a set of goals in some goal space $G \subseteq S$. We assume that all tasks share the same state space, action space and dynamics, but differ in their reward functions. We model this setting by defining a background MDP $M = \langle S, A, P, R, \gamma \rangle$ with its own state space, action space, transition dynamics and background reward function $R$. Any individual task $\tau$ is then specified by a task-specific reward function $R_\tau$ that is non-zero only for states in $G$. The reward function for the resulting task MDP is then simply $R + R_\tau$. Nangue Tasse et al. (2020) consider the case where $R_\tau(g, a) \in \{R_{MIN}, R_{MAX}\}$ and develop a framework that allows agents to apply the Boolean operations of conjunction ($\wedge$), disjunction ($\vee$) and negation ($\neg$) over the space of tasks and value functions. This is achieved by first defining the goal-oriented reward function $R(s, g, a)$, which extends the task rewards ($R + R_\tau$) to penalise an agent for achieving goals different from the one it wished to achieve:

$$R(s, g, a) := \begin{cases} R_{MISS} & \text{if } g \neq s \text{ and } s \text{ is absorbing} \\ R(s, a) + R_\tau(s, a) & \text{otherwise,} \end{cases} \qquad (1)$$

where $g \in G$ and $R_{MISS}$ is a large negative penalty that can be derived from the bounds of the reward function. Using Equation 1, we can define the related goal-oriented value function as

$$Q(s, g, a) = R(s, g, a) + \gamma \int_S V^\pi(s', g)\, P_{(s,a)}(ds'), \quad \text{where } V^\pi(s, g) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, g, a_t)\right].$$

If a new task can be represented as the logical expression of previously learned tasks, Nangue Tasse et al. (2020) prove that the optimal policy can immediately be obtained by composing the learned goal-oriented value functions using the same expression.
For example, the union ($\vee$), intersection ($\wedge$), and negation ($\neg$) of two goal-reaching tasks $A$ and $B$ can be solved as follows (we omit the value functions' parameters for readability):

$$Q^*_{A \vee B} = Q^*_A \vee Q^*_B := \max\{Q^*_A, Q^*_B\}; \qquad Q^*_{A \wedge B} = Q^*_A \wedge Q^*_B := \min\{Q^*_A, Q^*_B\}; \qquad Q^*_{\neg A} = \neg Q^*_A := Q^*_{SUP} + Q^*_{INF} - Q^*_A,$$

where $Q^*_{SUP}$ and $Q^*_{INF}$ are the goal-oriented value functions for the maximum task ($R_\tau = R_{MAX}$ for all $G$) and minimum task ($R_\tau = R_{MIN}$ for all $G$), respectively. Following Nangue Tasse et al. (2022), we will also refer to these goal-oriented value functions as world value functions (WVFs).
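The three operators above act element-wise on the value functions, which the following sketch illustrates. The dictionary representation, the helper names, and the hand-written toy values are assumptions made for illustration; they are not learned WVFs and are not taken from the paper.

```python
from typing import Dict, Tuple

# A WVF stored as {(state, goal, action): value}; representation is illustrative.
WVF = Dict[Tuple[str, str, str], float]


def q_or(qa: WVF, qb: WVF) -> WVF:
    """Disjunction: element-wise maximum."""
    return {k: max(qa[k], qb[k]) for k in qa}


def q_and(qa: WVF, qb: WVF) -> WVF:
    """Conjunction: element-wise minimum."""
    return {k: min(qa[k], qb[k]) for k in qa}


def q_not(qa: WVF, q_sup: WVF, q_inf: WVF) -> WVF:
    """Negation: reflect the values between the maximum and minimum tasks."""
    return {k: q_sup[k] + q_inf[k] - qa[k] for k in qa}


def greedy_action(q: WVF, state: str) -> str:
    """Act greedily by maximising over both goals and actions."""
    best = max((k for k in q if k[0] == state), key=lambda k: q[k])
    return best[2]


GOALS, ACTIONS = ["g1", "g2"], ["to_g1", "to_g2"]


def toy_wvf(goal_values: Dict[str, float]) -> WVF:
    """Hand-written values: goal g's value is only attained by heading to g."""
    return {("s", g, a): (goal_values[g] if a == "to_" + g else 0.0)
            for g in GOALS for a in ACTIONS}


Q_SUP = toy_wvf({"g1": 1.0, "g2": 1.0})  # maximum task
Q_INF = toy_wvf({"g1": 0.0, "g2": 0.0})  # minimum task
Q_A = toy_wvf({"g1": 1.0, "g2": 0.0})    # task A rewards goal g1
Q_B = toy_wvf({"g1": 0.0, "g2": 1.0})    # task B rewards goal g2

Q_A_AND_NOT_B = q_and(Q_A, q_not(Q_B, Q_SUP, Q_INF))
```

Here "A and not B" selects the action heading to `g1`, and no learning is needed to obtain it: the composed values are computed directly from the stored ones.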

2.2. REWARD MACHINES

One difficulty with the standard MDP formulation is that the agent is often required to solve a complex long-horizon task using only a scalar reward signal as feedback from which to learn. To overcome this, Icarte et al. (2018) propose reward machines (RMs), which provide structured feedback to the agent in the form of a finite state machine (FSM). RMs encode a reward function using a set of propositional symbols $\mathcal{P}$ that represent abstract environment features as follows:

Definition 1 (Reward Machine). Given a set of states $S$ and actions $A$, a reward machine is a tuple $R_{SA} = \langle U, u_0, F, \delta_u, \delta_r \rangle$ where (i) $U$ is a finite set of states; (ii) $u_0 \in U$ is an initial state; (iii) $F$ is a finite set of terminal states; (iv) $\delta_u : U \times [S \times A \times S] \to U \cup F$ is the state-transition function; and (v) $\delta_r : U \to [S \times A \times S \to \mathbb{R}]$ is the state-reward function.

RMs consist of a finite set of states $U$ where transitions between RM states are governed by $\delta_u$, and where each RM state emits a reward function according to $\delta_r$. To incorporate RMs into the RL framework, the agent must be able to determine a correspondence between abstract RM propositions (for example, a set $\mathcal{P}$ containing the rooms A, B, C and D, the decorations ✽, and the coffee, mail and office propositions of Figure 1) and states in the environment. To achieve this, the agent is equipped with a labelling function $L : S \times A \times S \to 2^{\mathcal{P}}$ that assigns truth values to the propositions based on the agent's interaction with its environment. Thus, each label in $2^{\mathcal{P}}$ identifies an equivalence class of transitions in $S \times A \times S$. A particular instantiation of an RM that is used in practice, for example when converting an LTL specification to an RM, is a simple reward machine (SRM, denoted similarly as $R_{\mathcal{P}A}$), which restricts the form of the state-reward function to be $\delta_r : U \to [2^{\mathcal{P}} \to \mathbb{R}]$ (Icarte et al., 2018). In other words, the SRM state-reward function emits a function which maps the simpler equivalence class of states to a scalar reward.

The agent's aim then is to learn a policy $\pi : S \times U \to A$ over the joint background MDP and RM (MDPRM), which is defined by the tuple $T = \langle S, A, P, \gamma, \mathcal{P}, L, U, u_0, F, \delta_u, \delta_r \rangle$. However, the rewards from the reward machine are not necessarily Markov with respect to the environment. Icarte et al. (2018) show that a product MDPRM (Definition 2 below) guarantees that the rewards are Markov, so that the policy can be learned with standard algorithms like Q-learning. This is because the product MDPRM uses the cross-product to consolidate how actions in the environment result in simultaneous transitions in the environment and state machine. Thus, product MDPRMs take the form of standard, learnable MDPs.

Definition 2 (Product MDPRM). Let $T = \langle S, A, P, \gamma, \mathcal{P}, L, U, u_0, F, \delta_u, \delta_r \rangle$ be an MDPRM. The product MDPRM is then defined by the tuple $M_T = \langle S_T, A, P_T, R_T, \gamma \rangle$ where

$$S_T := S \times (U \cup F), \qquad R_T(\langle s, u \rangle, a, \langle s', u' \rangle) := \delta_r(u)(s, a, s'), \qquad P_T(\langle s, u \rangle, a) := \begin{cases} \langle s', u' \rangle & \text{if } u \in U \\ \langle s', u \rangle & \text{otherwise,} \end{cases}$$

with $s' \sim P(\cdot \mid s, a)$ and $u' = \delta_u(u, (s, a, s'))$.
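As a rough illustration of Definitions 1 and 2, the sketch below implements a simple reward machine and one step of the product MDPRM. The corridor environment, the class and function names, and the convention that unlisted labels leave the RM state unchanged with zero reward are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

Label = FrozenSet[str]  # the set of propositions that hold on a transition


@dataclass
class SimpleRewardMachine:
    initial: str
    terminal: FrozenSet[str]
    delta_u: Dict[Tuple[str, Label], str]    # (u, label) -> next RM state
    delta_r: Dict[Tuple[str, Label], float]  # (u, label) -> scalar reward

    def step(self, u: str, label: Label) -> Tuple[str, float]:
        if u in self.terminal:  # terminal RM states are absorbing
            return u, 0.0
        # Assumption: unlisted labels keep the RM state and give zero reward.
        return self.delta_u.get((u, label), u), self.delta_r.get((u, label), 0.0)


def product_step(rm, env_step, labelling, s, u, a):
    """One transition of the product MDPRM: returns ((s', u'), reward)."""
    s_next = env_step(s, a)
    u_next, reward = rm.step(u, labelling(s, a, s_next))
    return (s_next, u_next), reward


# Hypothetical corridor with cells 0..3: coffee in cell 1, the office in cell 3.
def env_step(s: int, a: int) -> int:
    return min(3, max(0, s + a))


def labelling(s: int, a: int, s_next: int) -> Label:
    return frozenset({1: {"coffee"}, 3: {"office"}}.get(s_next, set()))


COFFEE, OFFICE = frozenset({"coffee"}), frozenset({"office"})
RM = SimpleRewardMachine(
    initial="u0",
    terminal=frozenset({"t"}),
    delta_u={("u0", COFFEE): "u1", ("u1", OFFICE): "t"},
    delta_r={("u1", OFFICE): 1.0},
)

state, total = (0, RM.initial), 0.0
for action in (1, 1, 1):  # walk right three times
    state, r = product_step(RM, env_step, labelling, *state, action)
    total += r
```

Because the product state carries the RM state `u` alongside the environment state, the reward for reaching the office depends only on the current product state and action, which is what makes the product rewards Markov.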

3. LEVERAGING SKILL COMPOSITION FOR TEMPORAL LOGIC TASKS

Since we are interested in temporal logic tasks, we will restrict our attention to RMs whose rewards per node are specified by linear preferences over Boolean expressions (instead of arbitrary real-valued functions that are not grounded in achieving goals in an environment):

Definition 3 (Tasks). Let $M = \langle S, A, P, R, \gamma \rangle$ be a background MDP. A task is a product MDPRM $M_T = \langle S_T, A, P_T, R_T, \gamma \rangle$ over $M$ and a reward machine with reward function

$$\delta_r(u) \in \Big\{ R_w(s, a, s') = \sum_{p \in 2^{2^{\mathcal{P}}}} w_p R_p(s, a, s') \;:\; \sum_{p \in 2^{2^{\mathcal{P}}}} w_p = 1 \text{ and } w \in \mathbb{R}^{2^{2^{\mathcal{P}}}} \Big\},$$

where

$$R_p(s, a, s') := \begin{cases} R_{MAX} & \text{if } L(s, a, s') \in p \\ R_{MIN} & \text{if } L(s, a, s') \notin p \\ R(s, a) & \text{otherwise.} \end{cases}$$

We will assume that the rewards $R_p(s, a, s')$ are such that the policies that maximise them are guaranteed to reach states where the corresponding propositions $p$ are true. A common example is to have $R(s, a) = R_{MIN} = 0$ and $R_{MAX} = 1$. This definition of RMs provides a general notion of tasks that are still grounded in achieving goals. Figure 1b illustrates an example of an RM in the Office Gridworld for solving the task "deliver coffee and mail to the office without breaking any decoration".

3.1. FROM ENVIRONMENT TO PRIMITIVES

In order to solve temporal logic tasks zero-shot, we propose to first learn a set of primitive skills which can later be composed to maximise the rewards per RM node without further learning. To achieve this, we first introduce the concept of constraints $C \subseteq \mathcal{P}$, which are the set of propositions that an agent should avoid setting to true, and which correspond to the global operator $G$ in a linear temporal logic (LTL) specification. An example of a constraint might be that the agent should complete a task, but avoid breaking any decorations while doing so ($C = \{✽\}$, and in LTL we say $G\, \neg ✽$). We can now define the notions of task primitives and skill primitives such as "pick up coffee" ($F\, \text{coffee}$ in LTL) or "don't break any decoration" ($\neg (F\, ✽) = G\, \neg ✽$ in LTL).

Definition 4 (Primitives). Let $M = \langle S, A, P, R, \gamma \rangle$ be a background MDP. We define a task primitive in this domain as $M_p = \langle S_G, A_G, P_G, R_p, \gamma \rangle$, $p \in 2^{2^{\mathcal{P}}}$, with absorbing goal space $G = 2^{\mathcal{P}}$ and labelling function $L$, where
(i) $S_G := (S \times 2^C) \cup 2^{\mathcal{P}}$, where $C$ is the set of constraints;
(ii) $A_G := A \times A_\tau$, where $A_\tau = \{0, 1\}$ represents whether or not to terminate a task;
(iii) $P_G(\langle s, c \rangle, \langle a, a_\tau \rangle) := \begin{cases} L(s, a, s') & \text{if } a_\tau = 1 \\ \langle s', c' \rangle & \text{otherwise,} \end{cases}$ where $s' \sim P(\cdot \mid s, a)$ and $c' = c \cup (C \cap L(s, a, s'))$;
(iv) $R_p(\langle s, c \rangle, \langle a, a_\tau \rangle) := \begin{cases} R_{MAX} & \text{if } a_\tau = 1 \text{ and } L(s, a, s') \in p \\ R_{MIN} & \text{if } a_\tau = 1 \text{ and } L(s, a, s') \notin p \\ R(s, a) & \text{otherwise.} \end{cases}$
A skill primitive $Q^*_p$ is defined as the WVF for the task primitive $M_p$.

The above defines the state space of primitives to be the product of the environment states and the set of constraints, incorporating the set of propositions that are currently true. The action space is augmented with a terminating action following Barreto et al. (2019) and Nangue Tasse et al. (2020), which indicates that the agent wishes to achieve the goal it is currently at, and is similar to an option's termination condition (Sutton et al., 1999).
The transition dynamics update the environment state and the set of constraints that have been set to true when a regular action is taken, and use the labelling function to return the set of propositions achieved when the agent decides to terminate. Finally, the agent receives the regular background reward when taking an action, but a primitive-specific goal reward when it terminates. Importantly, primitives are temporally atomic; that is, they correspond to tasks with a single non-terminal RM state. They are thus the smallest unit of temporal logic. However, since the goal space of task primitives is defined by Boolean propositions, we can leverage prior work to solve any logical composition over them by composing their corresponding skill primitives (Nangue Tasse et al., 2020). We will denote the set of base task primitives by $M_{\mathcal{P}}$ and the corresponding base skill primitives by $Q^*_{\mathcal{P}}$, which can be composed to obtain any other primitive. For example, "pick up coffee without breaking any decoration" ($(F\, \text{coffee}) \wedge \neg (F\, ✽)$) is another primitive by Definition 4. As we discuss in Section 3.2, this solves the primary problem with reward machines: they suffer from the curse of dimensionality when all possible primitives must be relearned at all states in the FSM. Skill machines, in contrast, leverage primitive composition within and across FSM states. Theorem 1 below demonstrates that a linear combination of skill primitives maximises the task rewards (in terms of Definition 3) per RM node without further learning (proofs of all theorems are presented in the Appendix). This is also demonstrated experimentally in Figure 8 in Appendix A.5.

Theorem 1. Let $R_G$ be a vector of rewards for each task primitive, and $Q^*_G$ be the corresponding vector of optimal WVFs. Then, for an MDP $m = \langle S_G, A_G, P_G, R_w, \gamma \rangle$ with linear preference reward function $R_w = w \cdot R_G$, we have $Q^*_m = w \cdot Q^*_G$.
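The primitive dynamics and rewards of Definition 4 can be sketched as follows. This is only an illustration under stated assumptions: the environment is a hypothetical deterministic corridor, the function names are invented for the example, and $R_{MIN} = 0$, $R_{MAX} = 1$ with a zero background reward.

```python
from typing import FrozenSet, Tuple

Label = FrozenSet[str]
PrimitiveState = Tuple[int, FrozenSet[str]]  # (env state s, constraints set to true c)

R_MIN, R_MAX = 0.0, 1.0  # assumed reward bounds


def primitive_step(state, action, env_step, labelling, constraints):
    """P_G: return the achieved label when terminating, else the next <s', c'>."""
    (s, c), (a, terminate) = state, action
    s_next = env_step(s, a)
    label = labelling(s, a, s_next)
    if terminate == 1:
        return label  # absorbing goal in 2^P
    # c' = c U (C n L(s, a, s')): remember every constraint that became true.
    return (s_next, c | (constraints & label))


def primitive_reward(p, state, action, env_step, labelling, background=0.0):
    """R_p: goal reward on termination, background reward otherwise.

    Assumes a deterministic env_step so the label can be recomputed here.
    """
    (s, _), (a, terminate) = state, action
    if terminate != 1:
        return background
    label = labelling(s, a, env_step(s, a))
    return R_MAX if label in p else R_MIN


# Hypothetical corridor 0..3: a decoration in cell 1 and coffee in cell 2.
def env_step(s: int, a: int) -> int:
    return min(3, max(0, s + a))


def labelling(s: int, a: int, s_next: int) -> Label:
    return frozenset({1: {"decoration"}, 2: {"coffee"}}.get(s_next, set()))


CONSTRAINTS = frozenset({"decoration"})
P_COFFEE = {frozenset({"coffee"})}  # labels that satisfy "coffee" in this toy

state = (0, frozenset())
state = primitive_step(state, (1, 0), env_step, labelling, CONSTRAINTS)
after_decoration = state
state = primitive_step(state, (1, 0), env_step, labelling, CONSTRAINTS)
goal = primitive_step(state, (0, 1), env_step, labelling, CONSTRAINTS)
reward = primitive_reward(P_COFFEE, state, (0, 1), env_step, labelling)
```

Note that the constraint set only ever grows: once the decoration has been touched it stays in `c`, which is what lets a skill condition its behaviour on constraints that were violated earlier in the episode.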

3.2. FROM TASKS TO SKILL MACHINES

We now have agents capable of solving any logical and linear composition of base task primitives $M_{\mathcal{P}}$ by only learning their corresponding base skill primitives $Q^*_{\mathcal{P}}$. Given this compositional ability over skills, and reward machines that expose the structure of tasks, agents can solve temporally extended tasks with little or no further learning. To achieve this, we define a skill machine (SM) as a representation of logical and temporal knowledge over skills.

Definition 5 (Skill Machine). Given a task $M_T = \langle S_T, A, P_T, R_T, \gamma \rangle$ defined by a reward machine $R_{SA} = \langle U, u_0, F, \delta_u, \delta_r \rangle$, a set of propositional symbols $\mathcal{P}$ with constraints $C \subseteq \mathcal{P}$, and their corresponding base skill primitives $Q^*_{\mathcal{P}}$, a skill machine is a tuple $Q^*_{SA} = \langle U, u_0, F, \delta_u, \delta_Q, w_U, w_G \rangle$ where (i) $w_U : U \times U \to \mathbb{R}$ is a preference function over transitions; (ii) $w_G : S_G \times G \to \mathbb{R}$ is a preference function over goals; and (iii) $\delta_Q : U \to [S_G \times A_G \to \mathbb{R}]$ is the state-skill function defined by

$$\delta_Q(u)(\langle s, c \rangle, \langle a, 0 \rangle) \mapsto \sum_{g \in G} \sum_{u' \in U} w_G(\langle s, c \rangle, u, g)\, w_U(u, u')\, Q^*_{u,u'}(\langle s, c \rangle, g, \langle a, 0 \rangle),$$

where $Q^*_{u,u'}$ is the WVF obtained by composing the skill primitives $Q^*_G$ according to the Boolean expression for the transition $\delta_u(u)(s, a, s') = u'$.

For a given state $s \in S$ in the environment, set of true constraints $c \subseteq C$, and state $u$ in the skill machine, the skill machine uses its preference over transitions $w_U$ and goals $w_G$ to compute a skill $Q(\langle s, c \rangle, \langle a, 0 \rangle) := \delta_Q(u)(\langle s, c \rangle, \langle a, 0 \rangle)$ that an agent can use to take an action $a$. The environment then transitions to the next state $s'$, where $\langle s', c' \rangle \leftarrow P_G(\langle s, c \rangle, \langle a, 0 \rangle)$, and the skill machine transitions to $u' \leftarrow \delta_u(u, L(s, a, s'))$. $w_U$ represents cases where there is not necessarily a single desirable transition to follow given the current SM state. This is illustrated by the SM in Figure 1b, where mail and coffee are equally desirable at the initial state.
Similarly, $w_G$ represents cases where there may be a single desirable task, but its goals are not necessarily equally desirable given the environment state, for example when the agent needs to first pick up coffee but there are two coffee locations. Remarkably, there always exists a choice for $w_U$ and $w_G$ that is optimal with respect to the corresponding reward machine, as shown in Theorem 2:

Theorem 2. Let $\pi^*(s, u)$ be the optimal policy for a task $M_T$, and let $C = \mathcal{P}$. Then there exists a corresponding skill machine with a $w_G$ and $w_U$ such that $\pi^*(s, u) \in \arg\max_{a \in A} \delta_Q(u)(\langle s, c \rangle, \langle a, 0 \rangle)$, where $\delta_Q$ is given by $w_G$ and $w_U$ as per Definition 5.

Theorem 2 shows that skill machines can be used to solve tasks without having to relearn action-level policies. The next section shows how an agent can approximate a skill machine by planning over simple reward machines.

3.3. FROM SIMPLE REWARD MACHINES TO SKILL MACHINES

In the previous section, we introduced skill machines and showed that they can be used to represent the logical and temporal composition of skills needed to solve reward machines. We now show how, for simple RMs (RMs returning scalar rewards as defined in Section 2.2), their approximate SM can be obtained zero-shot without further learning. To achieve this, we first plan over the reward machine (using value iteration, for example) to obtain Q-values for each transition. We then select the skills for each SM state greedily. This process is illustrated in Figure 2. While this only holds for cases where the greedy skills are always satisfice-able from any environment state, this still covers many tasks of interest. In particular, this holds for any RM with non-zero rewards of $R_{MAX}$ only at accepting transitions, as shown in Theorem 3.

Theorem 3. Let $R_{\mathcal{P}A} = \langle U, u_0, F, \delta_u, \delta_M \rangle$ be a satisfice-able simple reward machine with non-zero rewards $R_{MAX}$ only for accepting transitions, and for which all valid transitions $(u, u')$ are achievable from any state $s \in S$. Define the skill machine $Q^*_{SA} = \langle U, u_0, F, \delta_u, \delta_Q, w_U, w_G \rangle$ with

$$w_U(u, u') := \begin{cases} 1 & \text{if } u' = \arg\max_{u''} Q^*(u, u'') \\ 0 & \text{otherwise,} \end{cases} \qquad w_G(\langle s, c \rangle, u, g) := \begin{cases} 1 & \text{if } g = \arg\max_{g'} \max_a \sum_{u'} w_U(u, u')\, Q^*_{u,u'}(\langle s, c \rangle, g', \langle a, 0 \rangle) \\ 0 & \text{otherwise,} \end{cases}$$

where $Q^*$ is the optimal transition-value function for $R_{\mathcal{P}A}$. Then following the policy $\pi(s, u) \in \arg\max_{a \in A} \delta_Q(u)(\langle s, c \rangle, \langle a, 0 \rangle)$ will reach an accepting transition.

Theorem 3 is critical as it provides soundness guarantees, ensuring that the policy derived from the skill machine will always satisfice the task requirements. Finally, in cases where the composed skill $\delta_Q$ obtained from the approximate SM is not sufficiently optimal, we can use any off-policy algorithm to learn a new skill $Q_T$ few-shot. This is achieved by using the maximising Q-values $\max\{\beta Q_T, (1 - \beta)\delta_Q\}$ in the exploration policy during learning. Here, $\beta \in (0, 1)$ is a parameter that determines how much of the composed policy to use. It can also be seen as decreasing the potentially overestimated values of $\delta_Q$, since $\delta_Q$ is greedy with respect to both goals and RM transitions. Consider Q-learning with $\beta = \gamma$. During the $\epsilon$-greedy exploration, we use $a \leftarrow \arg\max_{a \in A} \max\{\gamma Q_T, (1 - \gamma)\delta_Q\}$ to select greedy actions, hence improving the initial performance of the agent where $\gamma Q_T < (1 - \gamma)\delta_Q$, and guaranteeing convergence in the limit like regular Q-learning. Appendix A.2 illustrates this process.
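The two procedures described in this section, planning over the reward machine to pick transitions greedily and mixing the composed skill into the exploration policy, can be sketched as below. The graph encoding of the RM, the function names, the fixed number of sweeps, and all numeric values are assumptions made for illustration rather than the authors' implementation.

```python
from typing import Dict, Set, Tuple

# edges[u][u_next] = scalar reward of that RM transition (hypothetical encoding)
Edges = Dict[str, Dict[str, float]]


def plan_over_rm(edges: Edges, terminal: Set[str], gamma: float = 0.9,
                 sweeps: int = 100) -> Dict[Tuple[str, str], float]:
    """Value iteration over RM transitions: Q(u, u') = r + gamma * max Q(u', .)."""
    q = {(u, v): 0.0 for u, out in edges.items() for v in out}
    for _ in range(sweeps):
        for u, out in edges.items():
            for v, r in out.items():
                # Assumes every non-terminal RM state has an outgoing edge.
                future = 0.0 if v in terminal else max(q[(v, w)] for w in edges[v])
                q[(u, v)] = r + gamma * future
    return q


def greedy_transition_weights(q, edges: Edges) -> Dict[Tuple[str, str], float]:
    """w_U(u, u') = 1 for the highest-valued transition out of u, else 0."""
    w = {}
    for u, out in edges.items():
        best = max(out, key=lambda v: q[(u, v)])
        for v in out:
            w[(u, v)] = 1.0 if v == best else 0.0
    return w


def few_shot_greedy_action(q_task: Dict[str, float],
                           q_composed: Dict[str, float], beta: float) -> str:
    """argmax_a max{beta * Q_T(a), (1 - beta) * delta_Q(a)} at a fixed state."""
    return max(q_task, key=lambda a: max(beta * q_task[a],
                                         (1 - beta) * q_composed[a]))


# Toy RM: get coffee (u0 -> u1), then reach the office (u1 -> t); breaking a
# decoration leads to a terminal "fail" state. Only the accepting transition
# carries a non-zero reward.
EDGES = {"u0": {"u1": 0.0, "fail": 0.0}, "u1": {"t": 1.0, "fail": 0.0}}
Q_RM = plan_over_rm(EDGES, terminal={"t", "fail"})
W_U = greedy_transition_weights(Q_RM, EDGES)

# Early in learning Q_T is uninformative, so the composed skill decides.
ACTION = few_shot_greedy_action({"left": 0.0, "right": 0.0},
                                {"left": 0.2, "right": 0.8}, beta=0.9)
```

As the task-specific values in `q_task` grow during learning, the term scaled by `beta` eventually dominates and the action choice reduces to ordinary greedy Q-learning.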

4. EXPERIMENTS

We consider the Office Gridworld (Figure 1a) and the Moving Targets (Figure 4) domains: (i) Office Gridworld (Icarte et al., 2018): We learn the base skill primitives with goal-oriented Q-learning (Nangue Tasse et al., 2020), where the agent keeps track of reached goals and uses Q-learning (Watkins, 1989) to update the WVF with respect to all seen goals at every time step. (ii) Moving Targets Domain (Nangue Tasse et al., 2020): This is a canonical object collection domain with high-dimensional pixel observations (84 × 84 × 3 RGB images). The agent here needs to pick up objects of various shapes and colours; picked objects respawn at random empty positions, similarly to previous object collection domains (Barreto et al., 2020). There are 3 object colours (beige, blue, purple) and 2 object shapes (squares, circles). The tasks here are defined over 6 propositions, all of which are also constraints ($\mathcal{P} = C$). We learn the corresponding base skill primitives with goal-oriented Q-learning (Nangue Tasse et al., 2020), but using deep Q-learning (Mnih et al., 2015) to update the WVFs.

Table 1: Tasks in the Office Gridworld. The RMs are generated from the LTL expressions.
Task | Description | LTL
3 | Deliver coffee and mail to the office without breaking any decoration | F ∧ X F ∧ X F || F ∧ X F ∧X F ∧ (G¬✽)
4 | Deliver mail to the office until there is no mail left, then deliver coffee to office while there are people in the office, then patrol rooms A-B-C-D-A, and never break a decoration | F ∧ X F ∧ X ¬ U ¬ + ∧ ∧ X F ∧ X ¬ U ¬ + ∧ ∧ X (F A ∧ X (F (B ∧ X (F (C ∧ X (F (D ∧ X (F A)))))))) ∧ (G ¬✽)

We use the Office Gridworld as a multitask domain, and we evaluate how long it takes an agent to learn a policy that can solve the four tasks described in Table 1. The agent iterates through the tasks, changing from one to the next after each episode.
In all of our experiments, we compare the performance of skill machines with that of state-of-the-art RM-based learning approaches: counterfactual RMs (CRM), where the Q-functions are updated with respect to all possible RM transitions from a given environment state, and hierarchical RMs (HRM), where an agent learns options per RM state that are grounded in the environment states (Icarte et al., 2018). In addition to learning all four tasks, we also experiment with Tasks 3 and 4 in isolation. In these single-task domains, the difference between CRM, HRM, skill machines and Q-learning should be less pronounced, since CRM, HRM and few-shot learning with skill machines can no longer leverage the shared experience across multiple tasks. Thus, the comparison between multi-task and single-task learning in this setting evaluates the benefit of the compositionality afforded by skill machines. The results of these three experiments (each run for $2 \times 10^5$ time steps) are shown in Figure 3. Regular Q-learning struggles to learn Task 3 and completely fails to learn the hardest task (Task 4). Additionally, notice that while QL and CRM can theoretically learn the tasks optimally given infinite time, only HRM and SM are able to learn hard long-horizon tasks in practice. It is important to note that we train all algorithms for the same amount of time during these experiments, and previous work (Nangue Tasse et al., 2020) has shown that learning the WVFs takes longer than learning task-specific skills. In addition, the skill machines are being used to zero-shot generalise to the office tasks using skill primitives. Thus, using the skill machines in isolation (labelled SM in Figure 3) may provide sub-optimal performance compared to the task-specific agents, since the skill machines have not been trained to optimality and are not specialised to the domain.
Even under these conditions, we observe that skill machines perform near-optimally in terms of final performance and, due to the amortised nature of learning the WVF, achieve their final rewards from the first epoch.

4.2. FEW-SHOT TEMPORAL LOGICS

It is possible to pair the skill machines with a learning algorithm such as Q-learning to achieve few-shot generalisation. From the results shown in Figure 3, it is apparent that skill machines paired with Q-learning (labelled QL-SM in Figure 3) achieve the best performance in both the single-task and multi-task settings. While it is not clear from the rewards that adding Q-learning provides significant improvements to the skill machine, their trajectories show that Q-learning does indeed improve on the skill machine policies when they are not optimal (Appendix 9). Additionally, skill machines with Q-learning always begin with a significantly higher reward and converge to their final performance faster than all benchmarks, except the zero-shot one, which is (near) optimal in all cases. The speed of learning is due to the compositionality of the skill primitives with skill machines, and the high final performance is due to the generality of the learned primitives being paired with the domain-specific Q-learner. In sum, skill machines provide fast composition of skills and achieve the best performance among all benchmarks when paired with a learning algorithm.

4.3. FUNCTION APPROXIMATION

We now demonstrate our temporal logic composition approach in the Moving Targets domain, where function approximation is required. Figure 4 shows the average returns of the optimal policies and SM policies for the four tasks described in Table 2, with a maximum of 50 steps per episode. Our results show that even when using function approximation with sub-optimal skill primitives, the zero-shot policies obtained from skill machines are very close to optimal on average. We also observe that for very challenging tasks like Tasks 3 and 4 (where the agent must satisfice difficult temporal constraints), the compounding effect of the sub-optimal policies sometimes leads to failures. In such cases, learning new skills few-shot using tabular Q-learning by leveraging the SM would guarantee convergence to optimal policies, as demonstrated in Section 4.2, but that is not guaranteed using function approximation.

Table 2: Tasks in the Moving Targets domain. To repeat forever, the terminal states of the RMs generated from LTL are removed, and transitions to them are looped back to the start state.
Task | Description | LTL
 |  | F ( ∧ X(F ( ∧ X(F (( ∨ ) ∧ ¬( ∨ ))))))
3 | Pick up blue objects or squares, but never blue squares. Repeat this forever. | (F (blue ∨ square)) ∧ (G ¬(blue ∧ square))
4 | Pick up non-square blue objects, then non-blue squares in that order. Repeat this forever. | F ((¬square ∧ blue) ∧ X(F (square ∧ ¬blue)))

5. RELATED WORK

One family of approaches to spatial composition leverages forms of regularisation to achieve semantically meaningful disjunction (Todorov, 2009; Van Niekerk et al., 2019) or conjunction (Haarnoja et al., 2018; Hunt et al., 2019). Weighted composition has also been demonstrated; for example, Peng et al. (2019) learn weights to compose existing policies multiplicatively to solve new tasks. Approaches that leverage the successor feature (SF) framework (Barreto et al., 2017) are capable of solving tasks defined by linear preferences over features (Barreto et al., 2020). Alver & Precup (2022) show that an SF basis can be learned that is sufficient to span the space of tasks under consideration, while Nemecek & Parr (2021) determine which policies should be stored in limited memory so as to maximise performance on future tasks. In contrast to these approaches, our framework allows for both spatial composition (including operators such as negation that other approaches do not support) and temporal composition such as LTL. A popular way of achieving temporal composition is through the options framework (Sutton et al., 1999; Bacon et al., 2017). Here, high-level skills are first discovered and then executed sequentially to solve a task (Konidaris & Barto, 2009; Bagaria & Konidaris, 2019). Barreto et al. (2019) leverage the SF and options framework and learn how to linearly combine skills, chaining them sequentially to solve temporal tasks. However, these options-based approaches offer a relatively simple form of temporal composition. By contrast, we are able to solve tasks expressed through regular languages zero-shot, while providing soundness guarantees. Work has also centred on approaches to defining tasks using human-readable logic operators. For example, Li et al. (2017) and Littman et al. (2017) specify tasks using LTL, which is then used to generate a standard reward signal for an RL agent. Camacho et al. (2019) show how to perform reward shaping given LTL specifications, while Jothimurugan et al. (2019) develop a formal language that encodes tasks as sequences, conjunctions and disjunctions of subtasks. This is then used to obtain a shaped reward function that can be used for learning. All of these approaches focus on how an agent can improve learning given such specifications or structure, but we show how an explicitly compositional agent can immediately solve such tasks using WVFs without further learning.

6. CONCLUSION

We proposed skill machines, finite state machines that can be learned from reward machines, which allow agents to solve extremely complex tasks involving temporal and spatial composition. We demonstrated how skills can be learned and encoded in a specific form of goal-oriented value function that, when combined with the learned skill machines, is sufficient for solving subsequent tasks without further learning. Our approach guarantees that the resulting policy adheres to the logical task specification, which provides assurances of safety and verifiability for the agent's decision making; these characteristics are necessary if we are ever to deploy RL agents in the real world. While the resulting behaviour is provably satisficing, empirical results demonstrate that the agent's performance is near optimal. Should optimality be required, further fine-tuning can be performed, and initialising learning from the skill machine greatly improves sample efficiency. We see this approach as a step towards truly generally intelligent agents, capable of immediately solving human-specifiable tasks in the real world with no further learning.

A APPENDIX

A.1 PROOFS OF THEORETICAL RESULTS

Theorem 1. Let $R_G$ be a vector of rewards for each task primitive, and $Q^*_G$ be the corresponding vector of optimal WVFs. Then, for an MDP $m = \langle S_G, A_G, P_G, R_w, \gamma \rangle$ with linear preference reward function $R_w = w \cdot R_G$, we have $Q^*_m = w \cdot Q^*_G$.

Proof.

$$Q^*_m(s, g, a) = \mathbb{E}_{\pi^*}\left[\sum_{t=0}^{\infty} \gamma^t \, w \cdot R_G(s_t, g, a_t)\right] = w \cdot \mathbb{E}_{\pi^*}\left[\sum_{t=0}^{\infty} \gamma^t R_G(s_t, g, a_t)\right] = w \cdot Q^*_G(s, g, a),$$

where the second step holds since the world policies are independent of the task (Nangue Tasse et al., 2020, Lemma 2).

Theorem 2. Let $\pi^*(s, u)$ be the optimal policy for a task $M_T$, and let $C = P$. Then there exists a corresponding skill machine with a $w_G$ and $w_U$ such that

$$\pi^*(s, u) \in \arg\max_{a \in A} \delta_Q(u)(\langle s, c \rangle, \langle a, 0 \rangle),$$

where $\delta_Q$ is given by $w_G$ and $w_U$ as per Definition 5.

Proof. Let $w_U(u, \cdot) = \frac{1}{N_{\delta_u}}$, where $N_{\delta_u}$ is the number of possible RM transitions from $u$. Also let $w_G(s, u, \cdot)$ be 1 for the set of propositions $g \in 2^C$ that are satisfied when following $\pi^*(s, u)$, and zero everywhere else. Then $\pi^*(s, u) \in \arg\max_{a \in A} \delta_Q(u)(\langle s, c \rangle, \langle a, 0 \rangle)$, since $w_U(u, u') \, Q^*_{u,u'}(\langle s, c \rangle, g, \langle a, 0 \rangle)$ is optimal using Theorem 1 and optimal policies are assumed to reach task goals.

Theorem 3. Let $R_{PA} = \langle U, u_0, F, \delta_u, \delta_M \rangle$ be a satisfice-able simple reward machine with non-zero rewards $R_{MAX}$ only for accepting transitions, and for which all valid transitions $(u, u')$ are achievable from any state $s \in S$. Define the skill machine $Q^*_{SA} = \langle U, u_0, F, \delta_u, \delta_Q, w_U, w_G \rangle$ with

$$w_U(u, u') := \begin{cases} 1 & \text{if } u' = \arg\max_{u''} Q^*(u, u'') \\ 0 & \text{otherwise} \end{cases}$$

$$w_G(\langle s, c \rangle, u, g) := \begin{cases} 1 & \text{if } g = \arg\max_{g'} \max_{a} \sum_{u'} w_U(u, u') \, Q^*_{u,u'}(\langle s, c \rangle, g', \langle a, 0 \rangle) \\ 0 & \text{otherwise} \end{cases}$$

where $Q^*$ is the optimal transition-value function for $R_{PA}$. Then following the policy $\pi(s, u) \in \arg\max_{a \in A} \delta_Q(u)(\langle s, c \rangle, \langle a, 0 \rangle)$ will reach an accepting transition.

Proof. This follows from the optimality of $\pi^*(s, u)$ and $Q^*$, since each transition of the RM is satisfice-able from any environment state.
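The central step in the proof of Theorem 1 is that, once the policy is fixed, the expected discounted return is linear in the reward. The following is a minimal numerical sketch of that step only, assuming a small random tabular MDP. The function `policy_q_values` and all array names are hypothetical and do not come from the paper. The sketch does not model goals or WVFs; it fixes the policy by hand, which stands in for the task-independence of the world policy (Nangue Tasse et al., 2020, Lemma 2).

```python
import numpy as np

def policy_q_values(P, R, pi, gamma):
    """Exact Q^pi for a tabular MDP (hypothetical helper, for illustration).

    P:  (S, A, S) transition probabilities
    R:  (S, A) rewards
    pi: (S,) deterministic policy, one action index per state
    """
    S, _ = R.shape
    P_pi = P[np.arange(S), pi]   # (S, S) transitions under pi
    r_pi = R[np.arange(S), pi]   # (S,) rewards under pi
    # Solve the Bellman equation v = r_pi + gamma * P_pi v
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * P @ v     # (S, A)

rng = np.random.default_rng(0)
S, A, K, gamma = 5, 3, 4, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)   # normalise rows into distributions
R_G = rng.random((K, S, A))          # one reward function per primitive
w = rng.random(K)                    # linear preference weights
pi = rng.integers(A, size=S)         # the same fixed policy for every task

# Value functions of each primitive, and of the weighted task w . R_G
Q_G = np.stack([policy_q_values(P, R_G[k], pi, gamma) for k in range(K)])
Q_w = policy_q_values(P, np.tensordot(w, R_G, axes=1), pi, gamma)

# Largest deviation between Q_w and the weighted sum w . Q_G
linear_gap = np.abs(Q_w - np.tensordot(w, Q_G, axes=1)).max()
```

If the policy were instead chosen greedily for each task, this equality would generally fail, because the maximisation over actions is not linear. That is why the proof relies on the world policy being independent of the task.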
A.2 PSEUDO-CODE FOR FEW-SHOT Q-LEARNING USING SKILL MACHINES

Algorithm 1: Few-shot Q-learning using skill machines

Input: $\gamma, \alpha, P, C, L, U, u_0, F, \delta_u, \delta_Q$
Initialise: $Q(s, u, a)$
foreach episode do
    Observe initial state $s \in S$, get initial $u \leftarrow u_0$, and $c = 0$
    while episode is not done do
        /* Using the composed skill $\delta_Q$ in the behaviour policy */
        $a \leftarrow \arg\max_{a \in A} \left( \max\{\gamma Q(s, u, a), (1 - \gamma) \delta_Q(u)(\langle s, c \rangle, \langle a, 0 \rangle)\} \right)$ if $\text{Bernoulli}(1 - \epsilon) = 1$, a random action otherwise
        Take action $a$ and observe next state $s'$ and true constraints $c \leftarrow c \cup (C \cap L(s, a, s'))$
        Get reward $r \leftarrow \delta_r(u)(s, a, s')$ and the next RM state $u' \leftarrow \delta_u(u, L(s, a, s'))$
        $Q(s, u, a) \xleftarrow{\alpha} r$ if $s'$ is terminal or $u' \in F$, else $r + \gamma \max_{a'} Q(s', u', a')$
        $s \leftarrow s'$, $u \leftarrow u'$

A.3 FUNCTION APPROXIMATION WITH CONTINUOUS ACTIONS AND STATES

Task | Description | LTL
1 | Navigate to a button and then to a cylinder. | (F (B ∧ X (F C)))
2 | Navigate to a button and then to a cylinder while never entering hazard regions. | (F (B ∧ X (F C))) ∧ (G ¬H)
3 | Navigate to a button, then to a cylinder without entering hazard regions, then to a button inside a hazard region, and finally to a cylinder again. | F (B ∧ X (F ((C ∧ ¬H) ∧ X (F ((B ∧ H) ∧ X(F H)))))))
4 | Navigate to a button and then to a cylinder in a hazard region. | (F (B ∧ X (F C ∧ H)))
5 | Navigate to a cylinder, then to a button in a hazard region, and finally to a cylinder again. | (F (C ∧ X (F ((B ∧ H) ∧ X (C))))
6 | Navigate to a hazard, then to a cylinder, and finally to a cylinder again while avoiding hazards. | (F (H ∧ X (F (C ∧ X (F (C ∧ H))))))

Table 3: Tasks in the Safety AI Gym domains. The RMs are generated from the LTL expressions.

We demonstrate our temporal logic composition approach in a Safety AI Gym domain (Figure 5) (Ray et al., 2019), which has a continuous state space ($S = \mathbb{R}^{60}$) and a continuous action space ($A = \mathbb{R}^{2}$). The agent here is a point mass that needs to navigate to various regions defined by 3 propositions ($P = \{B, C, H\}$) corresponding to its 3 lidar sensors for the buttons (B) (grey spheres), the cylinder (C) (translucent cylinder), and the hazards (H) (blue regions). The button and hazard positions are fixed as shown in Figure 5, the cylinder is randomly placed on one of the buttons, and the agent is randomly placed anywhere on the plane. We first learn the 3 base skill primitives corresponding to each predicate (with constraints $C = \{H\}$) using goal-oriented Q-learning (Nangue Tasse et al., 2020), with Twin Delayed DDPG (Fujimoto et al., 2018) used to update the WVFs. Figure 6 shows the trajectories of the SM policies for the six tasks described in Table 3. Our results show that skill primitives can be leveraged to solve temporal logic tasks zero-shot, even in continuous domains.
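The two core steps of Algorithm 1 in Appendix A.2, the behaviour policy that mixes the learned Q-values with the composed skill and the Q-learning update, can be sketched in tabular form as follows. This is an illustrative sketch under stated assumptions and not the authors' implementation. The function names, the array layout `Q[s, u, a]`, and the `skill_row` array (standing in for the values $\delta_Q(u)(\langle s, c \rangle, \langle a, 0 \rangle)$ over actions) are hypothetical. The update reads $x \xleftarrow{\alpha} y$ as $x \leftarrow x + \alpha (y - x)$.

```python
import numpy as np

def behaviour_action(q_row, skill_row, gamma, epsilon, rng):
    """Epsilon-greedy action that mixes learned and composed-skill values.

    q_row[a]     : current estimate Q(s, u, a)
    skill_row[a] : composed skill value for action a (assumed precomputed)
    """
    if rng.random() < epsilon:
        # Explore: uniformly random action
        return int(rng.integers(len(q_row)))
    # Exploit: element-wise max of the two scaled value estimates
    mixed = np.maximum(gamma * q_row, (1.0 - gamma) * skill_row)
    return int(np.argmax(mixed))

def q_update(Q, s, u, a, r, s_next, u_next, done, alpha, gamma):
    """One Q-learning step over the product of environment and RM states.

    `done` should be True when s_next is terminal or u_next is accepting.
    """
    target = r if done else r + gamma * Q[s_next, u_next].max()
    Q[s, u, a] += alpha * (target - Q[s, u, a])
    return Q

# Toy usage with |S| = 4 environment states, |U| = 2 RM states, |A| = 3 actions
rng = np.random.default_rng(0)
Q = np.zeros((4, 2, 3))
skill_row = np.array([0.1, 0.9, 0.3])   # hypothetical composed-skill values
a = behaviour_action(Q[0, 0], skill_row, gamma=0.9, epsilon=0.0, rng=rng)
Q = q_update(Q, s=0, u=0, a=a, r=1.0, s_next=1, u_next=1,
             done=True, alpha=0.5, gamma=0.9)
```

Because Q starts at zero, the composed skill dominates the behaviour policy early in training, and the learned values take over as they grow.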

A.4 DETAILS OF EXPERIMENTAL SETTING

In this section we elaborate further on the hyper-parameters for the various experiments in Section 4. We also describe the pretraining of WVFs for all of the experimental settings, which corresponds to learning the base task primitives for each domain. The same hyper-parameters are used for all algorithms in a particular experiment. This is to ensure that we evaluate the relative performance

3) using the skill machine without further learning (left) and with further learning (right).



Accepting transitions are transitions at which the high-level task (described, for example, by linear temporal logic) is satisfied.



, an agent (blue circle) can move to adjacent cells in any of the cardinal directions. It can also pick up coffee or mail at locations or respectively, and it can deliver them to the office at location . Cells marked ✽ indicate decorations that are broken if the agent collides with them, and cells marked A-D indicate the centres of the corner rooms. The reward machines that specify tasks in this environment are defined over 10 propositions: P = {A, B, C, D, ✽, , , , + , + }, where the first 8 propositions are true when

Figure 1: Illustration of (a) the office gridworld, where the blue circle represents the agent, and (b) the finite state machine representing both the reward machine and the skill machine for the task "deliver coffee and mail to the office without breaking any decoration", where the black dots labeled t represent terminal states. The reward machine gives rewards (δ_r) to the agent for each FSM state, and the skill machine gives the composed skills δ_Q that maximise those rewards. For example at u_0, δ_r(u_0) = 0.5(R ∧¬✽ ) + 0.5(R

Figure 2: The SRM, value-iterated RM and skill machine for the task "Deliver coffee to the office without breaking any decoration". This task is specified using LTL as (F ( ∧X(F )))∧(G ¬✽)), where F = Finally, X = neXt, G = Globally are LTL operators. The corresponding RM is obtained by converting the LTL into a finite state machine (Duret-Lutz et al., 2016) and then giving a reward of 1 for accepting transitions and 0 otherwise. The black dots labeled t represent terminal states.
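To make the construction described in the Figure 2 caption concrete, the following is a hand-written sketch of a reward machine for the task "deliver coffee to the office without breaking any decoration". It gives a reward of 1 on the accepting transition and 0 otherwise. It is not the output of the LTL-to-automaton conversion used in the paper. The state names (`u0`, `u1`, `accept`, `fail`), the proposition names, and the functions `rm_step` and `run` are all hypothetical. The sketch also assumes that the task ends once the office is reached.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical proposition names standing in for the symbols in the figure
COFFEE, OFFICE, DECORATION = "coffee", "office", "decoration"

def rm_step(u: str, true_props: FrozenSet[str]) -> Tuple[str, float]:
    """One reward-machine transition: returns (next RM state, reward)."""
    if u in ("accept", "fail"):
        return u, 0.0               # absorbing states
    if DECORATION in true_props:
        return "fail", 0.0          # the "never break a decoration" part is violated
    if u == "u0" and COFFEE in true_props:
        return "u1", 0.0            # coffee collected, office still to be reached
    if u == "u1" and OFFICE in true_props:
        return "accept", 1.0        # accepting transition: reward 1
    return u, 0.0                   # otherwise stay in the same RM state

def run(labels: Iterable[Set[str]]) -> Tuple[str, float]:
    """Feed a sequence of label sets through the RM from the initial state."""
    u, total = "u0", 0.0
    for props in labels:
        u, r = rm_step(u, frozenset(props))
        total += r
    return u, total

success = run([{COFFEE}, set(), {OFFICE}])
broken = run([{COFFEE}, {DECORATION}, {OFFICE}])
wrong_order = run([{OFFICE}, {COFFEE}])
```

The office only counts on a step strictly after the coffee has been collected, which mirrors the role of the neXt operator in the LTL specification.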

(i) Office Gridworld Icarte et al. (2018): The tasks here are specified over 10 propositions P = {A, B, C, D, ✽, , , , + , + } and 1 constraint C = {✽}. We learn the base skill primitives Q * P (Figure 7 in Appendix A.5) using goal oriented Q-learning Nangue Tasse et al. (

the office without breaking decorations | F ∧ X F ∧ (G ¬✽) 2 Patrol rooms A, B, C, and D without breaking any decoration | (F (A ∧ X (F (B ∧ X (F (C ∧ X (F D))))))) ∧ (G ¬✽) 3

Figure 3: Average (over 80 independent trials) returns during training in the Office Gridworld.

Figure 4: The Moving Targets domain (left) and the average returns over 100 runs for tasks in Table 2 (right), where B, P, S = , , .

g, a t ) ; since the world policies are independent of task Nangue Tasse et al. (2020)[Lemma 2].

Figure 5: Visualisation of the Safety AI Gym Domain.

{A, B, C, D, ✽, , , , + , + } (reaching states where the predicate is set to True), with constraints C = {✽}. All other primitives in this domain can be obtained zero-shot through value function composition. Similarly, for the Moving Targets domain, the WVFs are pre-trained on the primitives corresponding to obtaining objects by shape or colour in the environment separately, P = { , , , , }, with constraints C = P. From here, the value functions for finding objects of particular colours or any more complex primitives can be composed zero-shot. Finally, for the SafeAI Gym environment, the base skill primitives correspond to going to a button (B), a cylinder (C), and a hazard (H): P = {B, C, H}, trained with constraints C = {H}.
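As an illustration of how further primitives might be obtained zero-shot through value function composition, the sketch below applies Boolean operators to value arrays. It assumes the composition rules of the Boolean task algebra of Nangue Tasse et al. (2020) as we understand them: conjunction as an element-wise minimum, disjunction as an element-wise maximum, and negation as a reflection between the values of the maximum and minimum tasks. The function names and the numbers are hypothetical, and the arrays stand in for learned WVFs evaluated at a single state.

```python
import numpy as np

def q_and(q1, q2):
    """Conjunction of two tasks: element-wise minimum (assumed rule)."""
    return np.minimum(q1, q2)

def q_or(q1, q2):
    """Disjunction of two tasks: element-wise maximum (assumed rule)."""
    return np.maximum(q1, q2)

def q_not(q, q_max, q_min):
    """Negation of a task: reflect between the bounding value functions (assumed rule)."""
    return (q_max + q_min) - q

# Hypothetical values over three goals for two learned primitives
q_blue = np.array([1.0, 0.25, 0.75])
q_square = np.array([1.0, 0.75, 0.0])
q_max = np.ones(3)    # stand-in for the task where every goal is desirable
q_min = np.zeros(3)   # stand-in for the task where no goal is desirable

# "Blue but not square", composed without any further learning
q_blue_not_square = q_and(q_blue, q_not(q_square, q_max, q_min))
best_goal = int(np.argmax(q_blue_not_square))
```

In this toy example the composed values favour the third goal, which is valued under the "blue" primitive and not under the "square" one.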

Figure 7: The policies (arrows) and value functions (heat map) of the base primitive tasks in the Office Gridworld. These are obtained by maximising over the goals of the learned WVFs. All errors in the figures are due to training the WVFs for 200000 time steps, hence not to convergence.

Figure 9: Agent trajectories for various tasks in the Office Gridworld (Table 3) using the skill machine without further learning (left) and with further learning (right).

Table of hyper-parameters used for Q-learning in the Office World experiments.

Table of hyper-parameters used for Deep Q-learning in the Moving Targets experiments.

Table of hyper-parameters used for the TD3 in the SafeAI Gym experiments.

