HIERARCHIES OF REWARD MACHINES

Abstract

Reward machines (RMs) are a recent formalism for representing the reward function of a reinforcement learning task through a finite-state machine whose edges encode landmarks of the task using high-level events. The structure of RMs enables the decomposition of a task into simpler and independently solvable subtasks that help tackle long-horizon and/or sparse reward tasks. We propose a formalism for further abstracting the subtask structure by endowing an RM with the ability to call other RMs, thus composing a hierarchy of RMs (HRM). We exploit HRMs by treating each call to an RM as an independently solvable subtask using the options framework, and describe a curriculum-based method to learn HRMs from traces observed by the agent. Our experiments reveal that exploiting a handcrafted HRM leads to faster convergence than with a flat HRM, and that learning an HRM is feasible in cases where its equivalent flat representation is not.

1. INTRODUCTION

More than a decade ago, Dietterich et al. (2008) argued for the need to "learn at multiple time scales simultaneously, and with a rich structure of events and durations". Finite-state machines (FSMs) are a simple yet powerful formalism for abstractly representing temporal tasks in a structured manner. One of the most prominent types of FSMs used in reinforcement learning (RL; Sutton & Barto, 2018) are reward machines (RMs; Toro Icarte et al., 2018; 2022), where each edge is labeled with (i) a formula over a set of high-level events that captures a task's subgoal, and (ii) a reward for satisfying the formula. Hence, RMs fulfill the need for structuring events and durations. Hierarchical reinforcement learning (HRL; Barto & Mahadevan, 2003) frameworks, such as options (Sutton et al., 1999), have been applied over RMs to learn policies at two levels of abstraction: (i) select a formula (i.e., subgoal) from a given RM state to complete the overall task, and (ii) select an action to satisfy the chosen formula (Toro Icarte et al., 2018; Furelos-Blanco et al., 2021). Thus, RMs also allow learning at multiple scales simultaneously. The subtask decomposition powered by HRL eases the handling of long-horizon and sparse-reward tasks. Besides, RMs can act as an external memory in partially observable tasks by keeping track of the subgoals achieved so far and those still to be achieved. In this work, we make the following contributions:

1. Enhance the abstraction power of RMs by defining hierarchies of RMs (HRMs), where constituent RMs can call other RMs (Section 3). We prove that any HRM can be transformed into an equivalent flat HRM that behaves exactly like the original. We show that, under certain conditions, the equivalent flat HRM can have exponentially more states and edges.

2. Propose an HRL algorithm to exploit HRMs by treating each call as a subtask (Section 4). Learning policies in HRMs further fulfills the desiderata posed by Dietterich et al. since (i) there is an arbitrary number of time scales to learn across (not only two), and (ii) there is a richer range of increasingly abstract events and durations. Besides, hierarchies enable modularity and, hence, the reusability of RMs and policies. Empirically, we show that leveraging a handcrafted HRM enables faster convergence than an equivalent flat HRM.

3. Introduce a curriculum-based method for learning HRMs from traces given a set of hierarchically composable tasks (Section 5). In line with the theory (Contribution 1), our experiments reveal that decomposing an RM into several is crucial to make its learning feasible (i.e., the flat HRM cannot be efficiently learned from scratch) since (i) the constituent RMs are simpler (i.e., they have fewer states and edges), and (ii) previously learned RMs can be used to efficiently explore the environment in the search for traces in more complex tasks.

2. BACKGROUND

Given a finite set X, we use ∆(X) to denote the probability simplex over X, X* to denote (possibly empty) sequences of elements from X, and X+ to denote non-empty sequences. We also use ⊥ and ⊤ to denote the truth values false and true, respectively. 1[A] is the indicator function of event A. We represent RL tasks as episodic labeled Markov decision processes (MDPs; Xu et al., 2020), each consisting of a set of states S, a set of actions A, a transition function p : S × A → ∆(S), a reward function r : (S × A)+ × S → R, a discount factor γ ∈ [0, 1), a finite set of propositions P representing high-level events, a labeling function l : S → 2^P mapping states to proposition subsets called labels, and a termination function τ : (S × A)* × S → {⊥, ⊤} × {⊥, ⊤}. Hence, the transition function p is Markovian, but the reward function r and termination function τ are not. Given a history h_t = ⟨s_0, a_0, . . . , s_t⟩ ∈ (S × A)* × S, a label trace (or trace, for short) λ_t = ⟨l(s_0), . . . , l(s_t)⟩ ∈ (2^P)+ assigns labels to all states in h_t. We assume (λ_t, s_t) captures all relevant information about h_t; thus, the reward and termination information can be written r(h_t, a_t, s_{t+1}) = r(h_{t+1}) = r(λ_{t+1}, s_{t+1}) and τ(h_t) = τ(λ_t, s_t), respectively. We aim to find a policy π : (2^P)+ × S → A, a mapping from trace-state pairs to actions, that maximizes the expected cumulative discounted reward (or return) R_t = E_π[Σ_{k=t}^{n} γ^{k−t} r(λ_{k+1}, s_{k+1})], where n is the episode's last step. At time t, the trace is λ_t ∈ (2^P)+, and the agent observes a tuple s̄_t = ⟨s_t, s^T_t, s^G_t⟩, where s_t ∈ S is the state and (s^T_t, s^G_t) = τ(λ_t, s_t) is the termination information, with s^T_t and s^G_t indicating whether the history (λ_t, s_t) is terminal and whether it is a goal, respectively. If the history is non-terminal, the agent runs action a_t ∈ A, and the environment transitions to state s_{t+1} ∼ p(·|s_t, a_t).
The agent then extends the trace as λ_{t+1} = λ_t ⊕ l(s_{t+1}), receives reward r_{t+1} = r(λ_{t+1}, s_{t+1}), and observes a new tuple s̄_{t+1}. A trace λ_t is a goal trace if (s^T_t, s^G_t) = (⊤, ⊤), a dead-end trace if (s^T_t, s^G_t) = (⊤, ⊥), and an incomplete trace if s^T_t = ⊥. We assume that the reward is r(λ_{t+1}, s_{t+1}) = 1[τ(λ_{t+1}, s_{t+1}) = (⊤, ⊤)], i.e., 1 for goal histories and 0 otherwise.

A (simple) reward machine (RM; Toro Icarte et al., 2018; 2022) is a tuple ⟨U, P, φ, r, u_0, U^A, U^R⟩, where U is a finite set of states; P is a finite set of propositions; φ : U × U → DNF_P is a state transition function such that φ(u, u′) denotes the disjunctive normal form (DNF) formula over P to be satisfied to transition from u to u′; r : U × U → R is a reward transition function such that r(u, u′) is the reward for transitioning from u to u′; u_0 ∈ U is an initial state; U^A ⊆ U is a set of accepting states denoting the task's goal achievement; and U^R ⊆ U is a set of rejecting states denoting the unfeasibility of achieving the goal. Ideally, RM states should capture traces, such that (i) pairs (u, s) of an RM state and an MDP state are sufficient to predict the future, and (ii) the reward r(u, u′) matches the underlying MDP's reward. The state transition function is deterministic, i.e., at most one formula from each state is satisfied by a given label. To verify whether a formula is satisfied by a label L ⊆ P, L is used as a truth assignment where propositions in L are true and the rest are false (e.g., {a} |= a ∧ ¬b).

Options (Sutton et al., 1999) address temporal abstraction in RL. Given an episodic labeled MDP, an option is a tuple ω = ⟨I_ω, π_ω, β_ω⟩, where I_ω ⊆ S is the option's initiation set, π_ω : S → A is the option's policy, and β_ω : S → [0, 1] is the option's termination condition. An option is available in s ∈ S if s ∈ I_ω, selects actions according to π_ω, and terminates in s ∈ S with probability β_ω(s).
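For concreteness, a simple RM for the task "observe a, then b" can be encoded and run as follows. This is a minimal sketch of the semantics above, not the authors' implementation; DNF formulas are represented as lists of (positive, negative) proposition-set disjuncts, and all state names are hypothetical:

```python
from dataclasses import dataclass, field

def satisfies(dnf, label):
    """A DNF formula is a list of disjuncts; each disjunct is a (pos, neg)
    pair of proposition sets. A label L satisfies a disjunct iff pos is a
    subset of L and neg does not intersect L (propositions in L are true,
    the rest false)."""
    return any(pos <= label and not (neg & label) for pos, neg in dnf)

@dataclass
class RewardMachine:
    transitions: dict            # u -> list of (dnf, next_state, reward)
    u0: str                      # initial state
    accepting: set               # accepting states U^A
    rejecting: set = field(default_factory=set)  # rejecting states U^R

    def step(self, u, label):
        """Deterministic transition: at most one outgoing formula may be
        satisfied by a label; stay put (reward 0) if none is."""
        for dnf, u_next, reward in self.transitions.get(u, []):
            if satisfies(dnf, label):
                return u_next, reward
        return u, 0.0

    def run(self, trace):
        """Process a label trace; return the final state and total reward."""
        u, total = self.u0, 0.0
        for label in trace:
            u, r = self.step(u, label)
            total += r
        return u, total
```

For example, an RM with transitions u0 --a--> u1 --b--> uA (reward 1 on the final edge) accepts the trace ⟨{a}, {}, {b}⟩, matching the convention that reward 1 is given exactly when an accepting state is reached.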

3. FORMALIZATION OF HIERARCHIES OF REWARD MACHINES

We here introduce our formalism for hierarchically composing reward machines, and propose the CRAFTWORLD domain (cf. Figure 1a) to illustrate it. In this domain, the agent can move forward or rotate 90°, staying put if it moves towards a wall. Locations are labeled with propositions from a finite set P (depicted as icons in Figure 1a), and the agent observes the propositions of the locations it steps on. Table 1 lists tasks that consist of observing a sequence of propositions, where the reward is 1 if the sequence is observed and 0 otherwise. These tasks are based on those by Andreas et al. (2017) and Toro Icarte et al. (2018), but here they can be defined in terms of each other.

Reward machines (RMs) are the building blocks of our formalism. To constitute a hierarchy of RMs, we need to endow RMs with the ability to call each other. We redefine the state transition function as φ : U × U × M → DNF_P, where M is a set of RMs. The expression φ(u, u′, M) denotes the DNF formula over P that must be satisfied to transition from u ∈ U to u′ ∈ U by calling RM M ∈ M.

[Figure 1: A CRAFTWORLD grid (a), an HRM for BOOK (b), and an equivalent flat HRM (c). An edge from state u to u′ of an RM M_i is of the form M_j | φ_i(u, u′, M_j), double-circled states are accepting states, and loop transitions are omitted. Calls to the leaf RM M_⊤ are omitted in (c).]

[Table 1: List of CRAFTWORLD tasks. Descriptions "x ; y" express sequential order (observe/do x then y), and descriptions "x & y" express that x and y can be observed/done in any order.]

We refer to the formulas φ(u, u′, M) as contexts
As we will see later, contexts help preserve determinism and must be satisfied to start a call (a necessary but not sufficient condition). The hierarchies we consider contain an RM M_⊤ called the leaf RM, which solely consists of an accepting state (i.e., U_⊤ = U^A_⊤ = {u^0_⊤}) and immediately returns control to the RM that calls it.

Definition 1. A hierarchy of reward machines (HRM) is a tuple H = ⟨M, M_r, P⟩, where M = {M_0, . . . , M_{m−1}} ∪ {M_⊤} is a set of m RMs and the leaf RM M_⊤, M_r ∈ M \ {M_⊤} is the root RM, and P is a finite set of propositions used by all constituent RMs.

We make the following assumptions: (i) HRMs do not have circular dependencies (i.e., an RM cannot be called back from itself, including recursion); (ii) rejecting states are global (i.e., they cause the root task to fail); (iii) accepting and rejecting states do not have transitions to other states; and (iv) the reward function of the root corresponds to the reward obtained in the underlying MDP. Given assumption (i), each RM M_i has a height h_i, which corresponds to the maximum number of nested calls needed to reach the leaf. Formally, if i = ⊤, then h_i = 0; otherwise, h_i = 1 + max_j h_j, where j ranges over all RMs called by M_i (i.e., there exists (u, v) ∈ U_i × U_i such that φ_i(u, v, M_j) ≠ ⊥). Figure 1b shows BOOK's HRM, whose root has height 2. The PAPER and LEATHER RMs, which have height 1 and consist of observing a two-proposition sequence, can be run in any order, followed by observing a final proposition. The context labeling the call to M_1 preserves determinism, as detailed later.

In the following paragraphs, we describe how an HRM processes a label trace. To indicate where the agent is in an HRM, we define the notion of hierarchy states.

Definition 2. Given an HRM H = ⟨M, M_r, P⟩, a hierarchy state is a tuple ⟨M_i, u, Φ, Γ⟩, where M_i ∈ M is an RM, u ∈ U_i is a state, Φ ∈ DNF_P is an accumulated context, and Γ is a call stack.

Definition 3.
Given an HRM H = ⟨M, M_r, P⟩, a call stack Γ contains tuples ⟨u, v, M_i, M_j, ϕ, Φ⟩, each denoting a call where u ∈ U_i is the state from which the call is made; v ∈ U_i is the next state in the calling RM M_i ∈ M after reaching an accepting state of the called RM M_j ∈ M; ϕ ∈ DNF_P are the disjuncts of φ_i(u, v, M_j) satisfied by a label; and Φ ∈ DNF_P is the accumulated context.

Call stacks determine where to resume the execution. Each RM appears in the stack at most once since, by assumption, HRMs have no circular dependencies. We use Γ ⊕ ⟨u, v, M_i, M_j, ϕ, Φ⟩ to denote a stack recursively defined by a stack Γ and a top element ⟨u, v, M_i, M_j, ϕ, Φ⟩, where the accumulated context Φ is the condition under which the call from state u is made. The initial hierarchy state of an HRM H = ⟨M, M_r, P⟩ is ⟨M_r, u^0_r, ⊤, []⟩: we are in the initial state of the root, there is no accumulated context, and the stack is empty.

At the beginning of this section, we mentioned that satisfying the context of a call is a necessary but not sufficient condition to start the call. We now introduce a sufficient condition, called the exit condition.

Definition 4. Given an HRM H = ⟨M, M_r, P⟩ and a hierarchy state ⟨M_i, u, Φ, Γ⟩, the exit condition ξ_{i,u,Φ} ∈ DNF_P is the formula that must be satisfied to leave that hierarchy state. Formally,

ξ_{i,u,Φ} = Φ if i = ⊤, and ξ_{i,u,Φ} = ⋁_{v ∈ U_i, M_j ∈ M, ϕ = φ_i(u,v,M_j) ≠ ⊥} ξ_{j, u^0_j, DNF(Φ ∧ ϕ)} otherwise,

where DNF(Φ ∧ ϕ) is Φ ∧ ϕ rewritten in DNF. The formula is Φ if M_i = M_⊤ since it always returns control once called. Otherwise, the formula is recursively defined as the disjunction of the exit conditions from the initial states of the called RMs. For instance, the exit condition for the initial hierarchy state in Figure 1b is (¬ ∧ ) ∨ .

We now have everything needed to define the hierarchical transition function δ_H, which maps a hierarchy state ⟨M_i, u, Φ, Γ⟩ into another given a label L. There are three cases: 1.
If u is an accepting state of M_i and the stack Γ is non-empty, pop the top element of Γ and return control to the calling RM, recursively applying δ_H in case several accepting states are reached simultaneously. Formally, the next hierarchy state is δ_H(⟨M_j, u′, ⊤, Γ′⟩, ⊥) if u ∈ U^A_i and |Γ| > 0, where Γ = Γ′ ⊕ ⟨·, u′, M_j, M_i, ·, ·⟩. 2. If the label L satisfies the exit condition of a call from u, push the call onto the stack and move to the initial state of the called RM, again applying δ_H recursively in case further nested calls start: the next hierarchy state is δ_H(⟨M_j, u^0_j, Φ′, Γ ⊕ ⟨u, u′, M_i, M_j, ϕ, Φ⟩⟩, L) if L |= ξ_{j,u^0_j,Φ′}, where ϕ = φ_i(u, u′, M_j)(L) and Φ′ = DNF(Φ ∧ ϕ). Here, φ(L) denotes the disjuncts of a DNF formula φ ∈ DNF_P satisfied by L. 3. If none of the previous conditions holds, the hierarchy state remains unchanged.

The state transition functions φ of the RMs must be such that δ_H is deterministic, i.e., a label cannot simultaneously satisfy the contexts and exit conditions associated with two triplets ⟨u, v, M_i⟩ and ⟨u, v′, M_j⟩ such that either (i) v = v′ and i ≠ j, or (ii) v ≠ v′. Contexts help enforce determinism by making formulas mutually exclusive. For instance, if the call to M_1 from the initial state of M_0 in Figure 1b had context ⊤ instead of its negated-proposition context, then M_1 and M_2 could both be started by the same label, thus making the HRM non-deterministic.

Finally, we introduce hierarchy traversals, which determine how a label trace is processed by an HRM using δ_H.

Definition 5. Given a label trace λ = ⟨L_0, . . . , L_n⟩, a hierarchy traversal H(λ) = ⟨v_0, v_1, . . . , v_{n+1}⟩ is the unique sequence of hierarchy states such that (i) v_0 = ⟨M_r, u^0_r, ⊤, []⟩, and (ii) δ_H(v_i, L_i) = v_{i+1} for i = 0, . . . , n. An HRM H accepts λ if v_{n+1} = ⟨M_r, u, ⊤, []⟩ and u ∈ U^A_r (i.e., an accepting state of the root is reached). Analogously, H rejects λ if v_{n+1} = ⟨M_k, u, ·, ·⟩ and u ∈ U^R_k for some k ∈ [0, m−1] (i.e., a rejecting state in the HRM is reached). Example 1.
The HRM in Figure 1b accepts the label trace λ = ⟨{ }, { }, {}, { }, { }, { }⟩ since the traversal is H(λ) = ⟨⟨M_0, u^0_0, ⊤, []⟩, ⟨M_1, u^1_1, ⊤, [⟨u^0_0, u^1_0, M_0, M_1, ¬ , ⊤⟩]⟩, ⟨M_0, u^1_0, ⊤, []⟩, ⟨M_0, u^1_0, ⊤, []⟩, ⟨M_2, u^1_2, ⊤, [⟨u^1_0, u^3_0, M_0, M_2, ⊤, ⊤⟩]⟩, ⟨M_0, u^3_0, ⊤, []⟩, ⟨M_0, u^A_0, ⊤, []⟩⟩. The step-by-step application of the hierarchical transition function δ_H is shown in Appendix A.

The behavior of an HRM H can be reproduced by an equivalent flat HRM H̄; that is, (i) the root of H̄ has height 1, and (ii) H̄ accepts a trace iff H accepts it, rejects a trace iff H rejects it, and neither accepts nor rejects a trace iff H neither accepts nor rejects it. Flat HRMs thus capture the original RM definition. Figure 1c shows a flat HRM for the BOOK task. We formally define equivalence and prove the following theorem by construction in Appendix B.1.

Theorem 1. Given an HRM H, there exists an equivalent flat HRM H̄.

Given the construction used in Theorem 1, we show that the number of states and edges of the resulting flat HRM can be exponential in the height of the root (see Theorem 2). We prove this result in Appendix B.2 through an instance of a general HRM parametrization where the constituent RMs are highly reused, hence illustrating the convenience of HRMs for succinctly composing existing knowledge. In line with the theory, learning a non-flat HRM can take a few seconds, whereas learning an equivalent flat HRM is often unfeasible (see Section 6).

Theorem 2. Let H = ⟨M, M_r, P⟩ be an HRM and let h_r be the height of its root M_r. The number of states and edges in an equivalent flat HRM H̄ can be exponential in h_r.
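To make the recursion in Definition 4 concrete, the sketch below computes the exit condition of a hierarchy state. It is our own simplified encoding, not the paper's: contexts are single conjunctions rather than general DNF formulas, each represented as a (positive, negative) proposition-set pair, and "LEAF" plays the role of M_⊤. All machine and state names are hypothetical:

```python
def exit_condition(machines, m, u, ctx):
    """Exit condition of hierarchy state <m, u, ctx>, returned as a list of
    DNF disjuncts. machines maps an RM name to {"u0": initial state,
    "trans": {state: [(called_machine, dnf, next_state), ...]}}."""
    if m == "LEAF":           # M_top immediately returns control: xi = ctx
        return [ctx]
    disjuncts = []
    for called, dnf, _next in machines[m]["trans"].get(u, []):
        for pos, neg in dnf:
            # DNF(ctx AND disjunct): union the positive and negative literals
            conj = (ctx[0] | pos, ctx[1] | neg)
            if conj[0] & conj[1]:     # contradictory conjunction, drop it
                continue
            start = None if called == "LEAF" else machines[called]["u0"]
            disjuncts.extend(exit_condition(machines, called, start, conj))
    return disjuncts
```

On a two-call root analogous to Figure 1b, where M_1 is callable under context ¬c and M_2 under ⊤, and the first formulas of M_1 and M_2 are a and c respectively, the exit condition of the root's initial hierarchy state comes out as the two disjuncts of (a ∧ ¬c) ∨ c, mirroring the example in the text.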

4. POLICY LEARNING IN HIERARCHIES OF REWARD MACHINES

In what follows, we explain how to exploit the temporal structure of an HRM H = ⟨M, M_r, P⟩ using two types of options. We describe (i) how to learn the policies of these options, (ii) when these options terminate, and (iii) an option selection algorithm that ensures the currently running options and the current hierarchy state are aligned. We discuss implementation details in Appendix C.

Types. Given an RM M_i ∈ M, a state u ∈ U_i and a context Φ, an option ω^{j,ϕ}_{i,u,Φ} is derived for each non-false disjunct ϕ of each transition φ_i(u, v, M_j), where v ∈ U_i and M_j ∈ M. An option is either (i) a formula option if j = ⊤ (i.e., M_⊤ is called), or (ii) a call option otherwise. A formula option attempts to reach a label that satisfies ϕ ∧ Φ through primitive actions, whereas a call option aims to reach an accepting state of the called RM M_j under context ϕ ∧ Φ by invoking other options.

Policies. Policies are ϵ-greedy during training, and greedy during evaluation. A formula option's policy is derived from a Q-function q_{ϕ∧Φ}(s, a; θ_{ϕ∧Φ}) approximated by a deep Q-network (DQN; Mnih et al., 2015) with parameters θ_{ϕ∧Φ}, which outputs the Q-value of each action given an MDP state. We store all options' experiences (s_t, a, s_{t+1}) in a single replay buffer D, thus performing intra-option learning (Sutton et al., 1998). The Q-learning update uses the following loss function:

E_{(s_t, a, s_{t+1}) ∼ D} [ ( r_{ϕ∧Φ}(s_{t+1}) + γ max_{a′} q_{ϕ∧Φ}(s_{t+1}, a′; θ^−_{ϕ∧Φ}) − q_{ϕ∧Φ}(s_t, a; θ_{ϕ∧Φ}) )² ],

where r_{ϕ∧Φ}(s_{t+1}) = 1[l(s_{t+1}) |= ϕ ∧ Φ], i.e., the reward is 1 if ϕ ∧ Φ is satisfied and 0 otherwise; the term q_{ϕ∧Φ}(s_{t+1}, a′; θ^−_{ϕ∧Φ}) is 0 when ϕ ∧ Φ is satisfied or a dead-end is reached (i.e., s^T_{t+1} = ⊤ and s^G_{t+1} = ⊥); and θ^−_{ϕ∧Φ} are the parameters of a fixed target network.
A call option's policy is induced by a Q-function q_i(s, u, Φ, ⟨M_j, ϕ⟩; θ_i) associated with the RM M_i from which calls are made, approximated by a DQN with parameters θ_i that outputs the Q-value of each call in the RM given an MDP state, an RM state and a context. We store experiences (s_t, ω^{j,ϕ}_{i,u,Φ}, s_{t+k}) in a replay buffer D_i associated with M_i, and perform SMDP Q-learning using the following loss:

E_{(s_t, ω^{j,ϕ}_{i,u,Φ}, s_{t+k}) ∼ D_i} [ ( r + γ^k max_{j′,ϕ′} q_i(s_{t+k}, u′, Φ′, ⟨M_{j′}, ϕ′⟩; θ^−_i) − q_i(s_t, u, Φ, ⟨M_j, ϕ⟩; θ_i) )² ],

where k is the number of steps between s_t and s_{t+k}; r is the sum of discounted rewards during this time; u′ and Φ′ are the RM state and context after running the option; M_{j′} and ϕ′ correspond to an outgoing transition from u′, i.e., ϕ′ ∈ φ_i(u′, ·, M_{j′}); and θ^−_i are the parameters of a fixed target network. The term q_i(s_{t+k}, . . .) is 0 if u′ is accepting or rejecting. Following the definition of δ_H, Φ′ is ⊤ if the hierarchy state changes; thus, Φ′ = ⊤ if u′ ≠ u, and Φ′ = Φ otherwise. Following our assumption on the MDP reward, we define the reward transition functions as r_i(u, u′) = 1[u ∉ U^A_i ∧ u′ ∈ U^A_i].

Learning a call option's policy and lower-level option policies simultaneously can be unstable due to non-stationarity (Levy et al., 2019); e.g., the same lower-level option may only sometimes achieve its goal. To alleviate this problem, experiences are added to the buffer only when options achieve their goal (i.e., call options assume lower-level options terminate successfully). Due to the hierarchical structure, the policies will be recursively optimal (Dietterich, 2000) at best.

Termination. An option terminates in two cases: first, if the episode ends in a goal state or a dead-end state; second, if the hierarchy state changes in a way that either successfully completes the option or interrupts it.
Concretely, a formula option ω^{⊤,ϕ}_{i,u,Φ} is only applicable in a hierarchy state ⟨M_i, u, Φ, Γ⟩, while a call option ω^{j,ϕ}_{i,u,Φ} always corresponds to a stack item ⟨u, ·, M_i, M_j, ϕ, Φ⟩. We can thus analyze the hierarchy state to see whether an option is still executing or should terminate.

Algorithm. An option stack Ω_H stores the currently executing options. Initially, Ω_H is empty. At each step, Ω_H is filled by repeatedly choosing options starting from the current hierarchy state using call option policies until a formula option is selected. Since HRMs have, by assumption, no circular dependencies, a formula option will eventually be chosen. After an action is selected using the formula option's policy and applied, the DQNs associated with formula options are updated. The new hierarchy state is then used to determine which options in Ω_H have terminated. Experiences for the terminated options that achieved their goal are pushed into the corresponding buffers, and the DQNs associated with the call options are updated. Finally, Ω_H is updated to match the call stack of the new hierarchy state (if needed) by mapping each call stack item into an option and adding it to Ω_H if it is not already there. By aligning the option stack with the call stack, we can update the DQNs for options that ended up being run in hindsight and that would otherwise have been ignored.
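The stack-filling step of the algorithm can be sketched as follows. This is our own schematic rendering, with hypothetical machine and state names: `choose_call` abstracts the call option policy at a hierarchy state, and descending stops at the first call to the leaf RM (a formula option):

```python
def fill_option_stack(hstate, choose_call, is_leaf, initial_hstate):
    """Fill the option stack from the current hierarchy state: query the
    call option policy at each level until a formula option (a call to the
    leaf RM) is chosen. Terminates because HRMs have, by assumption, no
    circular dependencies."""
    stack = []
    while True:
        machine, formula = choose_call(hstate)   # greedy/eps-greedy choice
        stack.append((hstate, machine, formula))
        if is_leaf(machine):
            return stack                # bottom item is the formula option
        hstate = initial_hstate(machine)        # descend into the called RM
```

The returned stack mirrors the call stack of the hierarchy state: aligning the two after each environment step is what lets the algorithm credit options that turned out to run in hindsight.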

5. LEARNING HIERARCHIES OF REWARD MACHINES FROM TRACES

In the previous section, we explained how a given HRM can be exploited using options; however, handcrafting an HRM is often impractical. We here describe LHRM, a method that interleaves policy learning with HRM learning from interaction. We consider a multi-task setting: given T tasks and I instances (e.g., grids) of an environment, the agent learns (i) an HRM for each task using traces from several instances for better accuracy, and (ii) general policies to reach the goal in each task-instance pair. Namely, the agent interacts with T × I MDPs M_ij, where i ∈ [1, T] and j ∈ [1, I]. The learning proceeds from simpler to harder tasks such that HRMs for the latter build on the former. In what follows, we detail the components of LHRM. We assume that (i) all MDPs share the propositions P and actions A, while those defined on a given instance share the states S and labeling function l; (ii) to stabilize policy learning, dead-end traces must be common across tasks; (iii) the height of the root of a task's HRM (or task level, for brevity) is known (see Table 1 for CRAFTWORLD); and (iv) without loss of generality, each RM has a single accepting state and a single rejecting state.

Curriculum Learning (Bengio et al., 2009). LHRM learns the tasks' HRMs from lower to higher levels akin to Pierrot et al. (2019). Before starting an episode, LHRM selects an MDP M_ij, where i ∈ [1, T] and j ∈ [1, I]. The probability of selecting an MDP M_ij is determined by an estimate of its average undiscounted return R_ij such that lower returns are mapped into higher probabilities (see details in Appendix D). Initially, only level 1 MDPs can be chosen. When the minimum average return across MDPs up to the current level surpasses a given threshold, the current level increases by 1, hence ensuring the learned HRMs and their associated policies are reusable in higher-level tasks.

Learning an HRM.
The learning of an HRM is analogous to learning a flat RM (Toro Icarte et al., 2019; Xu et al., 2020; Furelos-Blanco et al., 2021; Hasanbeig et al., 2021). The objective is to learn the state transition function φ_r of the root M_r with height h_r given (i) a set of states U_r, (ii) a set of label traces Λ = Λ_G ∪ Λ_D ∪ Λ_I, (iii) a set of propositions P, (iv) a set of RMs M with lower heights than h_r, (v) a set of callable RMs M_C ⊆ M (by default, M_C = M), and (vi) the maximum number of disjuncts κ in the DNF formulas labeling the edges. The learned state transition function φ_r is such that the resulting HRM H = ⟨M ∪ {M_r}, M_r, P⟩ accepts all goal traces Λ_G, rejects all dead-end traces Λ_D, and neither accepts nor rejects the incomplete traces Λ_I. The transition functions can be represented as sets of logic rules, which are learned using the ILASP (Law et al., 2015) inductive logic programming system (see Appendix E for details on the ILASP encoding).

Interleaving Algorithm. LHRM interleaves the induction of HRMs with policy learning akin to Furelos-Blanco et al. (2021). Initially, the HRM's root for each task i ∈ [1, T] consists of 3 states (the initial, accepting, and rejecting states) and neither accepts nor rejects anything. A new HRM is learned when an episode's label trace is not correctly recognized by the current HRM (i.e., when a goal trace is not accepted, a dead-end trace is not rejected, or an incomplete trace is accepted or rejected). The number of states in U_r increases by 1 when an HRM that covers the examples cannot be learned, hence guaranteeing that the root has the smallest possible number of states (i.e., it is minimal) for a specific value of κ. When an HRM for task i is learned, the returns R_ij in the curriculum are reset to 0 for all j ∈ [1, I].
Analogously to some RM learning methods (Toro Icarte et al., 2019; Xu et al., 2020; Hasanbeig et al., 2021), the first HRM for a task is learned from a set of traces; in our case, the ρ_s shortest traces from a set of ρ goal traces are used (empirically, short traces speed up learning). Finally, LHRM leverages learned options to explore the environment during the collection of the ρ goal traces, speeding up the process when labels are sparse. Specifically, options from lower-height RMs are sequentially selected uniformly at random, and their greedy policies are run until termination.
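The curriculum's task-selection rule (lower estimated returns map to higher selection probabilities) can be sketched as follows. The exact weighting scheme is in the paper's Appendix D; this is one plausible choice, a softmax over negated returns, with a hypothetical temperature parameter:

```python
import math

def task_probabilities(avg_returns, temperature=0.5):
    """Map estimated average undiscounted returns R_ij to selection
    probabilities so that tasks with lower returns are sampled more often
    (softmax over negated returns; one plausible instantiation of the
    curriculum's weighting, not the paper's exact scheme)."""
    weights = [math.exp(-r / temperature) for r in avg_returns]
    total = sum(weights)
    return [w / total for w in weights]
```

For instance, among three tasks with estimated returns 0.9, 0.1, and 0.5, the second (least mastered) task gets the highest selection probability, focusing experience where learning is still needed.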

6. EXPERIMENTAL RESULTS

We evaluate the policy and HRM learning components of our approach using two domains described below. We report average performance across 5 runs, each consisting of a different set of 10 random instances. Learning curves show the average undiscounted return obtained by the greedy policy every 100 episodes across instances. For other metrics (e.g., learning times), we present the average and the standard error, with the latter in brackets. In HRM learning experiments, we set a 2-hour timeout for learning the HRMs. See Appendix F for experimental details and extended results.

Domains. We consider four grid types for the CRAFTWORLD domain introduced in Section 3: an open-plan 7 × 7 grid (OP, Figure 1a), an open-plan 7 × 7 grid with a lava location (OPL), a 13 × 13 four-rooms grid (FR; Sutton et al., 1999), and a 13 × 13 four-rooms grid with a lava location per room (FRL). The lava proposition must always be avoided. WATERWORLD (Karpathy, 2015; Sidor, 2016; Toro Icarte et al., 2018) consists of a 2D box containing 12 balls of 6 different colors (2 per color), each moving at a constant speed in a fixed direction. The agent ball can change its velocity in any cardinal direction. The propositions P = {r, g, b, c, m, y} are the balls' colors. Labels consist of the colors of the balls the agent overlaps with and, unlike in CRAFTWORLD, they may contain multiple propositions. The tasks consist of observing color sequences. We consider two settings: without dead-ends (WOD) and with dead-ends (WD). In WD, the agent must avoid 2 balls of an extra color.

Policy Learning in Handcrafted HRMs. We compare the performance of policy learning in handcrafted non-flat HRMs against that in flat equivalents. Recall that an equivalent flat HRM always exists for any HRM (see Theorem 1). For fairness, the flat HRMs are minimal. Figure 2 shows the learning curves for some CRAFTWORLD tasks in the FRL setting.
The convergence rate is similar in the simplest task (MILKBUCKET), but higher for non-flat HRMs in the hardest ones. As both approaches use the same set of formula option policies, the differences arise from the lack of modularity in flat HRMs. Call options, which are not present in flat HRMs, constitute independent modules that help reduce reward sparsity. MILKBUCKET involves fewer high-level steps than BOOKQUILL and CAKE, thus the reward is less sparse and non-flat HRMs are not as beneficial. The effectiveness of non-flat HRMs is also limited when (i) the task's goal is reachable regardless of the chosen options (e.g., if there are no edges to rejecting states, as in OP and FR), and (ii) the reward is not too sparse, as in OPL (the grid is small) or WATERWORLD (the balls can easily get near the agent).

Learning of Non-Flat HRMs. Figure 3 shows the LHRM learning curves for CRAFTWORLD (FRL) and WATERWORLD (WD). These settings are the most challenging due to the inclusion of dead-ends since (i) dead-ends hinder the observation of goal examples in level 1 tasks using random walks, (ii) the RMs must include rejecting states, (iii) formula options must avoid dead-ends, and (iv) call options must avoid invoking options leading to rejecting states. In line with the curriculum method, LHRM does not start learning a task of a given level until tasks in previous levels are mastered. Convergence for high-level tasks is often fast due to the reuse of lower-level HRMs and policies. The average time (in seconds) exclusively spent on learning all HRMs is 1009.8 (122.3) for OP, 1622.6 (328.7) for OPL, 1031.6 (150.3) for FR, 1476.8 (175.3) for FRL, 35.4 (2.0) for WOD, and 67.0 (6.2) for WD. Including dead-ends (OPL, FRL, WD) incurs longer executions since (i) there is one more proposition, (ii) there are edges to the rejecting state(s), and (iii) there are dead-end traces to cover.
We observe that the complexity of learning an HRM does not necessarily correspond with the task complexity (e.g., the times for OP and FRL are similar). Learning in WATERWORLD is faster than in CRAFTWORLD since the RMs have fewer states and there are fewer callable RMs.

[Figure 3: LHRM learning curves for CRAFTWORLD (FRL) and WATERWORLD (WD). The legend in WATERWORLD separates tasks by level (level 1: RG [r ; g], BC [b ; c], MY [m ; y]; level 2: RG&BC, BC&MY, RG&MY, RGB [RG ; b], CMY [c ; MY]; level 3: RGB&CMY), and the subtask order (in brackets) follows that introduced in Table 1. The dotted vertical lines correspond to episodes in which an HRM is learned.]

By restricting the callable RMs to those required by the HRM (e.g., using just the PAPER and LEATHER RMs to learn BOOK's HRM), there are fewer ways to label the edges of the induced root. Learning becomes 5-7× faster using 20% fewer calls to the learner (i.e., fewer examples) in CRAFTWORLD, and 1.5× faster in WATERWORLD; conversely, with an unrestricted set of callable RMs, HRM learning becomes less scalable as the number of tasks and levels grows. This is an instance of the utility problem (Minton, 1988). Refining the set of callable RMs a priori to speed up HRM learning is a direction for future work.

We evaluate the performance of exploration with options using the number of episodes needed to collect the ρ goal traces for a given task since the activation of its level. Intuitively, the agent will rarely move far from a region of the state space using primitive actions only, thus taking more time to collect the traces; in contrast, options enable the agent to explore the state space more efficiently. In the FRL setting of CRAFTWORLD, we observe that using primitive actions requires 128.1× more episodes than using options in MILKBUCKET, the only level 2 task for which the ρ traces are collected (although in just 2/5 runs). Likewise, primitive actions take 20.8× and 7.7× more episodes in OPL and WD, respectively.
In OP and WOD, options are not beneficial since episodes are relatively long (1,000 steps), there are no dead-ends, and it is easy to observe the different propositions. Finally, we observe that using a single goal trace to learn the first HRMs (ρ = ρ s = 1) incurs timeouts across all CRAFTWORLD settings, thus showing the value of using many short traces instead. Learning Flat HRMs. Learning a flat HRM is often less scalable than learning a non-flat equivalent since (i) previously learned HRMs cannot be reused, and (ii) the flat HRM usually has more states and edges (as shown in Theorem 2, the growth can be exponential). We compare the performance of learning (from interaction) a non-flat HRM using LHRM with that of learning an equivalent flat HRM using LHRM, DeepSynth (Hasanbeig et al., 2021), LRM (Toro Icarte et al., 2019) and JIRP (Xu et al., 2020). Akin to LHRM, JIRP induces RMs with explicit accepting states, while DeepSynth and LRM do not. We use the OP and WOD instances for CRAFTWORLD and WATERWORLD, respectively. The non-flat HRM for MILKBUCKET is learned in 1.5 (0.2) seconds, whereas the flat HRMs take longer to learn: 3.2 (0.6) w/LHRM, 325.6 (29.7) w/DeepSynth, 347.5 (64.5) w/LRM and 17.1 (5.5) w/JIRP. LHRM and JIRP learn minimal RMs, hence producing the same RM consisting of 4 states and 3 edges. DeepSynth and LRM do not learn a minimal RM but one that is good at predicting the next possible label given the current one. In domains like ours, where propositions can be observed anytime (i.e., without temporal dependencies between them), these methods tend to 'overfit' the input traces and produce large outputs that barely reflect the task's structure; e.g., DeepSynth learns RMs with 13.4 (0.4) states and 93.2 (1.7) edges. In contrast, methods learning minimal machines exclusively from observable traces may suffer from overgeneralization (Angluin, 1980) in other domains (e.g., with temporally-dependent propositions).
In more complex tasks such as BOOK, LHRM learns the non-flat HRM (see Figure 1b) in 191.2 (36.4) seconds, whereas methods learning the flat HRM (see Figure 1c) usually time out or, in the case of DeepSynth, learn bigger representations. The performance of DeepSynth, LRM and JIRP is poor in WATERWORLD since they all learn RMs whose edges are labeled with proposition sets instead of formulas, unlike LHRM; thus, the RMs may require exponentially more edges, motivating the use of formulas for abstraction. For instance, the flat HRM for RG requires 64 edges instead of 2, and only LHRM and JIRP can learn it within the time limit. All flat HRM learners time out in RG&BC, whereas the non-flat HRM is learned in 4.5 (0.3) seconds.

7. RELATED WORK

RMs and Composability. Our RMs differ from the original ones (Toro Icarte et al., 2018; 2022) in that (i) an RM can call other RMs, (ii) there are explicit accepting and rejecting states (Xu et al., 2020; Furelos-Blanco et al., 2021) , and (iii) transitions are labeled with propositional logic formulas instead of proposition sets (Furelos-Blanco et al., 2021) . Recent works derive RMs (and similar FSMs) from formal language specifications (Camacho et al., 2019; Araki et al., 2021) and expert demonstrations (Camacho et al., 2021) , or learn them from experience using discrete optimization (Toro Icarte et al., 2019) , SAT solving (Xu et al., 2020) , active learning (Gaon & Brafman, 2020; Xu et al., 2021) , state-merging (Xu et al., 2019; Gaon & Brafman, 2020) , program synthesis (Hasanbeig et al., 2021) or inductive logic programming (Furelos-Blanco et al., 2021) . Prior ways of composing RMs include (i) merging the state and reward transition functions (De Giacomo et al., 2020) , and (ii) encoding a multi-agent task using an RM, decomposing it into one RM per agent and executing them in parallel (Neary et al., 2021) . Task composability has also been modeled using subtask sequences called sketches (Andreas et al., 2017) , context-free grammars defining a subset of English (Chevalier-Boisvert et al., 2019) , formal languages (Jothimurugan et al., 2019; León et al., 2020; Wang et al., 2020) and logic-based algebras (Nangue Tasse et al., 2020) . Hierarchical RL. Our method for exploiting HRMs resembles a hierarchy of DQNs (Kulkarni et al., 2016) . Akin to option discovery methods, LHRM induces a set of options from experience. LHRM requires a set of propositions and tasks, which bound the number of discoverable options; similarly, some of these methods impose an explicit bound (Bacon et al., 2017; Machado et al., 2017) . 
LHRM requires each task to be solved at least once before learning an HRM (and, hence, options), just like other methods (McGovern & Barto, 2001; Stolle & Precup, 2002). The problem of discovering options for exploration has been considered before (Bellemare et al., 2016; Machado et al., 2017; Jinnai et al., 2019; Dabney et al., 2021). While our options are not explicitly discovered for exploration, we leverage them to find goal traces in new tasks. Levy et al. (2019) learn policies from multiple hierarchical levels in parallel by training each level as if the lower levels were optimal; likewise, we train call option policies from experiences where invoked options achieve their goal. HRMs are close to hierarchical abstract machines (HAMs; Parr & Russell, 1997) since both are hierarchies of FSMs; however, there are two core differences. First, HAMs do not have reward transition functions. Second, (H)RMs decouple the traversal from the policies, i.e., the (H)RM is followed independently of the agent's choices; thus, an agent exploiting an (H)RM must be able to interrupt its choices (see Section 4). While HAMs do not support interruption, Programmable HAMs (Andre & Russell, 2000) extend HAMs to support it along with other program-like features. Despite the resemblance, there are few works on learning HAMs (Leonetti et al., 2012), whereas there are many on learning RMs, suggesting that (H)RMs are expressive yet more amenable to learning. Curriculum Learning. Pierrot et al. (2019) learn hierarchies of neural programs given the level of each program, akin to our RMs' height; likewise, Andreas et al. (2017) prioritize tasks consisting of fewer high-level steps. The 'online' method by Matiisen et al. (2020) also keeps an estimate of each task's average return, but it is not applied in an HRL scenario. Wang et al. (2020) learn increasingly complex temporal logic formulas leveraging previously learned formulas using a set of templates.

8. CONCLUSIONS

We have here proposed (1) HRMs, a formalism that composes RMs in a hierarchy by enabling them to call each other, (2) an HRL method that exploits the structure of an HRM, and (3) a curriculum-based method for learning a collection of HRMs from traces. Non-flat HRMs have significant advantages over their flat equivalents. Theoretically, we have proved that the flat equivalent of a given HRM can have exponentially more states and edges. Empirically, we have shown that (i) our HRL method converges faster given a non-flat HRM instead of an equivalent flat one, and (ii) in line with the theory, learning an HRM is feasible in cases where learning its flat equivalent is not. LHRM assumes that the proposition set is known, dead-end indicators are shared across tasks, there is a fixed set of tasks, and the height of each HRM is provided. Relaxing these assumptions by forming the propositions from raw data, conditioning policies on dead-ends, and letting the agent propose its own composable tasks are promising directions for future work. Other interesting extensions include non-episodic settings and methods for learning globally optimal policies over HRMs.

REPRODUCIBILITY

To make our work more understandable and reproducible, we provide pseudo-code, proofs and examples throughout the paper. We here outline the content covered in the appendices that help with reproducibility.

A HIERARCHY TRAVERSAL EXAMPLE

The HRM in Figure 1b accepts the trace λ = ⟨{ }, { }, {}, { }, { }, { }⟩, whose traversal is H(λ) = ⟨v 0 , v 1 , v 2 , v 3 , v 4 , v 5 , v 6 ⟩, where:

v 0 = ⟨M 0 , u 0 0 , ⊤, []⟩,

v 1 = δ H (v 0 , { }) = δ H (⟨M 0 , u 0 0 , ⊤, []⟩, { }) = δ H (⟨M 1 , u 0 1 , ¬ , [⟨u 0 0 , u 1 0 , M 0 , M 1 , ¬ , ⊤⟩]⟩, { }) = δ H (⟨M ⊤ , u 0 ⊤ , ¬ ∧ , [⟨u 0 0 , u 1 0 , M 0 , M 1 , ¬ , ⊤⟩, ⟨u 0 1 , u 1 1 , M 1 , M ⊤ , , ¬ ⟩]⟩, { }) = δ H (⟨M 1 , u 1 1 , ⊤, [⟨u 0 0 , u 1 0 , M 0 , M 1 , ¬ , ⊤⟩]⟩, ⊥) = ⟨M 1 , u 1 1 , ⊤, [⟨u 0 0 , u 1 0 , M 0 , M 1 , ¬ , ⊤⟩]⟩,

v 2 = δ H (v 1 , { }) = δ H (⟨M 1 , u 1 1 , ⊤, [⟨u 0 0 , u 1 0 , M 0 , M 1 , ¬ , ⊤⟩]⟩, { }) = δ H (⟨M ⊤ , u 0 ⊤ , , [⟨u 0 0 , u 1 0 , M 0 , M 1 , ¬ , ⊤⟩, ⟨u 1 1 , u A 1 , M 1 , M ⊤ , , ⊤⟩]⟩, { }) = δ H (⟨M 1 , u A 1 , ⊤, [⟨u 0 0 , u 1 0 , M 0 , M 1 , ¬ , ⊤⟩]⟩, ⊥) = δ H (⟨M 0 , u 1 0 , ⊤, []⟩, ⊥) = ⟨M 0 , u 1 0 , ⊤, []⟩,

v 3 = δ H (v 2 , {}) = δ H (⟨M 0 , u 1 0 , ⊤, []⟩, {}) = ⟨M 0 , u 1 0 , ⊤, []⟩,

v 4 = δ H (v 3 , { }) = δ H (⟨M 0 , u 1 0 , ⊤, []⟩, { }) = δ H (⟨M 2 , u 0 2 , ⊤, [⟨u 1 0 , u 3 0 , M 0 , M 2 , ⊤, ⊤⟩]⟩, { }) = δ H (⟨M ⊤ , u 0 ⊤ , , [⟨u 1 0 , u 3 0 , M 0 , M 2 , ⊤, ⊤⟩, ⟨u 0 2 , u 1 2 , M 2 , M ⊤ , , ⊤⟩]⟩, { }) = δ H (⟨M 2 , u 1 2 , ⊤, [⟨u 1 0 , u 3 0 , M 0 , M 2 , ⊤, ⊤⟩]⟩, ⊥) = ⟨M 2 , u 1 2 , ⊤, [⟨u 1 0 , u 3 0 , M 0 , M 2 , ⊤, ⊤⟩]⟩,

v 5 = δ H (v 4 , { }) = δ H (⟨M 2 , u 1 2 , ⊤, [⟨u 1 0 , u 3 0 , M 0 , M 2 , ⊤, ⊤⟩]⟩, { }) = δ H (⟨M ⊤ , u 0 ⊤ , , [⟨u 1 0 , u 3 0 , M 0 , M 2 , ⊤, ⊤⟩, ⟨u 1 2 , u A 2 , M 2 , M ⊤ , , ⊤⟩]⟩, { }) = δ H (⟨M 2 , u A 2 , ⊤, [⟨u 1 0 , u 3 0 , M 0 , M 2 , ⊤, ⊤⟩]⟩, ⊥) = δ H (⟨M 0 , u 3 0 , ⊤, []⟩, ⊥) = ⟨M 0 , u 3 0 , ⊤, []⟩,

v 6 = δ H (v 5 , { }) = δ H (⟨M 0 , u 3 0 , ⊤, []⟩, { }) = δ H (⟨M ⊤ , u 0 ⊤ , , [⟨u 3 0 , u A 0 , M 0 , M ⊤ , , ⊤⟩]⟩, { }) = δ H (⟨M 0 , u A 0 , ⊤, []⟩, ⊥) = ⟨M 0 , u A 0 , ⊤, []⟩.

Example 1 shows the traversal but omits the intermediate applications of the hierarchical transition function δ H .
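The call-and-return mechanics behind δ H can be sketched with a small stack-based simulator. This is a hypothetical simplification for illustration, not the paper's formal definition: formulas are plain Python predicates over label sets, the call's context is folded into the call edge's formula, and a callee consumes the same label that triggered the call.

```python
from dataclasses import dataclass

@dataclass
class RM:
    """A reward machine: edges[state] lists (formula, target, callee)."""
    name: str
    init: str
    accept: str
    edges: dict  # state -> list of (predicate over label set, target, RM or None)

def step(stack, label):
    """One simplified hierarchical transition. The stack holds
    (machine, current state, caller's resume state) frames; a call edge
    pushes the callee, which then tries to consume the same label, and
    reaching a callee's accepting state pops back to the caller."""
    rm, state, resume = stack[-1]
    for formula, target, callee in rm.edges.get(state, []):
        if not formula(label):
            continue
        if callee is None:
            stack[-1] = (rm, target, resume)       # ordinary (leaf) transition
        else:
            stack.append((callee, callee.init, target))
            step(stack, label)                     # callee consumes same label
        break
    # pop finished callees, resuming each caller at its saved state
    while len(stack) > 1 and stack[-1][1] == stack[-1][0].accept:
        _, _, resume = stack.pop()
        rm, _, r = stack[-1]
        stack[-1] = (rm, resume, r)
    return stack

# Toy two-level hierarchy: M1 recognizes 'a' then 'b'; M0 calls M1, then needs 'c'
M1 = RM("M1", "u0", "uA", {"u0": [(lambda L: "a" in L, "u1", None)],
                           "u1": [(lambda L: "b" in L, "uA", None)]})
M0 = RM("M0", "v0", "vA", {"v0": [(lambda L: True, "v1", M1)],
                           "v1": [(lambda L: "c" in L, "vA", None)]})
stack = [(M0, M0.init, None)]
for L in ({"a"}, {"b"}, {"c"}):
    step(stack, L)
assert stack[-1][1] == "vA"  # the whole trace is accepted
```

Note how the clone-free stack plays the role of the call frames in the traversal above: entering M1 pushes a frame storing the state at which M0 resumes, mirroring the tuples stored in the hierarchy states.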

B EQUIVALENCE TO FLAT HIERARCHIES OF REWARD MACHINES

In this section, we prove the theorems introduced in Section 3 regarding the equivalence of an arbitrary HRM to a flat HRM.

B.1 PROOF OF THEOREM 1

We formally show that any HRM can be transformed into an equivalent one consisting of a single non-leaf RM. The latter HRM type is called flat since there is a single hierarchy level.

Definition 6. Given an HRM H = ⟨M, M r , P⟩, a constituent RM M i ∈ M is flat if its height h i is 1.

Definition 7. An HRM H = ⟨M, M r , P⟩ is flat if the root RM M r is flat.

We now define what it means for two HRMs to be equivalent. This definition is based on that used in automata theory (Sipser, 1997).

Definition 8. Given a set of propositions P and a labeling function l, two HRMs H = ⟨M, M r , P⟩ and H ′ = ⟨M ′ , M ′ r , P⟩ are equivalent if for any label trace λ one of the following conditions holds: (i) both HRMs accept λ, (ii) both HRMs reject λ, or (iii) neither of the HRMs accepts or rejects λ.

We now have all the required definitions to prove Theorem 1, which is restated below.

Theorem 1. Given an HRM H, there exists an equivalent flat HRM H.

To prove the theorem, we introduce an algorithm for flattening any HRM. Without loss of generality, we work on the case of an HRM with two hierarchy levels; that is, an HRM consisting of a root RM that calls flat RMs. Note that an HRM with an arbitrary number of levels can be flattened by considering the RMs in two levels at a time. We start by flattening RMs in the second level (i.e., with height 2), which use RMs in the first level (by definition, these are already flat), and once the second-level RMs are flat, we repeat the process with the levels above until the root is reached. This process is applicable since, by assumption, the hierarchies have neither cyclic dependencies nor recursion. For simplicity, we use the MDP reward assumption made in Section 2, i.e., the reward transition function of any RM M i is r i (u, u ′ ) = 1[u / ∈ U A i ∧ u ′ ∈ U A i ], as in Section 4. However, the proof below could be adapted to arbitrary definitions of r i (u, u ′ ). Preliminary Transformation Algorithm.
Before proving Theorem 1, we introduce an intermediate step that transforms a flat HRM into an equivalent one that takes into account the contexts with which it may be called. Remember that a call to an RM is associated with a context. In the case of two-level HRMs such as the ones we are considering in this flattening process, the context and the exit condition from the called flat RM must be satisfied. Crucially, the context must only be satisfied at the time of the call; that is, it only lasts for a single transition. Therefore, if we revisit the initial state of the called RM by taking an edge to it, the context should not be checked anymore. To make the need for this transformation clearer, we use the HRM illustrated in Figure 4a. The flattening algorithm described later embeds the called RM into the caller one; crucially, the context of the call is taken into account by putting it in conjunction with the outgoing edges from the initial state of the called RM. Figure 4b is a flat HRM obtained using the flattening algorithm; however, it does not behave like the HRM in Figure 4a. Following the definition of the hierarchical transition function δ H , the context of a call only lasts for a single transition in the called RM in Figure 4a (i.e., a ∧ ¬c is only checked when M 1 is started), but the context is kept permanently in Figure 4b, which is problematic if we go back to the initial state at some point. We later come back to this example after presenting the transformation algorithm. To deal with the situation above, we need to transform an RM to ensure that contexts are only checked once from the initial state. We describe this transformation as follows.
Given a flat HRM H = ⟨M, M r , P⟩ with root M r = ⟨U r , P, φ r , r r , u 0 r , U A r , U R r ⟩, we construct a new HRM H ′ = ⟨M ′ , M ′ r , P⟩ with root M ′ r = ⟨U ′ r , P, φ ′ r , r ′ r , u 0 r , U A r , U R r ⟩ such that:

• U ′ r = U r ∪ {û 0 r }, where û 0 r plays the role of the initial state after the first transition is taken.

• The state transition function φ ′ r is built by copying φ r and applying the following changes:

1. Remove the edges to the actual initial state from any state v ∈ U ′ r : φ ′ r (v, u 0 r , M ⊤ ) = ⊥. Note that since the RM is flat, the only callable RM is the leaf M ⊤ .

2. Add edges to the dummy initial state û 0 r from all states v ∈ U ′ r that had an edge to the actual initial state: φ ′ r (v, û 0 r , M ⊤ ) = φ r (v, u 0 r , M ⊤ ).

3. Add edges from the dummy initial state û 0 r to all those states v ∈ U ′ r that the actual initial state u 0 r points to: φ ′ r (û 0 r , v, M ⊤ ) = φ ′ r (u 0 r , v, M ⊤ ).

• The reward transition function r ′ r (u, u ′ ) = 1[u / ∈ U A r ∧ u ′ ∈ U A r ] is defined as stated at the beginning of the section.

Figure 4: (a) Original HRM with root M 0 . (b) Flattened HRM without transforming M 1 . (c) Transformed M 1 from (a). (d) Flattened HRM using the transformed M 1 .

The HRM H ′ is such that M ′ = {M ′ r , M ⊤ }. Note that this transformation is only required in HRMs where the RMs have initial states with incoming edges. We now prove that this transformation is correct; that is, the HRMs are equivalent. There are two cases depending on whether the initial state has incoming edges or not. First, if the initial state u 0 r does not have incoming edges, step 1 does not remove any edges going to u 0 r , and step 2 does not add any edges going to û 0 r , making it unreachable. Even though edges from û 0 r to other states may be added, this is irrelevant since û 0 r is unreachable. Therefore, in this case the transformed HRM is equivalent to the original one. Second, if the initial state has incoming edges, we prove equivalence by examining the traversals H(λ) and H ′ (λ) for the original HRM H = ⟨M, M r , P⟩ and the transformed one H ′ = ⟨M ′ , M ′ r , P⟩ given a generic label trace λ = ⟨L 0 , . . . , L n ⟩. By construction, both H(λ) and H ′ (λ) will be identical until reaching a state w with an outgoing transition to u 0 r in the case of H and to the dummy initial state û 0 r in the case of H ′ . More specifically, upon reaching w and satisfying an outgoing formula to the aforementioned states, the traversals are: H(λ) = ⟨⟨M r , u 0 r , ⊤, []⟩, . . . , ⟨M r , w, ⊤, []⟩⟩, H ′ (λ) = ⟨⟨M ′ r , u 0 r , ⊤, []⟩, . . . , ⟨M ′ r , w, ⊤, []⟩⟩. By construction, state w is in both HRMs, and both of the aforementioned transitions from this state are associated with the same formula, i.e., φ r (w, u 0 r , M ⊤ ) = φ ′ r (w, û 0 r , M ⊤ ). Therefore, if one of them is satisfied, the other will be too, and the traversals will become: H(λ) = ⟨⟨M r , u 0 r , ⊤, []⟩, . . .
, ⟨M r , w, ⊤, []⟩, ⟨M r , u 0 r , ⊤, []⟩⟩, H ′ (λ) = ⟨⟨M ′ r , u 0 r , ⊤, []⟩, . . . , ⟨M ′ r , w, ⊤, []⟩, ⟨M ′ r , û0 r , ⊤, []⟩⟩. We stay in u 0 r and û0 r until a transition to a state w ′ is satisfied. By construction, w ′ is in both HRMs and the same formula is satisfied, i.e., φ r (u 0 r , w ′ , M ⊤ ) = φ ′ r (û 0 , w ′ , M ⊤ ). The hierarchy traversals then become: H(λ) = ⟨⟨M r , u 0 r , ⊤, []⟩, . . . , ⟨M r , w, ⊤, []⟩, ⟨M r , u 0 r , ⊤, []⟩, . . . , ⟨M r , u 0 r , ⊤, []⟩, ⟨M r , w ′ , ⊤, []⟩⟩, H ′ (λ) = ⟨⟨M ′ r , u 0 r , ⊤, []⟩, . . . , ⟨M ′ r , w, ⊤, []⟩, ⟨M ′ r , û0 r , ⊤, []⟩, . . . , ⟨M ′ r , û0 r , ⊤, []⟩, ⟨M ′ r , w ′ , ⊤, []⟩⟩. From here both traversals will be the same until transitions to u 0 r and û0 r are respectively satisfied again (if any) in H and H ′ . Clearly, the only change in H(λ) with respect to H ′ (λ) (except for the different roots) is that the hierarchy states of the form ⟨M ′ r , û0 r , ⊤, []⟩ in the latter appear as ⟨M r , u 0 r , ⊤, []⟩ in the former. We now check if the equivalence conditions from Definition 8 hold: • If H(λ) ends with state u 0 r , H ′ (λ) ends with state û0 r following the reasoning above. By construction, neither of these states is accepting or rejecting; therefore, neither of these HRMs accepts or rejects λ. • If H(λ) ends with state w, H ′ (λ) will also end with this state following the reasoning above. Therefore, if w is an accepting state, both HRMs accept λ; if w is a rejecting state, both HRMs reject λ; and if w is not an accepting or rejecting state, neither of the HRMs accepts or rejects λ. Since all equivalence conditions are satisfied for any trace λ, H and H ′ are equivalent. Figure 4c exemplifies the output of the transformation algorithm given M 1 in Figure 4a as input, whereas Figure 4d is the output of the flattening algorithm discussed next, which properly handles the context unlike the HRM in Figure 4b . Flattening Algorithm. We describe the algorithm for flattening an HRM. 
As previously stated, we assume without loss of generality that the HRM to be flattened consists of two hierarchy levels (i.e., the root calls flat RMs). We also assume that the flat RMs have the form produced by the previously presented transformation algorithm. Given an HRM H = ⟨M, M r , P⟩ with root M r = ⟨U r , P, φ r , r r , u 0 r , U A r , U R r ⟩, we build a flat RM Mr = ⟨ Ūr , P, φr , rr , ū0 r , ŪA r , ŪR r ⟩ using the following steps:

1. Copy the sets of states and initial state from M r (i.e., Ūr = U r , ū0 r = u 0 r , ŪA r = U A r , ŪR r = U R r ).

2. Loop through the non-false entries of the transition function φ r and decide what to copy. That is, for each triplet (u, u ′ , M j ) where u, u ′ ∈ U r and M j ∈ M such that φ r (u, u ′ , M j ) ̸ = ⊥:

(a) If M j = M ⊤ (i.e., the called RM is the leaf), we copy the transition: φr (u, u ′ , M ⊤ ) = φ r (u, u ′ , M ⊤ ).

(b) If M j ̸ = M ⊤ , we embed the transition function of M j = ⟨U j , P, φ j , r j , u 0 j , U A j , U R j ⟩ into Mr . Remember that M j is flat. To do so, we run the following steps:

i. Update the set of states by adding all non-initial and non-accepting states from M j . Similarly, the set of rejecting states is also updated by adding all rejecting states of the called RM. The initial and accepting states from M j are unimportant: their roles are played by u and u ′ respectively. In contrast, the rejecting states are important since, by assumption, they are global. Note that the added states v are renamed to v u,u ′ ,j in order to take into account the edge being embedded: if the same state v was reused for another edge, then we would not be able to distinguish them. Formally, Ūr = Ūr ∪ {v u,u ′ ,j | v ∈ U j \ ({u 0 j } ∪ U A j )}, and ŪR r = ŪR r ∪ {v u,u ′ ,j | v ∈ U R j }.

ii. Embed the transition function φ j of M j into φr .
Since M j is flat, we can make copies of the transitions straightaway: the only important thing is to check whether these transitions involve initial or accepting states which, as stated before, are going to be replaced by u and u ′ accordingly. Given a triplet (v, w, M ⊤ ) such that v, w ∈ U j and for which φ j (v, w, M ⊤ ) = ϕ and ϕ ̸ = ⊥, we update φr as follows:

A. If v = u 0 j and w / ∈ U A j , then φr (u, w u,u ′ ,j , M ⊤ ) = DNF(ϕ ∧ φ r (u, u ′ , M j )). The initial state of M j has been substituted by u, we use the clone of w associated with the call (w u,u ′ ,j ), and append the context of the call to M j to the formula ϕ.

B. If v = u 0 j and w ∈ U A j , then φr (u, u ′ , M ⊤ ) = DNF(ϕ ∧ φ r (u, u ′ , M j )). Like the previous case but performing two substitutions: u replaces v and u ′ replaces w. The context is appended since it is a transition from the initial state of M j .

C. If v ̸ = u 0 j and w ∈ U A j , then φr (v u,u ′ ,j , u ′ , M ⊤ ) = ϕ. We substitute the accepting state w by u ′ , and use the clone of v associated with the call (v u,u ′ ,j ). This time the call's context is not added since v is not the initial state of M j .

D. If none of the previous cases holds, there are no substitutions to be made nor contexts to be taken into account. Hence, φr (v u,u ′ ,j , w u,u ′ ,j , M ⊤ ) = ϕ. We just use the clones of v and w corresponding to the call (v u,u ′ ,j and w u,u ′ ,j ).

3. We apply the transformation algorithm described before, and form a new flat HRM H = ⟨{ Mr , M ⊤ }, Mr , P⟩ with the flattened (and transformed) root Mr . The reward transition function r ′ r (u, u ′ ) = 1[u / ∈ ŪA r ∧ u ′ ∈ ŪA r ] is defined as stated at the beginning of the section. Note that u might not necessarily be a state of the non-flat root, but derived from an RM with lower height.

We now have everything to prove the previous theorem.
Without loss of generality and for simplicity, we assume that the transformation algorithm has not been applied over the flattened root (we have already shown that the transformation produces an equivalent flat machine).

Theorem 1. Given an HRM H, there exists an equivalent flat HRM H.

Proof. Let H = ⟨ M, Mr , P⟩, where Mr = ⟨ Ūr , P, φr , rr , ū0 r , ŪA r , ŪR r ⟩, be the flat HRM that results from applying the flattening algorithm on an HRM H = ⟨M, M r , P⟩, where M r = ⟨U r , P, φ r , r r , u 0 r , U A r , U R r ⟩. For these HRMs to be equivalent, any label trace λ = ⟨L 0 , . . . , L n ⟩ must satisfy one of the conditions in Definition 8. To prove the equivalence, we examine the hierarchy traversals H(λ) and H(λ) given a generic label trace λ. Let u ∈ U r be a state in the root M r of H and let φ r (u, u ′ , M ⊤ ) be a satisfied transition from that state. By construction, u is also in the root Mr of the flat hierarchy H, and Mr has an identical transition φr (u, u ′ , M ⊤ ), which must also be satisfied. If the hierarchy states are ⟨M r , u, ⊤, []⟩ and ⟨ Mr , u, ⊤, []⟩ for H and H respectively, then the next hierarchy states upon application of δ H will be ⟨M r , u ′ , ⊤, []⟩ and ⟨ Mr , u ′ , ⊤, []⟩. Therefore, both HRMs behave equivalently when calls to the leaf RM are made. We now examine what occurs when a non-leaf RM is called in H. Let φ r (u, u ′ , M j ) be a satisfied transition in M r , and let φ j (u 0 j , w, M ⊤ ) be a satisfied transition from M j 's initial state. By construction, Mr contains a transition whose associated formula is the conjunction of the previous two, i.e., φ r (u, u ′ , M j ) ∧ φ j (u 0 j , w, M ⊤ ). Now, the hierarchy traversals will be different depending on w:

• If w / ∈ U A j (i.e., w is not an accepting state of M j ), by construction, Mr contains the transition φr (u, w u,u ′ ,j , M ⊤ ) = φ r (u, u ′ , M j ) ∧ φ j (u 0 j , w, M ⊤ ).
If the current hierarchy states are (the equivalent) ⟨M r , u, ⊤, []⟩ and ⟨ Mr , u, ⊤, []⟩ for H and H, then the next hierarchy states upon application of δ H are ⟨M j , w, ⊤, [⟨u, u ′ , M r , M j , φ r (u, u ′ , M j ), ⊤⟩]⟩ and ⟨ Mr , w u,u ′ ,j , ⊤, []⟩. These hierarchy states are equivalent since w u,u ′ ,j is a clone of w that saves all the call information (i.e., a call to machine M j for transitioning from u to u ′ ).

• If w ∈ U A j (i.e., w is an accepting state of M j ), by construction, Mr contains the transition φr (u, u ′ , M ⊤ ) = φ r (u, u ′ , M j ) ∧ φ j (u 0 j , w, M ⊤ ). If the current hierarchy states are (the equivalent) ⟨M r , u, ⊤, []⟩ and ⟨ Mr , u, ⊤, []⟩ for H and H, then the next hierarchy states upon application of δ H are ⟨M r , u ′ , ⊤, []⟩ and ⟨ Mr , u ′ , ⊤, []⟩. These hierarchy states are clearly equivalent since the machine states are exactly the same.

We now check the case in which we are inside a called RM. Let φ r (u, u ′ , M j ) be the transition that caused H to start running M j , and let φ j (v, w, M ⊤ ) be a satisfied transition within M j such that v ̸ = u 0 j . By construction, Mr contains a transition associated with the same formula φ j (v, w, M ⊤ ). The hierarchy traversals vary depending on w:

• If w / ∈ U A j (i.e., w is not an accepting state of M j ), by construction, Mr contains the transition φr (v u,u ′ ,j , w u,u ′ ,j , M ⊤ ) = φ j (v, w, M ⊤ ). For the transition to be taken in H, the hierarchy state must be ⟨M j , v, ⊤, [⟨u, u ′ , M r , M j , φ r (u, u ′ , M j ), ⊤⟩]⟩, whereas in H it will be ⟨ Mr , v u,u ′ ,j , ⊤, []⟩. These hierarchy states are clearly equivalent: v u,u ′ ,j is a clone of v that saves all information related to the call being made (the called machine, and the starting and resulting states in the transition).
Upon application of δ H , the hierarchy states will remain equivalent: ⟨M j , w, ⊤, [⟨u, u ′ , M r , M j , φ r (u, u ′ , M j ), ⊤⟩]⟩ and ⟨ Mr , w u,u ′ ,j , ⊤, []⟩ (again, w u,u ′ ,j saves all the call information, just like the stack).

• If w ∈ U A j (i.e., w is an accepting state of M j ), by construction, Mr contains the transition φr (v u,u ′ ,j , u ′ , M ⊤ ) = φ j (v, w, M ⊤ ). This case corresponds to the one where control is returned to the calling RM. Like in the previous case, for the transition to be taken in H, the hierarchy state must be ⟨M j , v, ⊤, [⟨u, u ′ , M r , M j , φ r (u, u ′ , M j ), ⊤⟩]⟩, whereas in H it will be ⟨ Mr , v u,u ′ ,j , ⊤, []⟩. The resulting hierarchy states then become ⟨M r , u ′ , ⊤, []⟩ and ⟨ Mr , u ′ , ⊤, []⟩ respectively, which are clearly equivalent (the state is exactly the same and both come from equivalent hierarchy states).

We have shown that both HRMs have equivalent traversals for any given trace, implying that both will accept, reject, or neither accept nor reject a trace. Therefore, the HRMs are equivalent.

Figure 5a shows the result of applying the flattening algorithm on the BOOK HRM shown in Figure 1b. Note that the resulting HRM can be compressed: there are two states having an edge with the same label to a specific state. Indeed, the presented algorithm might not produce the smallest possible flat equivalent. Figure 5b shows the resulting compressed HRM, which is like Figure 1c but naming the states following the algorithm for clarity. Estimating how much a flat HRM (or any HRM) can be compressed and designing an algorithm to perform such compression are left as future work.

Figure 5: Results of flattening the HRM in Figure 1b. (a) Without compression. (b) With compression. The notation u i j:x,y denotes the i-th state of RM j in the call between states x and y in the parent RM. Note that x and y appear only if that state comes from a called RM. The blue states and edges in (a) can be compressed as shown in (b).
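The state-cloning and context-conjunction bookkeeping of the flattening algorithm can be sketched as follows. This is a hypothetical toy encoding, not the paper's implementation: RMs are lists of edges (src, dst, formula, callee), formulas are strings, every called RM has initial state "u0" and a single accepting state "uA", and the initial-state transformation from the previous subsection is assumed to have been applied already.

```python
def conj(a, b):
    """Textual conjunction; 'T' stands for the formula ⊤ (true)."""
    if a == "T":
        return b
    if b == "T":
        return a
    return f"({a})&({b})"

def flatten(root, called):
    """Embed each called flat RM into the root (cases A-D in the text)."""
    flat = []
    for u, up, ctx, callee in root:
        if callee is None:                  # call to the leaf M⊤: copy as-is
            flat.append((u, up, ctx, None))
            continue
        def ren(v):
            # the callee's initial/accepting states are replaced by u/u';
            # other states are cloned and tagged with the embedded call
            if v == "u0":
                return u
            if v == "uA":
                return up
            return f"{v}[{callee}:{u}->{up}]"
        for v, w, phi, _ in called[callee]:
            # the context only guards edges leaving the callee's initial state
            f = conj(phi, ctx) if v == "u0" else phi
            flat.append((ren(v), ren(w), f, None))
    return flat

# Example in the spirit of Figure 4: the root calls M1 under context ~c
M1 = [("u0", "u1", "a", None), ("u1", "uA", "b", None)]
root = [("v0", "vA", "~c", "M1")]
flat = flatten(root, {"M1": M1})
assert ("v0", "u1[M1:v0->vA]", "(a)&(~c)", None) in flat
assert ("u1[M1:v0->vA]", "vA", "b", None) in flat
```

The tagged clone names play the role of the v u,u′,j states in the text: they record which call an embedded state came from, so reusing the same RM on two different edges never merges the two copies.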

B.2 PROOF OF THEOREM 2

We prove Theorem 2 by first characterizing an HRM H using a set of abstract parameters. Then, we describe how the number of states and edges in an HRM and its corresponding flat equivalent are computed, and use these quantities to give an example for which the theorem holds. The parameters are the following:

• The height of the root, h r .
• The number of RMs with height i, N (i) .
• The number of states in RMs with height i, U (i) .
• The number of edges from each state in RMs with height i, E (i) .

We assume that (i) RMs with height i only call RMs with height i - 1; (ii) all RMs have a single accepting state and no rejecting states; (iii) all RMs except for the root are called; and (iv) the HRM is well-formed (i.e., it behaves deterministically and there are no cyclic dependencies). Note that N (hr) = 1 since there is a single root. Assumption (i) can be made since for the root to have height h r we need it to call at least one RM with height h r - 1. Considering that all called RMs have the same height simplifies the analysis since we can characterize the RMs at each height independently. Assumption (ii) is safe to make since a single accepting state is enough, and it helps simplify the counting since only some RMs could have rejecting states. Assumption (iii) ensures that the flat HRM will comprise all RMs in the original HRM. This is also a fair assumption: if a given RM is not called by any RM in the hierarchy, we could remove it beforehand. The number of states |H| in the HRM H is obtained by summing the number of states of each RM: |H| = Σ_{i=1}^{hr} N (i) U (i) . The number of states | H| in the flat HRM H is given by the number of states in the flattened root RM, | H| = Ū (hr) , where Ū (i) is the number of states in the flattened representation of an RM with height i, which is recursively defined as:

Ū (i) = U (i) if i = 1,
Ū (i) = U (i) + (Ū (i-1) - 2)(U (i) - 1)E (i) if i > 1.
That is, the flattened RM with height i keeps all the states that the non-flat RM had. In addition, each of the U (i) - 1 non-accepting states in the non-flat RM has E (i) edges, each of which calls an RM with height i - 1 whose number of states is Ū (i-1) . These edges are replaced by the called RM except for its initial and accepting states, whose roles are now played by the states involved in the substituted edge (hence the -2). This construction process corresponds to the one used to prove Theorem 1. The total number of edges in an HRM is given by Σ_{i=1}^{hr} N (i) (U (i) - 1)E (i) , where (U (i) - 1)E (i) is the total number of edges in an RM with height i (the -1 is because the accepting state is discarded), so N (i) (U (i) - 1)E (i) determines how many edges there are across RMs with height i. The total number of edges in the flat HRM is given by the total number of edges in the flattened root RM, Ē(hr) , where Ē(i) is the total number of edges in the flattened representation of an RM with height i, which is recursively defined as follows:

Ē(i) = (U (i) - 1)E (i) if i = 1,
Ē(i) = (U (i) - 1)E (i) Ē(i-1) if i > 1.

That is, each of the (U (i) - 1)E (i) edges in an RM with height i is replaced by Ē(i-1) edges given by an RM with height i - 1 (if any). The key intuition is that an HRM with root height h r > 1 is beneficial representation-wise if the number of calls across RMs with height i is higher than the number of RMs with height i - 1; in other words, RMs with lower heights are being reused. Numerically, the total number of edges/calls in an RM with height i is (U (i) - 1)E (i) and, therefore, the total number of calls across RMs with height i is (U (i) - 1)E (i) N (i) . If this quantity is higher than N (i-1) , then RMs with lower heights are reused, and therefore having RMs with different heights is beneficial.

Theorem 2. Let H = ⟨M, M r , P⟩ be an HRM and let h r be the height of its root M r .
The number of states and edges in an equivalent flat HRM H̄ can be exponential in h_r.

Proof. By example. Let H = ⟨M, M_r, P⟩ be an HRM whose root M_r has height h_r and is parameterized by N^(i) = 1, U^(i) = 3, E^(i) = 1 for i = 1, ..., h_r. Figure 6 shows an instance of this hierarchy: each RM M_i with 1 < i ≤ h_r consists of states u⁰_i, u¹_i and u^A_i connected by two successive calls to M_{i-1} (each under context ⊤), while M_1 consists of two successive calls to the leaf M_⊤ labeled a and b, respectively.

Let us write the number of states in the flat RMs of each level:

    Ū^(1) = U^(1) = 3,
    Ū^(2) = U^(2) + (Ū^(1) - 2)(U^(2) - 1)E^(2) = 3 + (3 - 2)(3 - 1)·1 = 5,
    Ū^(3) = U^(3) + (Ū^(2) - 2)(U^(3) - 1)E^(3) = 3 + (5 - 2)(3 - 1)·1 = 9,
    ...
    Ū^(i) = 2Ū^(i-1) - 1 = 2^i + 1.

Hence, the number of states in the flat HRM is |H̄| = Ū^(h_r) = 2^{h_r} + 1, showing that the number of states in the flat HRM grows exponentially with the height of the root. In contrast, the number of states in the HRM grows linearly with the height of the root: |H| = Σ_{i=1}^{h_r} N^(i) U^(i) = Σ_{i=1}^{h_r} 1·3 = 3h_r.

In the case of the total number of edges, we again write some iterations to derive a general expression:

    Ē^(1) = (U^(1) - 1)E^(1) = (3 - 1)·1 = 2,
    Ē^(2) = (U^(2) - 1)E^(2) Ē^(1) = (3 - 1)·1·2 = 4,
    Ē^(3) = (U^(3) - 1)E^(3) Ē^(2) = (3 - 1)·1·4 = 8,
    ...
    Ē^(i) = 2Ē^(i-1) = 2^i.

Therefore, the total number of edges in the flat HRM is Ē^(h_r) = 2^{h_r}. In contrast, the total number of edges in the HRM grows linearly: Σ_{i=1}^{h_r} N^(i)(U^(i) - 1)E^(i) = Σ_{i=1}^{h_r} 1·(3 - 1)·1 = 2h_r.

Finally, we emphasize that the resulting flat HRM cannot be compressed, unlike the HRM in Figure 5: each state has at most one incoming edge, so there are no multiple paths that can be merged. We have thus shown that there are HRMs whose equivalent flat HRM has a number of states and edges that grows exponentially with the height of the root.
Using the aforementioned intuition, we observe that the hierarchical structure is indeed expected to be useful: the number of calls across RMs with height i is N^(i)(U^(i) - 1)E^(i) = 1·(3 - 1)·1 = 2, which is greater than the number of RMs with height i-1 (only 1). There are cases where having a multi-level hierarchy (i.e., with h_r > 1) is not beneficial. For instance, given an HRM whose root has height h_r and parameterized by N^(i) = 1, U^(i) = 2 and E^(i) = 1, the number of states in the equivalent flat HRM is constant (2), whereas in the HRM itself it grows linearly with h_r. The same occurs with the number of edges. Checking the previously introduced intuition, we observe that N^(i)(U^(i) - 1)E^(i) = 1·(2 - 1)·1 = 1, which is not greater than N^(i-1) = 1; this verifies that having non-reused RMs at multiple heights is not useful.
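The recursions above are easy to check numerically. A minimal sketch under the proof's parameterization (N^(i) = 1, so U^(i) = U and E^(i) = E at every height; the function name is ours, not the paper's):

```python
def flat_counts(h, U=3, E=1):
    """States and edges of the flattened root RM of height h, assuming one
    RM per height (N^(i) = 1) with U states and E edges per state.

    Implements the recursions:
      U_bar(1) = U,                               E_bar(1) = (U - 1) * E,
      U_bar(i) = U + (U_bar(i-1) - 2)*(U - 1)*E,  E_bar(i) = (U - 1)*E * E_bar(i-1).
    """
    u_bar, e_bar = U, (U - 1) * E
    for _ in range(2, h + 1):
        u_bar = U + (u_bar - 2) * (U - 1) * E
        e_bar = (U - 1) * E * e_bar
    return u_bar, e_bar

# The flat HRM grows exponentially (2^h + 1 states, 2^h edges), while the
# HRM itself has only 3h states and 2h edges.
for h in range(1, 10):
    assert flat_counts(h) == (2 ** h + 1, 2 ** h)
```

With U = 2 instead, the loop yields a constant (2, 1) for every h, matching the non-beneficial case discussed above.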

C POLICY LEARNING IMPLEMENTATION DETAILS

In this appendix, we describe some implementation details that were omitted in Section 4 for simplicity. First, we describe some methods used in policy learning. Second, we explain the option selection algorithm step by step and provide examples to ease its understanding.

C.1 POLICIES

Deep Q-networks (DQNs; Mnih et al., 2015). We use double DQNs (van Hasselt et al., 2016) for both formula and call options. The DQNs associated with formula options simply take an MDP state and output a Q-value for each action. In contrast, the DQNs associated with call options also take an RM state and a context, which are encoded as follows:
• The RM state is encoded as a one-hot vector whose size is the number of states in the RM.
• The context, which is either ⊤ or a DNF formula with a single disjunct (i.e., a conjunction), is encoded as a vector whose size is the number of propositions |P|. Each vector position corresponds to a proposition p ∈ P, and its value depends on how p appears in the context: (i) +1 if p appears positively, (ii) -1 if p appears negatively, or (iii) 0 if p does not appear. Note that if the context is ⊤, the vector consists solely of zeros.
These DQNs output a value for each possible call in the RM; however, some of these values must be masked if the corresponding calls are not available from the RM state-context pair used as input. For instance, the DQN for M_0 in Figure 1b outputs a value for ⟨M_1, ¬ ⟩, ⟨M_2, ⊤⟩, ⟨M_1, ⊤⟩, and ⟨M_⊤, ⟩. If the RM state is u⁰_0 and the context is ⊤, only the values for the first two calls are relevant. Just like unavailable calls, we also mask unsatisfiable calls (i.e., calls whose context cannot be satisfied in conjunction with the accumulated context used as input).

To speed up learning, only a subset of the Q-functions associated with formula options is updated after each step: updating all Q-functions after each step is costly, and we observed that similar performance could be obtained with this strategy. To determine the subset, we keep an update counter c_ϕ for each Q-function q_ϕ, and a global counter c (i.e., the total number of times Q-functions have been updated). The probability of updating q_ϕ is p_ϕ = s_ϕ / Σ_{ϕ'} s_{ϕ'}, where s_ϕ = c - c_ϕ - 1.
A subset of Q-functions is chosen according to this probability distribution, sampling without replacement.

Exploration. During training, the formula and call option policies are ϵ-greedy. In the case of formula options, akin to Q-functions, each option ω^{j,ϕ}_{i,u,Φ} performs exploration with an exploration factor ϵ_{ϕ∧Φ}, which linearly decreases with the number of steps performed using the policy induced by q_{ϕ∧Φ}. Likewise, Kulkarni et al. (2016) keep an exploration factor for each subgoal, but vary it depending on the option's success rather than on the number of performed steps. In the case of call options, each RM state-context pair is associated with its own exploration factor, which linearly decreases as options started from that pair terminate.

The Formula Tree. As explained in Section 4, each formula option's policy is induced by a Q-function associated with a formula. In domains where certain proposition sets cannot occur, it is unnecessary to consider formulas that cover some of these sets. For instance, in a domain where two propositions a and b cannot be simultaneously observed (i.e., it is impossible to observe {a, b}), formulas such as a ∧ ¬b or b ∧ ¬a can instead be represented by the more abstract formulas a or b; therefore, a ∧ ¬b and a can both be associated with a Q-function q_a, whereas b ∧ ¬a and b can both be associated with a Q-function q_b. By reducing the number of Q-functions, learning naturally becomes more efficient. We represent relationships between formulas using a formula tree which, as the name suggests, arranges a set of formulas in a tree structure. Formally, given a set of propositions P, a formula tree is a tuple ⟨F, F_r, L⟩, where F is a set of nodes, each associated with a formula; F_r ∈ F is the root of the tree, associated with the formula ⊤; and L ⊆ 2^P is a set of labels. All nodes in the tree except for the root are associated with conjunctions.
Let ν(X) ⊆ 2^P denote the set of labels that satisfy a formula X. A parent-child relationship between nodes X and Y indicates that Y is a special case of X (it adds literals but it is satisfied by exactly the same observed labels). The tree is organized such that the formula at a given node subsumes all its descendants. The set of Q-functions is determined by the children of the root. During the agent-environment interaction, the formula tree is updated if (i) a new formula appears in the learned HRMs, or (ii) a new label is observed. Algorithm 1 contains the pseudo-code for updating the tree in these two cases.

When a new formula is added (line 1), we create a node for the formula (line 2) and add it to the tree. The insertion place is determined by exploring the tree top-down from the root F_r (lines 3-19). First, we check whether a child of the current node subsumes the new node (line 7). If such a node exists, then we go down this path (lines 8-9); otherwise, the new node becomes a child of the current node (lines 16-17). In the latter case, in addition, all children of the current node that are subsumed by the new node must become children of the new node (lines 11-15). The other core case in which the tree may need an update occurs when a new label is observed (lines 20-25), since we must ensure that parenting relationships comply with the set of labels L. First, we find nodes inconsistent with the new label: a parenting relationship is broken (line 39) when the formula of the non-root parent node is satisfied by the label but the formula of the child node is not (or vice versa). Once the inconsistent nodes are found, we remove their current parenting relationships (lines 45-46) and reinsert them into the tree (line 47). Figure 7 shows two simple examples of formula trees, where the Q-functions are q_a in (a), and q_a and q_{a∧¬c} in (b).
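The subsumption test driving these tree updates can be sketched as follows (a simplification assuming conjunctions are represented as sets of signed literals and labels as sets of propositions; the names and representation are illustrative, not the paper's code):

```python
def satisfies(label, conj):
    """True iff a label (set of propositions) satisfies a conjunction of
    signed literals, e.g. {("a", True), ("b", False)} encodes a AND NOT b."""
    return all((p in label) == positive for p, positive in conj)

def subsumes(x, y, labels):
    """x subsumes y w.r.t. the observed labels: every observed label
    satisfying y also satisfies x (y is a special case of x)."""
    return all(satisfies(l, x) for l in labels if satisfies(l, y))

# A domain where a and b never co-occur: over the observed labels,
# a and a AND NOT b are satisfied by exactly the same labels, so each
# subsumes the other and both can share a single Q-function q_a.
labels = [frozenset({"a"}), frozenset({"b"})]
a = {("a", True)}
a_not_b = {("a", True), ("b", False)}
assert subsumes(a, a_not_b, labels) and subsumes(a_not_b, a, labels)
```

When a new label (say {a, b}) is observed, re-running the test may break such relationships, which is exactly the reinsertion case handled in lines 20-25 of Algorithm 1.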

C.2 OPTION SELECTION ALGORITHM

Algorithm 2 shows how options are selected, updated and interrupted during an episode. Lines 1-3 correspond to the algorithm's initialization. The initial state is that of the environment, while the initial hierarchy state is formed by the root RM M_r, its initial state u⁰_r, an empty context (i.e., Φ = ⊤), and an empty call stack. The option stack Ω_H contains the options currently running, where options at the front are the shallowest ones (e.g., the first option in the list is taken in the root RM). The steps taken during an episode are shown in lines 4-14, which are grouped as follows:
1. The agent fills the option stack Ω_H by selecting options in the HRM from the current hierarchy state until a formula option is chosen (lines 15-25). The context is propagated and augmented through the HRM (i.e., the context of each call is conjoined with the propagating context and converted into DNF). Note that the context is initially ⊤ (true), and not that of the hierarchy state. It is possible that no new options are selected if the formula option chosen in a previous step has not terminated yet.
2. The agent chooses an action according to the last option in the option stack (line 6), which will always be a formula option whose policy maps states into actions. The action is then applied in the environment.

Algorithm 1 Formula tree operations

Input: A formula tree ⟨F, F_r, L⟩, where F is a set of nodes, F_r ∈ F is the root node (associated with the formula ⊤), and L is a set of labels.

As a result of the definition of the hierarchical transition function δ_H, the contexts in the stack may be DNF formulas with more than one disjunct. In contrast, the contexts associated with options are either ⊤ or DNFs with a single disjunct (remember that an option is formed for each disjunct). For instance, this occurs if the context is a ∨ b and {a, b} is observed: since both disjuncts are satisfied, the context shown in the call stack will be the full disjunction a ∨ b. In the simplest case, the derived option (which, as said before, is associated with a DNF with a single disjunct or ⊤) can include one of these disjuncts chosen uniformly at random (line 67). Alternatively, we could memorize all the derived options and later perform identical updates for each of them once terminated.
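Deriving an option from a stack context thus amounts to picking one disjunct of the DNF. A minimal sketch (the list-of-lists DNF representation is an assumption of ours, not the paper's):

```python
import random

def select_disjunct(dnf, rng=random):
    """Pick one disjunct of a DNF context uniformly at random (cf. line 67
    of Algorithm 2). A DNF is a non-empty list of disjuncts; each disjunct
    is a list of literals."""
    return rng.choice(dnf)

# Context a OR b with label {a, b} observed: both disjuncts are satisfied,
# so the derived option carries one of them chosen uniformly at random.
dnf = [["a"], ["b"]]
assert select_disjunct(dnf) in dnf
```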

C.3 EXAMPLES

We briefly describe some examples of how policy learning is performed in the HRM of Figure 1b. We first enumerate the options in the hierarchy. The formula options are ω ⊤, 1,0,¬ , ω ⊤, 2,0,⊤ , ω ⊤, 1,0,⊤ , ω ⊤, 1,1,⊤ , ω ⊤, 2,1,⊤ , and ω ⊤, 0,3,⊤ . The first option should lead the agent to observe the label { } to satisfy ¬ . The Q-functions associated with this set of options are q ∧¬ , q , q , q and q . Note that ω ⊤, 1,1,⊤ and ω ⊤, 2,1,⊤ are both associated with q . Conversely, the call options are ω 1,¬ 0,0,⊤ , ω 2,⊤ 0,0,⊤ , ω 2,⊤ 0,1,⊤ , and ω 1,⊤ 0,2,⊤ , where the first one achieves its local goal if the formula options ω ⊤, 1,0,¬ and ω ⊤, 1,1,⊤ sequentially achieve theirs. The associated Q-functions are q_0, q_1 and q_2. Note that ω 2,⊤ 0,0,⊤ and ω 2,⊤ 0,1,⊤ are both associated with q_2.

We now describe a few steps of the aforementioned option selection algorithm in two scenarios. First, we consider the scenario where all chosen options run to completion (i.e., until their local goals are achieved):
1. The initial hierarchy state is ⟨M_0, u⁰_0, ⊤, []⟩ and the option stack Ω_H is empty. We select options to fill Ω_H. The first option is chosen from u⁰_0 in M_0 using a policy induced by q_0. At this state, the available options are ω 1,¬ 0,0,⊤ and ω 2,⊤ 0,0,⊤ . Let us assume that the former is chosen. Then an option from the initial state of M_1 under context ¬ is chosen, which can only be ω ⊤, 1,0,¬ . Since this option is a formula option (the call is made to M_⊤), we do not select any more options and the option stack is Ω_H = ⟨ω 1,¬ 0,0,⊤ , ω ⊤, 1,0,¬ ⟩.

Algorithm 2 Episode execution using an HRM (continues on p. 27)
Input: An HRM H = ⟨M, M_r, P⟩ and an environment ENV = ⟨S, A, p, r, γ, P, l, τ⟩.
10:   Ω_β, Ω_H ← TERMINATEOPTIONS(Ω_H, s, ⟨M_i, u, Φ, Γ⟩, ⟨M_j, u′, Φ′, Γ′⟩)
11:   UPDATECALLQFUNCTIONS(Ω_β, s_{t+1}, l(s_{t+1}))
12:   if |Ω_β| > 0 then
13:     Ω_H ← ALIGNOPTIONSTACK(Ω_H, Γ′, Ω_β)
14:   ⟨M_i, u, Φ, Γ⟩ ← ⟨M_j, u′, Φ′, Γ′⟩
15: procedure FILLOPTIONSTACK(s, ⟨M_i, u, •, Γ⟩, Ω_H)
16:   Ω′_H ← Ω_H
17:   Φ ← ⊤                                  ▷ The context is initially true
18:   M_j ← M_i; v ← u                       ▷ The state-automaton pair in which an option is selected
19:   while the last option in Ω′_H is not a formula option do
20:     ω^{x,ϕ}_{j,v,Φ} ← SELECTOPTION(s, M_j, v, Φ)   ▷ Select an option (e.g., with ϵ-greedy)
21:     if x ≠ ⊤ then                        ▷ If the option is a call option
22:       M_j ← M_x; v ← u⁰_x                ▷ Next option is chosen on the called RM's initial state
23:     Φ ← DNF(Φ ∧ ϕ)                       ▷ Update the context
24:     Ω′_H ← Ω′_H ⊕ ω^{x,ϕ}_{j,v,Φ}        ▷ Append the selected option
25:   return Ω′_H

The remaining procedures work as follows. TERMINATEOPTIONS pops options starting from the deepest one, appending each terminated option ω^{x,ϕ}_{k,v,Ψ} to the list Ω_β and removing it from Ω′_H (line 36); a call option keeps running only if some item ⟨u_f, •, M_i, M_j, ϕ′, Φ′⟩ of the new call stack Γ′ can be mapped into it, i.e., u_f = v, i = k, j = x, ϕ ⊆ ϕ′ and Φ ⊆ Φ′ (line 49). ALIGNOPTIONSTACK and its helper ALIGNOPTIONSTACKHELPER (line 61) then realign the option stack with the call stack: for each unmatched stack item ⟨u_f, •, M_i, M_j, ϕ, •⟩, a disjunct ϕ_sel is selected from ϕ (e.g., randomly; line 67), the derived option ω^{j,ϕ_sel}_{i,u_f,Φ′} is appended to the option stack (line 68), the propagated context is updated as Φ′ ← DNF(Φ′ ∧ ϕ) (line 69), and Ω′_H is returned (line 70).

2. The agent acts according to the formula option in Ω_H, ω ⊤, 1,0,¬ , whose policy is induced by q ∧¬ . Let us assume that the policy tells the agent to turn right. Since the label at this location is empty, the hierarchy state remains the same; therefore, no options terminate, and the option stack does not change.
3. Let us assume that the agent moves forward twice, thus observing { }.
The hierarchy state then becomes ⟨M_1, u¹_1, ⊤, [⟨u⁰_0, u¹_0, M_0, M_1, ¬ , ⊤⟩]⟩ (see Appendix A for a step-by-step application of the hierarchical transition function). We check which options in Ω_H have terminated, starting from the last chosen one. The formula option ω ⊤, 1,0,¬ terminates because the hierarchy state has changed. In contrast, the call option ω 1,¬ 0,0,⊤ does not terminate since there is an item in the call stack, ⟨u⁰_0, u¹_0, M_0, M_1, ¬ , ⊤⟩, that can be mapped into it (meaning that the option is still running). 4. An experience (s, ω ⊤, 1,0,¬ , s′) is formed for the terminated option, where s and s′ are the tuples observed at initiation and termination, respectively. This tuple is added to D_1, the replay buffer associated with the RM where the option appears, since the option achieved its goal (i.e., a label satisfying the formula ∧ ¬ was observed). 5. We align Ω_H with the new stack. In this case, Ω_H remains unchanged since its only option can be mapped into an item of the new stack. 6. We start a new step. Since the option stack does not contain a formula option, we select new options from the current hierarchy state according to a policy induced by q_1. In this case, there is a single eligible option: ω ⊤, 1,1,⊤ .

In the second scenario, we observe what occurs when the HRM traversal differs from the options chosen by the agent: 1. The initial step is like the one in the previous scenario, but we assume ω 2,⊤ 0,0,⊤ is selected instead. Then, since this is a call option, an option from the initial state of M_2 under context ⊤ is chosen, which can only be ω ⊤, 2,0,⊤ . The option stack thus becomes Ω_H = ⟨ω 2,⊤ 0,0,⊤ , ω ⊤, 2,0,⊤ ⟩. 2. Let us assume that by taking actions according to ω ⊤, 2,0,⊤ we end up observing { }. As in the previous scenario, the hierarchy state becomes ⟨M_1, u¹_1, ⊤, [⟨u⁰_0, u¹_0, M_0, M_1, ¬ , ⊤⟩]⟩. We check which options in Ω_H have terminated.
The formula option ω ⊤, 2,0,⊤ terminates since the hierarchy state has changed, and the call option ω 2,⊤ 0,0,⊤ also terminates since it cannot be mapped into an item of the call stack. Note that these options should intuitively finish since the HRM is being traversed through a path different from the one chosen by the agent. 3. The replay buffers are not updated for these options since they have not achieved their local goals. 4. We align Ω_H with the new stack. The only item of the stack, ⟨u⁰_0, u¹_0, M_0, M_1, ¬ , ⊤⟩, can be mapped into option ω 1,¬ 0,0,⊤ . We assume that this option started on the same tuple s and has run for the same number of steps as the last terminated option ω 2,⊤ 0,0,⊤ .
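The mapping check used in both scenarios (a call option keeps running iff some frame of the new call stack corresponds to it) can be sketched as follows (a simplification that omits the option's accumulated context; the tuple representation is illustrative, not the paper's code):

```python
def still_running(option, call_stack):
    """A call option started at state u_f of RM M_i, calling M_j under call
    condition phi, keeps running iff the new call stack contains a matching
    frame (cf. TERMINATEOPTIONS in Algorithm 2). Frames are tuples
    (u_f, u_next, caller, callee, phi_literals, context)."""
    u_f, caller, callee, phi = option
    return any(frame[0] == u_f and frame[2] == caller and
               frame[3] == callee and phi <= frame[4]
               for frame in call_stack)

# New stack after observing the label: M0 is mid-call into M1.
stack = [("u0_0", "u1_0", "M0", "M1", frozenset({"not_k"}), frozenset())]
assert still_running(("u0_0", "M0", "M1", frozenset({"not_k"})), stack)      # scenario 1
assert not still_running(("u0_0", "M0", "M2", frozenset()), stack)           # scenario 2
```

In the first scenario the call option maps into the single stack frame and survives; in the second, the agent's chosen call (to M2) has no matching frame, so the option terminates without reaching its local goal.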

D CURRICULUM LEARNING IMPLEMENTATION

We here describe the details of the curriculum learning method introduced in Section 5. When an episode is completed for M_ij (instance j of task i), R_ij is updated using the episode's undiscounted return r as R_ij ← βR_ij + (1 - β)r, where β ∈ [0, 1] is a hyperparameter. A score c_ij = 1 - R_ij is computed from the return and used to determine the probability of selecting tasks and instances. Note that this scoring function, also used in the curriculum method by Andreas et al. (2017), assumes that the undiscounted return ranges between 0 and 1 (see Section 2). The probability of choosing task i is max_j c_ij / Σ_k max_l c_kl; that is, a task containing an instance on which performance is very poor has a higher probability of being chosen. Having selected task i, the probability of choosing instance j is c_ij / Σ_k c_ik, i.e., instances where performance is worse have a higher probability of being chosen.
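The two-stage sampling above can be sketched as follows (function names are ours; the scores assume returns in [0, 1] as stated):

```python
import random

def update_return(R, r, beta=0.9):
    """Exponential moving average R <- beta * R + (1 - beta) * r."""
    return beta * R + (1 - beta) * r

def select_task_instance(returns, rng=random):
    """Pick a task, then an instance, from average returns R[i][j] in [0, 1].
    Scores c_ij = 1 - R_ij; tasks are weighted by their worst instance
    (max_j c_ij), instances by their own score, so worse performance means
    a higher selection probability."""
    c = [[1.0 - r for r in row] for row in returns]
    task_weights = [max(row) for row in c]
    i = rng.choices(range(len(c)), weights=task_weights)[0]
    j = rng.choices(range(len(c[i])), weights=c[i])[0]
    return i, j

# Task 1 performs worse than task 0, so it is chosen more often.
i, j = select_task_instance([[0.9, 0.8], [0.1, 0.5]])
```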

E LEARNING AN HRM FROM TRACES WITH ILASP

We formalize the task of learning an HRM using ILASP (Law et al., 2015), an inductive logic programming system that learns answer set programs from examples. We refer the reader to Gelfond & Kahl (2014) for an introduction to answer set programming (ASP), and to Law (2018) for ILASP. Our formalization is close to that by Furelos-Blanco et al. (2021) for flat finite-state machines. Without loss of generality, as stated in Section 5, we assume that each RM has exactly one accepting and one rejecting state. We first describe how HRMs are represented in ASP, and then explain the encoding of the HRM learning task in ILASP.

E.1 REPRESENTATION OF AN HRM IN ANSWER SET PROGRAMMING

In this section, we explain how HRMs are represented using Answer Set Programming (ASP). First, we describe how traces are represented. Then, we present how HRMs themselves are represented, and introduce the general rules that describe the behavior of these hierarchies. Finally, we prove the correctness of the representation. We use A(X) to denote the ASP representation of X (e.g., a trace).

Definition 9 (ASP representation of a label trace). Given a label trace λ = ⟨L_0, ..., L_n⟩, A(λ) denotes the set of ASP facts that describe it:

A(λ) = {label(p, t). | 0 ≤ t ≤ n, p ∈ L_t} ∪ {step(t). | 0 ≤ t ≤ n} ∪ {last(n).}.

The label(p, t) fact indicates that proposition p ∈ P is observed at step t, step(t) states that t is a step of the trace, and last(n) indicates that the trace ends at step n.

Example 2. The set of ASP facts for the label trace λ = ⟨{ }, {}, { }⟩ is A(λ) = {label( , 0)., label( , 2)., step(0)., step(1)., step(2)., last(2).}.

Definition 10 (ASP representation of an HRM). Given an HRM H = ⟨M, M_r, P⟩, A(H) = ∪_{M_i ∈ M \ {M_⊤}} A(M_i), where A(M_i) = A_U(M_i) ∪ A_φ(M_i) and A_U(M_i) = {state(u, M_i). | u ∈ U_i}.

The first fact indicates that the initial state of the root RM is reached from step 0 to step 0. The second rule indicates that the initial state of a non-root RM is reached from step T to step T (i.e., it is reached at any time). The third rule represents the loop transition in the initial state of the root M_r: we stay there if no call can be started at T (i.e., we are not moving in the HRM). The fourth rule is analogous to the third, but for the accepting state of the root instead of the initial state. Remember that this is the only accepting state in the HRM that does not return control to a calling RM.
The fifth rule is similar to the previous ones: it applies to non-accepting states reached after T0, which excludes looping in the initial states of non-root RMs at the time of starting them (i.e., loops are permitted in the initial state of a non-root RM if we can reach it afterwards by going back to it). The last rule indicates that Y is reached at step T2 in RM M started at T0 if there is an outgoing transition from the current state X to Y at time T that holds between T and T2, and state X has been reached between T0 and T. We will later see how δ is defined.

R_2 = {
  reachable(u0, Mr, 0, 0).
  reachable(u0, M, T, T) :- state(u0, M), M != Mr, step(T).
  reachable(X, M, T0, T+1) :- reachable(X, M, T0, T), not pre_sat(X, M, T), step(T), X = u0, M = Mr.
  reachable(X, M, T0, T+1) :- reachable(X, M, T0, T), not pre_sat(X, M, T), step(T), X = uA, M = Mr.
  reachable(X, M, T0, T+1) :- reachable(X, M, T0, T), not pre_sat(X, M, T), step(T), T0 < T, X != uA.
  reachable(Y, M, T0, T2) :- reachable(X, M, T0, T), δ(X, Y, M, T, T2).
}.

The rule set R_3 introduces two predicates. The predicate satisfied(M, T0, TE) indicates that RM M is satisfied if its accepting state uA is reached between steps T0 and TE. Likewise, the predicate failed(M, T0, TE) indicates that RM M fails if its rejecting state uR is reached between steps T0 and TE. These two descriptions correspond to the first and third rules. The second rule applies to the leaf RM M_⊤, which always returns control immediately; thus, it is always satisfied between any two consecutive steps.

  

The following set, R_4, encodes multi-step transitions within an RM. The predicate δ(X, Y, M, T, T2) expresses that the transition from state X to state Y in RM M is satisfied between steps T and T2. The first rule indicates that this occurs if the context labeling a call to an RM M2 is satisfied and that RM is also satisfied (i.e., its accepting state is reached) between these two steps. In contrast, the second rule covers the case in which the rejecting state of the called RM is reached between those steps. In the latter case, we transition to the local rejecting state uR of M (i.e., the state we would have transitioned to does not matter). This follows from the assumption that rejecting states are global rejectors (see Section 3); the idea is that rejection is propagated bottom-up in the HRM.

R_4 = {
  δ(X, Y, M, T, T2) :- φ(X, Y, _, M, M2, T), satisfied(M2, T, T2).
  δ(X, uR, M, T, T2) :- φ(X, _, _, M, M2, T), failed(M2, T, T2).
}.

The last set, R_5, encodes the accepting/rejecting criteria. Remember that the last(T) predicate indicates that T is the last step of a trace. Therefore, the trace is accepted if the root RM is satisfied from the initial step 0 to step T+1 (the step after the last step of the trace, once the final label has been processed). In contrast, the trace is rejected if a rejecting state in the hierarchy is reached between these two steps.

R_5 = {
  accept :- last(T), satisfied(Mr, 0, T+1).
  reject :- last(T), failed(Mr, 0, T+1).
}.

Unlike the formalism introduced in Section 3, this encoding does not use stacks, which would be costly. Since the trace to be processed is known, the RMs can be evaluated bottom-up; that is, we first evaluate the lowest-level RMs on different subtraces, and the result of this evaluation is used in higher-level RMs. We now prove the correctness of the ASP encoding.
To do so, we first introduce what it means for an HRM to be valid with respect to a trace, as well as a definition and a theorem due to Gelfond & Lifschitz (1988).

Proof. First, we prove that the program P = A(H) ∪ R ∪ A(λ*), where R = ∪_{i=0}^{5} R_i, has a unique answer set. By Theorem 3, if P is stratified then it has a unique answer set. We show there is a way of partitioning P following the constraints in Definition 12. A possible partition is P = P_0 ∪ P_1 ∪ P_2 ∪ P_3, where P_0 = A(λ*), P_1 = A(H), P_2 = R_0 ∪ R_1, and P_3 = R_2 ∪ R_3 ∪ R_4 ∪ R_5. The unique answer set AS = AS_0 ∪ AS_1 ∪ AS_2 ∪ AS_3, where AS_i corresponds to partition P_i, is shown in Figure 8. For simplicity, λ*[t] denotes the t-th label in trace λ*, λ*[t:] denotes the subtrace from the t-th label onwards, and M_i(λ*) denotes the hierarchy traversal using RM M_i as the root.

We now prove that accept ∈ AS if and only if * = G (i.e., the trace achieves the goal). If * = G then, since the hierarchy is valid with respect to λ* (see Definition 11), the hierarchy traversal H(λ*) finishes in the accepting state u^A of the root; that is, H(λ*)[n+1] = ⟨M_r, u^A_r, •, •⟩. This holds if and only if accept ∈ AS. The proof showing that reject ∈ AS if and only if * = D (i.e., the trace reaches a dead-end) is similar. If * = D then, since the hierarchy is valid with respect to λ*, the hierarchy traversal H(λ*) finishes in a rejecting state u^R; that is, H(λ*)[n+1] = ⟨M_k, u^R, •, •⟩, where M_k ∈ M. This holds if and only if reject ∈ AS.
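The trace encoding of Definition 9 is mechanical. A minimal sketch producing the facts of A(λ) as strings (the function name is ours; propositions are written as plain identifiers):

```python
def trace_facts(trace):
    """ASP facts A(lambda) for a label trace (Definition 9): label/2 for
    each observed proposition, step/1 for each step, last/1 for the end."""
    n = len(trace) - 1
    facts = [f"label({p},{t})." for t, label in enumerate(trace)
             for p in sorted(label)]
    facts += [f"step({t})." for t in range(n + 1)]
    facts.append(f"last({n}).")
    return facts

# Example 2 with the propositions written out: lambda = <{a}, {}, {b}>
assert trace_facts([{"a"}, set(), {"b"}]) == [
    "label(a,0).", "label(b,2).",
    "step(0).", "step(1).", "step(2).", "last(2).",
]
```

These facts, together with A(H) and the general rules R, form the program whose unique answer set is characterized above.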

E.2 REPRESENTATION OF THE HRM LEARNING TASK IN ILASP

We here formalize the learning of an HRM and its mapping to a general ILASP learning task. We start by defining the HRM learning task introduced in Section 5.

Definition 13. An HRM learning task is a tuple T_H = ⟨r, U, P, M, M_C, u0, uA, uR, Λ, κ⟩, where r is the index of the root RM in the HRM; U ⊇ {u0, uA, uR} is a set of states of the root RM, always containing an initial state u0, an accepting state uA, and a rejecting state uR; P is a set of propositions; M ⊇ {M_⊤} is a set of RMs; M_C ⊆ M is a set of callable RMs; Λ = Λ_G ∪ Λ_D ∪ Λ_I is a set of label traces; and κ is the maximum number of conjunctions/disjuncts in each formula. An HRM H = ⟨M ∪ {M_r}, M_r, P⟩ is a solution of T_H if and only if it is valid with respect to all the traces in Λ. We make some assumptions about the set of RMs M: (i) all RMs reachable from RMs in M_C must be in M, (ii) all RMs in M are deterministic, and (iii) all RMs in M are defined over the same set of propositions P (or a subset of it).

For completeness, we provide the definition of an ILASP task introduced by Law et al. (2016). The first definition corresponds to the form of the examples taken by ILASP, while the second corresponds to the ILASP tasks themselves.

Figure 8: The unique answer set AS = AS_0 ∪ AS_1 ∪ AS_2 ∪ AS_3 of the program A(H) ∪ R ∪ A(λ*), where H is an HRM, R is the set of general rules, and λ* is a label trace.

Definition 14.
A context-dependent partial interpretation (CDPI) is a pair ⟨⟨e_inc, e_exc⟩, e_ctx⟩, where ⟨e_inc, e_exc⟩ is a pair of sets of atoms, called a partial interpretation, and e_ctx is an ASP program called a context. A program P accepts a CDPI ⟨⟨e_inc, e_exc⟩, e_ctx⟩ if and only if there is an answer set AS of P ∪ e_ctx such that e_inc ⊆ AS and e_exc ∩ AS = ∅.

Definition 15. An ILASP task is a tuple T = ⟨B, S_M, ⟨E+, E-⟩⟩, where B is the ASP background knowledge, which describes a set of concepts known before learning; S_M is the set of ASP rules allowed in the hypotheses; and E+ and E- are sets of CDPIs called, respectively, the positive and negative examples. A hypothesis H ⊆ S_M is an inductive solution of T if and only if (i) ∀e ∈ E+, B ∪ H accepts e, and (ii) ∀e ∈ E-, B ∪ H does not accept e.

Given an HRM learning task T_H, we map it into an ILASP learning task A(T_H) = ⟨B, S_M, ⟨E+, ∅⟩⟩ and use the ILASP system (Law et al., 2015) to find an inductive solution A_φ(H) ⊆ S_M that covers the examples. Note that we do not use negative examples (E- = ∅). We define the components of A(T_H) below.

Background Knowledge. The background knowledge B = B_U ∪ B_M ∪ R is a set of rules that describe the behavior of the HRM. The set B_U consists of state(u, M_r) facts for each state u ∈ U of the root RM with index r we aim to induce, whereas B_M = ∪_{M_i ∈ M \ {M_⊤}} A(M_i) contains the ASP representations of all RMs. Finally, R is the set of general rules introduced in Appendix E.1 that defines how HRMs process label traces. Importantly, the index of the root r in these rules must correspond to the one used in T_H.

Hypothesis Space. The hypothesis space S_M contains all call and φ rules that characterize a transition from a non-terminal state u ∈ U \ {uA, uR} to a different state u′ ∈ U \ {u} using edge i ∈ [1, κ]. Formally, it is defined as:

S_M = {
  call(u, u′, i, M).
  φ(u, u′, i, M, T) :- label(p, T), step(T).
  φ(u, u′, i, M, T) :- not label(p, T), step(T).
  | u ∈ U \ {uA, uR}, u′ ∈ U \ {u}, i ∈ [1, κ], M ∈ M_C, p ∈ P
}.

Example Sets. Given a set of traces Λ = Λ_G ∪ Λ_D ∪ Λ_I, the set of positive examples is defined as E+ = {⟨e_*, A(λ)⟩ | * ∈ {G, D, I}, λ ∈ Λ_*}, where
• e_G = ⟨{accept}, {reject}⟩,
• e_D = ⟨{reject}, {accept}⟩, and
• e_I = ⟨{}, {accept, reject}⟩
are the partial interpretations for goal, dead-end and incomplete traces, respectively. The accept and reject atoms express whether a trace is accepted or rejected by the HRM; hence, goal traces must only be accepted, dead-end traces must only be rejected, and incomplete traces can be neither accepted nor rejected. Note that the context of each example is the set of ASP facts A(λ) that represents the corresponding trace (see Definition 9).

Correctness of the Learning Task. The following theorem captures the correctness of the HRM learning task.

Theorem 4. Given an HRM learning task T_H = ⟨r, U, P, M, M_C, u0, uA, uR, Λ, κ⟩, an HRM H = ⟨M ∪ {M_r}, M_r, P⟩ is a solution of T_H if and only if A_φ(M_r) is an inductive solution of A(T_H) = ⟨B, S_M, ⟨E+, ∅⟩⟩.

Proof. Assume H is a solution of T_H.
⇐⇒ H is valid with respect to all traces in Λ (i.e., H accepts all traces in Λ_G, rejects all traces in Λ_D, and neither accepts nor rejects any trace in Λ_I).
⇐⇒ For each example e ∈ E+, R ∪ A(H) accepts e.
⇐⇒ For each example e ∈ E+, B ∪ A_φ(M_r) accepts e (the two programs are identical).
⇐⇒ A_φ(M_r) is an inductive solution of A(T_H).

Constraints. We introduce several constraints encoding structural properties of the HRMs we want to learn. Some of these constraints are expressed in terms of facts pos(u, u′, e, m, p) and neg(u, u′, e, m, p), which indicate that proposition p ∈ P appears positively (resp. negatively) in edge e from state u to state u′ in RM M_m. These facts are derived from the φ rules in A(H) and injected into the ILASP tasks using meta-program injection (Law et al., 2018).

The following set of constraints ensures that the learned root RM is deterministic using the saturation technique (Eiter & Gottlob, 1995). The idea is to check determinism top-down by selecting two edges from a given state in the root, each associated with a set of literals. Initially, each set of literals is formed by the literals in the formula labeling the corresponding edge. If a selected edge calls a non-leaf RM, we select an edge from the initial state of the called RM, augment the set of literals with the associated formula, and repeat the process until a call to the leaf RM is reached. We then check whether the literal sets are mutually exclusive. If there is a pair of edges from the root that are not mutually exclusive, the solution is discarded.

The set of rules is shown below. The first rule states that we keep two saturation IDs, one for each of the edges we select next and for which mutual exclusivity is checked. The second rule chooses a state X in the root, whereas the third rule selects two edges from this state and assigns a saturation ID to each of them. The fourth rule indicates that if one of the edges selected so far calls a non-leaf RM, we select one of the edges from the initial state of the called RM and assign it the same saturation ID. The fifth and sixth rules collect the propositions for each set of edges (one per saturation ID).
The next three rules indicate that if the edges are mutually exclusive (i.e., a proposition appears positively in one set and negatively in the other) or they are the same, then the answer set is saturated. The saturation itself is encoded in the following three rules: an answer set is saturated by adding every possible ed_mtx and root_point atom to the answer set. Due to the minimality of answer sets in disjunctive answer set programming, this "maximal" interpretation can only be an answer set if there is no smaller answer set. This is the case if and only if every choice of edges satisfies the condition (i.e., every choice of ed_mtx and root_point atoms results in saturation). The constraint encoded in the final rule then discards answer sets in which saturation did not occur, meaning that the remaining solutions must satisfy the condition.

    sat_id(1;2).
    root_point(X, M) : call(X, _, _, M, _), M = Mr.
    ed_mtx((X, Y, E, M, M2), SatID) : call(X, Y, E, M, M2) :- root_point(X, M), sat_id(SatID).
    ed_mtx((u0, Y2, E2, M2, M3), SatID) : call(u0, Y2, E2, M2, M3) :- ed_mtx((_, _, _, _, M2), SatID), M2 != M⊤.
    pos_prop(P, ID) :- ed_mtx((X, Y, E, M, _), ID), pos(X, Y, E, M, P).
    neg_prop(P, ID) :- ed_mtx((X, Y, E, M, _), ID), neg(X, Y, E, M, P).
    saturate :- pos_prop(P, 1), neg_prop(P, 2).
    saturate :- pos_prop(P, 2), neg_prop(P, 1).
    saturate :- ed_mtx((X, Y, _, M, M2), 1), ed_mtx((X, Y, _, M, M2), 2), root_point(X, M).
    root_point(X, M) :- call(X, _, _, M, _), saturate, M = Mr.
    ed_mtx((X, Y, E, M, M2), SatID) :- call(X, Y, E, M, M2), M = Mr, sat_id(SatID), saturate.
    ed_mtx((u0, Y, E, M, M2), SatID) :- call(u0, Y, E, M, M2), sat_id(SatID), saturate.
    :- not saturate.

Other constraints required to learn sensible HRMs are shown below. The first rule prevents an edge from being labeled with calls to two different RMs.
The second rule prevents edges from being labeled with the same proposition both positively and negatively.

    :- call(X, Y, E, M, M2), call(X, Y, E, M, M3), M2 != M3.
    :- pos(X, Y, E, M, P), neg(X, Y, E, M, P).

The following constraints are used to speed up the learning of an HRM. First, we extend the symmetry-breaking method by Furelos-Blanco et al. (2021), originally proposed for flat RMs, to our hierarchical setting. The main advantage of this method is that it accelerates learning without restricting the family of learnable HRMs. Other constraints analogous to those in previous work (Furelos-Blanco et al., 2021) that speed up the learning process further are enumerated below. For simplicity, some of these constraints use the auxiliary rule below to define the ed(X, Y, E, M) predicate, which is equivalent to the call(X, Y, E, M, M2) predicate but omits the called RM:

    ed(X, Y, E, M) :- call(X, Y, E, M, _).

The constraints are the following:

• Rule out inductive solutions where an edge calling the leaf M⊤ is labeled by a formula formed only by negated propositions. The rule below enforces a proposition to occur positively whenever a proposition appears negatively in an edge calling M⊤.

    :- neg(X, Y, E, M, _), not pos(X, Y, E, M, _), call(X, Y, E, M, M⊤).

• Rule out any inductive solution where an edge from X to Y with index E is labeled by neither a positive nor a negative literal. This rule only applies to calls to the leaf M⊤, thus avoiding unconditional transitions.

    :- not pos(X, Y, E, M, _), not neg(X, Y, E, M, _), call(X, Y, E, M, M⊤).

• Rule out inductive solutions containing states, other than the accepting and rejecting states, without outgoing edges. In general, these states are not interesting.

    has_outgoing_edges(X, M) :- ed(X, _, _, M).
    :- state(X, M), not has_outgoing_edges(X, M), X != uA, X != uR.

• Rule out inductive solutions containing cycles; that is, solutions where two states can be reached from each other.
The path(X, Y, M) predicate indicates that there is a directed path (i.e., a sequence of directed edges) from X to Y in RM M. The first rule states that there is a path from X to Y if there is an edge from X to Y. The second rule indicates that there is a path from X to Y if there is an edge from X to an intermediate state Z from which there is a path to Y. Finally, the third rule discards solutions where X and Y can be reached from each other through directed edges.

    path(X, Y, M) :- ed(X, Y, _, M).
    path(X, Y, M) :- ed(X, Z, _, M), path(Z, Y, M).
    :- path(X, Y, M), path(Y, X, M).
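The acyclicity condition encoded by the path rules amounts to computing a transitive closure and rejecting mutually reachable states. A minimal Python sketch of the same check (the function and its inputs are illustrative, not part of the paper's implementation):

```python
def has_cycle(edges):
    """Return True if two states are mutually reachable through directed
    edges, mirroring the path/3 predicate: path(X, Y) holds if there is a
    directed path from X to Y, and a candidate solution is discarded
    whenever path(X, Y) and path(Y, X) both hold."""
    # Transitive closure by fixpoint iteration, as the recursive rule does.
    path = set(edges)
    changed = True
    while changed:
        changed = False
        for (x, z) in list(path):
            for (z2, y) in list(path):
                if z == z2 and (x, y) not in path:
                    path.add((x, y))
                    changed = True
    return any((y, x) in path for (x, y) in path)
```

For example, a chain u0 → u1 → uA is accepted, whereas u0 → u1 → u0 is discarded.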

F EXPERIMENTAL DETAILS

In this section, we describe the details of the experiments introduced in Section 6. We discuss how the domains are implemented, the hyperparameters used to run the algorithms, and provide all specific results through tables and plots. All experiments ran on 3.40GHz Intel ® Core™ i7-6700 processors.

F.1 DOMAINS

The CRAFTWORLD domain is based on MiniGrid (Chevalier-Boisvert et al., 2018), thus inheriting many of its features. At each step, the agent observes a W × H × 3 tensor, where W and H are the width and height of the grid. The three channels contain the object IDs, the color IDs, and the object state IDs (including the orientation of the agent), respectively. Each of the objects we define (except for the lava, which already existed in MiniGrid) has its own object and color IDs. Before providing the agent with the state, the content of all matrices is scaled between -1 and 1. Note that even though the agent gets a full view of the grid, it is still unaware of the completion degree of a task. Other works have previously used the full view of the grid (Igl et al., 2019; Jiang et al., 2021).

The grids are randomly generated. In all settings (OP, OPL, FR, FRL), the agent and the objects are randomly assigned an unoccupied position. In the case of FR and FRL, no object occupies a position between rooms or its adjoining positions. There is a single object per object type (i.e., proposition) in OP and OPL, whereas there can be one or two per type in FR and FRL. Finally, there is a single lava location in OPL, which is randomly assigned (like the rest of the propositions), whereas in FRL there are four fixed lava locations placed in the intersections between doors, as shown in Figure 9.

The WATERWORLD domain (cf. Figure 10) has a continuous state space. The states are vectors containing the absolute position and velocity of the agent, and the relative positions and velocities of the other balls. The agent does not know the color of each ball. In all settings (WOD and WD), a WATERWORLD instance is created by assigning a random position and direction to each ball. Like in CRAFTWORLD, the agent does not know the degree of completion of a task.
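The scaling step above can be sketched as a simple linear map; the exact ID ranges and formula used in the implementation are not specified here, so the snippet below assumes IDs are non-negative integers up to a known maximum:

```python
def scale_channel(channel, max_id):
    """Linearly map integer IDs in [0, max_id] to floats in [-1, 1].
    `channel` is one W x H matrix of a CRAFTWORLD observation tensor;
    the mapping 2 * v / max_id - 1 is an assumed normalization scheme."""
    return [[2.0 * v / max_id - 1.0 for v in row] for row in channel]
```

With max_id = 10, an ID of 0 maps to -1.0, 5 maps to 0.0, and 10 maps to 1.0.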

F.2 HYPERPARAMETERS

Table 2 lists some of the hyperparameters used in the experiments. The rest of the hyperparameters and other details (e.g., architectures, evaluation of other methods) are discussed in the following paragraphs.

Architectures. The DQNs for CRAFTWORLD consist of a 3-layer convolutional neural network (CNN) with 16, 32 and 32 filters, respectively. All kernels are 2 × 2 and use a stride of 1. In the FR and FRL settings, there is a max pooling layer with kernel size 2 × 2 after the first convolutional layer. This part of the architecture is based on that by Igl et al. (2019) and Jiang et al. (2021), who also work on MiniGrid using the full view of the grid. In DQNs associated with formulas, the CNN's output is fed to a 3-layer multilayer perceptron (MLP) whose hidden layer has 256 rectifier units and whose output layer has a single output for each action. In the case of DQNs for RMs, the output of the CNN is extended with the encoding of the RM state and the context (as discussed in Appendix C) before being fed to a 3-layer MLP whose hidden layer has 256 rectifier units and whose output layer has a single output for each call in the RM. The architecture for WATERWORLD is a simple modification of the one introduced by Toro Icarte et al. (2018). The formula DQNs consist of a 5-layer MLP, where each of the 3 hidden layers has 512 rectifier units. The DQNs for the RMs share the same architecture and, like in CRAFTWORLD, the state from the environment is extended with the state and context encodings.

Compression. Akin to some methods for learning RMs (Furelos-Blanco et al., 2021; Toro Icarte et al., 2019), we compress label traces by merging consecutive equal labels into a single one. For instance, ⟨{}, {a}, {a}, {}, {}, {a}, {b}⟩ becomes ⟨{}, {a}, {}, {a}, {b}⟩.

Curriculum.
The average undiscounted returns (see Section 5 and Appendix D) are updated for each task-instance pair every 100 training episodes using the undiscounted return obtained by the greedy policies in a single evaluation episode.

Flat HRM Learning Methods. We briefly describe the methods used to learn flat HRMs. Each run consists of 150,000 episodes, and the set of instances is exactly the same across methods. The core difference with respect to learning non-flat HRMs is that there is a single task for which the HRM is learned. Our method, LHRM, is therefore unable to reuse previously learned HRMs for other tasks; however, it still uses the same hyperparameters. In the case of DeepSynth (Hasanbeig et al., 2021), LRM (Toro Icarte et al., 2019) and JIRP (Xu et al., 2020), we exclusively evaluate their RM learning components using traces collected through random walks. For a fair comparison against LHRM (both in the non-flat and flat learning cases), we (i) compress the traces using the aforementioned methodology, and (ii) use the OP and WOD settings of CRAFTWORLD and WATERWORLD, respectively, where observing goal traces is very easy for simple tasks such as MILKBUCKET. In these approaches, a different instance is selected at each episode following a cyclic order (i.e., 1, 2, ..., I − 1, I, 1, 2, ...). The set of propositions in these approaches includes a proposition covering the case where no other propositions are observed (if needed). In the case of LRM, one of the parameters is the maximum number of RM states, which we set to that of the minimal RM. Finally, we modify DeepSynth to avoid periodically calling the learner (i.e., we only call it when a counterexample trace is observed): this is not done in the other approaches and usually causes timeouts unnecessarily (the same RM is repeatedly learned).

ILASP. We use ILASP2 (Law et al., 2015) to learn the HRMs.
For efficiency, the default calls to the underlying ASP solver are modified to be made with the flag --opt-mode=ignore, meaning that non-minimal solutions might be obtained (i.e., solutions involving more rules than needed), so the learned root might contain some unnecessary edges. In practice, the solutions produced by ILASP rarely contain such edges and, if they do, these edges eventually disappear once an appropriate counterexample is observed. We hypothesize that using this flag helps because no optimization step is performed every time ILASP is called. The design of the ILASP tasks is discussed in Appendix E.2. We highlight that this notion of minimality is not related to that of a minimal RM (i.e., an RM with the fewest states) described in Section 5.
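The label-trace compression described earlier in this section (merging consecutive equal labels into a single one) can be sketched as follows; the concrete labels are illustrative:

```python
def compress(trace):
    """Merge consecutive equal labels of a label trace into a single label,
    e.g. [{}, {a}, {a}, {}] -> [{}, {a}, {}]."""
    out = []
    for label in trace:
        if not out or out[-1] != label:  # keep only changes of label
            out.append(label)
    return out
```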

F.3 EXTENDED RESULTS

We organize the tables and figures following the structure in Section 6.

Policy Learning. Figure 11 shows the plots omitted in the main paper (the remaining CRAFTWORLD setting, FRL, is shown in Figure 2). The number of plotted episodes varies across domains and tasks for clarity. Figure 12 shows the learning curves for a task called TENPAPERS, which consists of performing PAPER a total of 10 times. The plot shows that exploiting the non-flat HRM leads to much faster convergence than with the equivalent flat one. The difference arises from the fact that the non-flat HRM can reuse the policies for the called PAPER RM, whereas the flat one cannot.

Learning Non-Flat HRMs. We present tables containing the results for the HRM learning component of LHRM. The content of the columns is the following, left to right: (1) task name; (2) number of runs in which at least one goal trace was observed; (3) number of runs in which at least one HRM was learned; (4) time spent learning the HRMs; (5) number of calls to ILASP made to learn the HRMs; (6) number of states of the final HRM; (7) number of edges of the final HRM; (8) number of episodes between the learning of the first HRM and the activation of the task's level; (9) number of example traces of a given type (G = goal, D = dead-end, I = incomplete); and (10) length of the example traces of a given type. In addition, the bottom of the tables contains the number of completed runs (i.e., the number of runs that did not time out), the total time spent learning the HRMs, and the total number of calls made to ILASP. In the case of CRAFTWORLD, Table 3 shows the results for the default case (all lower-level RMs are callable and options are used for exploration), Table 4 shows the results when the set of callable RMs contains only those actually needed, and Table 5 shows the results using primitive actions for exploration instead of options. Analogous results are shown for WATERWORLD in Tables 6, 7 and 8.
The discussion of these results can be found in Section 6. Learning Flat HRMs. Table 9 shows the results of learning a non-flat HRM using LHRM, and the results of learning a flat HRM using several approaches (LHRM, DeepSynth, LRM and JIRP). An extended discussion of these results can be found in Section 6.

G EXAMPLES OF HIERARCHIES OF REWARD MACHINES

Figures 14 and 15 show the minimal root RMs for the CRAFTWORLD and WATERWORLD tasks, respectively. In the case of CRAFTWORLD, since two or more propositions can never occur simultaneously, the mutual exclusivity between formulas could be enforced differently. These RMs correspond to the settings without dead-ends; thus, they do not include rejecting states.

[Tables 3-8: per-task results of LHRM's HRM learning component for CRAFTWORLD and WATERWORLD; the numeric table content was lost in extraction.]

H ILLUSTRATION OF LHRM

Figure 13 illustrates the main components of our algorithm for learning HRMs, LHRM, described in Section 5. Given a set of tasks with known levels and a set of instances, we select a (task, instance) pair at the beginning of an episode. The HRM corresponding to the selected task is taken from the bank of HRMs. At each step, the agent performs an action a_{t+1} in the (task, instance) environment, and observes a tuple s_t and a label L_t. The label is used to (i) determine the next hierarchy state u_{t+1} in the HRM and the reward r_{t+1}, and (ii) update the label trace ⟨L_0, ..., L_t⟩. If the trace is a counterexample (i.e., the tuple s_t and the new hierarchy state u_{t+1} are inconsistent with each other), we add it to our set of counterexample traces for the current task and learn a new HRM using the ILASP system. The learned HRM is then updated in the bank of HRMs so that future tasks can reuse it. Note that this description omits some aspects for simplicity, such as how the (task, instance) pairs are selected, the exploration using options from lower-level HRMs, or the accumulation of traces before learning the first HRM for a given task.

Table 9: Results of learning non-flat and flat HRMs using different methods. The columns are the following: the number of completed runs without timing out, the amount of time needed to learn the HRMs or RMs, and the number of states and edges of the RM.
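The counterexample-driven loop described above can be sketched as follows. The `env`, `hrm` and `learner` interfaces are hypothetical stand-ins for the components in Figure 13 (environment, bank of HRMs, and ILASP call), not the paper's actual API:

```python
def run_episode(env, hrm, learner, counterexamples):
    """One schematic LHRM episode: act, track the hierarchy state, and
    relearn the HRM when the observed trace contradicts it."""
    s, u = env.reset(), hrm.initial_state()
    trace = []
    done = False
    while not done:
        a = hrm.policy(s, u)              # act according to the current HRM
        s, label, done = env.step(a)
        trace.append(label)               # extend the label trace
        u = hrm.transition(u, label)      # next hierarchy state
        if hrm.inconsistent(s, u):        # counterexample detected
            counterexamples.append(trace)
            return learner.learn(counterexamples)  # new HRM (e.g., via ILASP)
    return hrm                            # HRM unchanged
```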



Footnotes:
• The term q_{φ∧Φ}(s_{t+1}, ...) in Equation 1 is 0 if (s^T_{t+1}, s^G_{t+1}) = (⊤, ⊥). Since experiences (s_t, a, s_{t+1}) are shared through the replay buffer, evaluating the condition differently can produce instabilities.
• We refer the reader to the 'Flattening Algorithm' description introduced later for specific details.
• We do not cover the case where v is an accepting state since, by assumption, there are no outgoing transitions from it. In the case of rejecting states, we keep all of them as explained in the previous case and, therefore, there are no substitutions to be made. We also do not cover the case where w = u^0_j since the input flat machines never have edges to their initial states, but to the dummy initial state.
• We denote by φ1 ⊆ φ2, where φ1, φ2 ∈ DNF_P, the fact that all the disjuncts of φ1 appear in φ2. This containment relationship also holds if both formulas are ⊤. For instance, (a ∧ ¬c) ⊆ (a ∧ ¬c) ∨ d.
• The code of these methods is publicly available: DeepSynth (https://github.com/grockious/deepsynth, MIT license), LRM (https://bitbucket.org/RToroIcarte/lrm, no license), and JIRP (https://github.com/corazza/stochastic-reward-machines, no license). The first two links can be found in the papers, whereas the last one was provided by one of the authors through personal communication.
• ILASP is free to use for research and education (see https://www.ilasp.com/terms).
• Disclaimer: At the time of writing this revised version (November 10), this experiment consisting of 5 runs has not finished. However, we are confident the observations we have made will still hold when it is completed.
• For example, if (s^T_t, s^G_t) = (⊤, ⊤) and the state in u_{t+1} is not the accepting state of the root RM.



Figure 2: Learning curves for three CRAFTWORLD (FRL) tasks using handcrafted HRMs.

(d) Flattened HRM after transforming M1.

Figure 4: Example to justify the need for the preliminary transformation algorithm.

Figure 6: Example of an HRM whose root has height h r used in the proof of Theorem 2.

(a) Tree for L = {{a}, {b}, {c}, {d}}. (b) Tree for L = {{a}, {b}, {c}, {d}, {a, c}}.

Figure 7: Examples of formula trees for different sets of literals. Note that the node a ∧ ¬b ∧ ¬c in (a) could also be a child of a ∧ ¬c (the parent depends on the insertion order).

satisfied(M, T0, TE) :- reachable(u_A, M, T0, TE).
satisfied(M⊤, T, T+1) :- step(T).
failed(M, T0, TE) :- reachable(u_R, M, T0, TE).

Figure 8: Answer sets for each of the partitions in the program P = A(H) ∪ R ∪ A(λ * ), where H is an HRM, R is the set of general rules and λ * is a label trace.

By Proposition 1, for each trace λ * ∈ Λ * where * ∈ {G, D, I}, A(H) ∪ R ∪ A(λ * ) has a unique answer set AS and (1) accept ∈ AS if and only if * = G, and (2) reject ∈ AS if and only if * = D.

Figure 9: An instance of the CRAFTWORLD grid in the FRL setting.

Figure 10: An instance of the WATERWORLD in the WOD setting (Toro Icarte et al., 2018).

Figure 11: Learning curves comparing the performance of handcrafted non-flat and flat HRMs.

Figure 12: Learning curves comparing the performance of handcrafted non-flat and flat HRMs in the TENPAPERS task (FRL setting).



Figure 13: Overview of LHRM.

denotes a label that cannot satisfy any formula, and • denotes something unimportant for the case. 2. If L satisfies the context of a call and the exit condition from the initial state of the called RM, push the call onto the stack and recursively apply δ

the agent gets the next state (line 7). The next hierarchy state is obtained by applying the hierarchical transition function δ_H using label l(s_{t+1}) (line 8). The Q-functions associated with the formula options' policies are updated after this step (line 9).

3. The option stack Ω_H is updated by removing those options that have terminated (lines 10, 26-45). The terminated options are saved in a different list Ω_β to update the Q-functions of the RMs where they were initiated later on (line 11). The termination of the options is performed as described in Section 4. All options terminate if a terminal state is reached (lines 27-28). Otherwise, we check options in Ω_H from deeper to shallower levels. The first checked option is always a formula option, which terminates if the hierarchy state has changed (line 40). In contrast, a call option terminates if it does not appear in the stack (lines 33, 46-51). When an option is found to terminate, it is added to Ω_β and removed from Ω_H (lines 35-36, 41-42). If a non-terminating option is found (lines 37, 43), we stop checking for termination (no higher-level options can have terminated in this case).

4. If at least one option has terminated (line 12), the option stack is updated such that it contains all options appearing in the call stack (lines 13, 52-70). Options are derived for the full stack if Ω_H is empty (lines 53, 54), or for the part of the stack not appearing in Ω_H (lines 56-59). The newly derived options (lines 61-70) from the call stack are assumed to start in the same state as the last terminated option (i.e., the shallowest terminated option, line 63) and to have been run for the same number of steps too. Crucially, the contexts should be propagated accordingly, starting from the context of the last terminated option (line 69).
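The deepest-to-shallowest termination check in step 3 can be sketched as follows; `terminates` is a hypothetical predicate standing in for the formula-option and call-option termination tests described above:

```python
def split_terminated(option_stack, terminates):
    """Check options from deepest (end of list) to shallowest; once a
    non-terminating option is found, stop: no shallower option can have
    terminated. Returns (terminated_options, remaining_stack)."""
    terminated = []
    remaining = list(option_stack)
    for option in reversed(option_stack):  # deepest first
        if terminates(option):
            terminated.append(option)      # goes to the Omega_beta list
            remaining.pop()                # removed from the option stack
        else:
            break                          # shallower options survive
    return terminated, remaining
```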

1: s_0 ← ENV.INIT()  ▷ Initial MDP tuple
2: ⟨M_i, u, Φ, Γ⟩ ← ⟨M_r, u_0, ...⟩
⟨M_j, u′, Φ′, Γ′⟩ ← δ_H(⟨M_i, u, Φ, Γ⟩, l(s_{t+1}))

the hierarchy state has changed. . .

that will help us derive the proof.

Definition 11. Given a label trace λ_*, where * ∈ {G, D, I}, an HRM H is valid with respect to λ_* if H accepts λ_* and * = G (i.e., λ_* is a goal trace), or H rejects λ_* and * = D (i.e., λ_* is a dead-end trace), or H neither accepts nor rejects λ_* and * = I (i.e., λ_* is an incomplete trace).

Definition 12. An ASP program P is stratified when there is a partition P = P_0 ∪ P_1 ∪ ··· ∪ P_n (P_i and P_j disjoint for all i ≠ j) such that (1) for every predicate p, the definition of p (all clauses with p in the head) is contained in one of the partitions P_i, and (2) for each 1 ≤ i ≤ n, if a predicate occurs positively in a clause of P_i then its definition is contained within ∪_{j≤i} P_j, and if a predicate occurs negatively in a clause of P_i then its definition is contained within ∪_{j<i} P_j.

Theorem 3. If an ASP program P is stratified, then it has a unique answer set.
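Definition 12 can be checked operationally: at the predicate level, a normal program is stratified if and only if its predicate dependency graph has no cycle that passes through a negative edge. A sketch of this check (predicate-level only, ignoring arguments; the representation of dependencies is an assumption of this example):

```python
def is_stratified(pos_deps, neg_deps):
    """pos_deps/neg_deps: sets of (head, body_predicate) dependency pairs
    for positive and negative body occurrences. The program is stratified
    iff no dependency cycle contains a negative edge, i.e., strata as in
    Definition 12 exist."""
    # Triples (from, to, passes_negative_edge); closed under composition.
    paths = {(h, b, False) for (h, b) in pos_deps} | \
            {(h, b, True) for (h, b) in neg_deps}
    changed = True
    while changed:
        changed = False
        for (a, b, n1) in list(paths):
            for (b2, c, n2) in list(paths):
                if b == b2 and (a, c, n1 or n2) not in paths:
                    paths.add((a, c, n1 or n2))
                    changed = True
    # A self-path through negation means a cycle through a negative edge.
    return not any(a == b and n for (a, b, n) in paths)
```

For instance, {p :- not q. q :- r.} is stratified, whereas {p :- not q. q :- p.} is not.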

List of hyperparameters and their values.

Results of LHRM in CRAFTWORLD for the default case.

Results of LHRM in CRAFTWORLD with a restricted set of callable RMs.

Results of LHRM in CRAFTWORLD without exploration using options.

Results of LHRM in WATERWORLD for the default case.

Results of LHRM in WATERWORLD with a restricted set of callable RMs.

Results of LHRM in WATERWORLD without exploration using options.


Appendix B: Proofs for Theorems 1 and 2.
Appendix C: The implementation details (including pseudo-code) for the policy learning algorithm outlined in Section 4. We include running examples of the algorithm to aid understanding.
Appendix D: The implementation details for the curriculum learning mechanism introduced in Section 5.
Appendix E: How the root of an HRM is learned using the ILASP inductive logic programming system, including proofs of correctness.
Appendix F: Details on the experiments described in Section 6, such as computational resources, implementation of the domains, training and evaluation details (e.g., hyperparameters) and extended results (e.g., tables on which the results in the main paper are based).
Appendix G: Handcrafted HRMs for the tasks used in the experiments.

We plan to release the code if the paper is accepted.

Note that each non-leaf RM M_i in the hierarchy is associated with its own set of rules A(M_i), which are described as follows:
• Facts state(u, M_i) indicate that u is a state of RM M_i.
• Facts call(u, u′, e, M_i, M_j) indicate that edge e between states u and u′ in RM M_i is labeled with a call to RM M_j.
• Normal rules whose head is of the form φ(u, u′, e, M_i, T) indicate that the transition from state u to u′ with edge e in RM M_i does not hold at step T. The body of these rules consists of a single label(p, T) literal and a step(T) atom indicating that T is a step. As is common in ASP, variables are represented using uppercase letters, which is the case of the steps T here.

There are some important things to take into account regarding the encoding:
• There is no leaf RM M_⊤. We later introduce the ASP rules that emulate it.
• The edge identifiers e between a given pair of states (u, u′) range from 1 to the total number of conjunctions/disjuncts between them.

Note that in A_φ we assume that the leaf RM has an index, just like the other RMs in the HRM. The index could be n since the rest are numbered from 0 to n − 1.

Example 3.
The following rules represent the HRM in Figure 1b.

General Rules. The following sets of rules, whose union is denoted by R = ∪_{i=0}^{5} R_i, represent how an HRM functions (e.g., how transitions are taken, or the acceptance/rejection criteria). For simplicity, all initial, accepting and rejecting states are denoted by u_0, u_A and u_R, respectively. The rule below is the inversion of the negation of the state transition function φ. Note that the predicate for φ includes the called RM M2 as an argument.

The rule set R_1 introduces the pre_sat(X, M, T) predicate, which encodes the exit condition presented in Section 3 and indicates whether a call from state X of RM M can be started at step T. The first rule corresponds to the base case: if the leaf M_⊤ is called, then the condition is satisfied if the associated formula is satisfied. The second rule applies to calls to non-leaf RMs, where we need to satisfy the context of the call (like in the base case), and also check whether a call from the initial state of the potentially called RM can be started.

    pre_sat(X, M, T) :- φ(X, _, _, M, M⊤, T).
    pre_sat(X, M, T) :- φ(X, _, _, M, M2, T), pre_sat(u_0, M2, T), M2 != M⊤.

The rule set R_2 introduces the reachable(X, M, T0, T2) predicate, which indicates that state X of RM M is reached between steps T0 and T2. The latter step can also be seen as the step we are

Figure 14: Root reward machines for each of the CRAFTWORLD tasks.
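The recursion encoded by the two pre_sat rules can be sketched in Python. The representation of calls and the satisfaction hook `sat` are illustrative assumptions, not the paper's data structures:

```python
def pre_sat(state, rm, calls, sat, leaf="mT"):
    """Can a call from `state` of `rm` be started now?
    calls: dict mapping (rm, state) to a list of (formula_id, called_rm);
    sat(formula_id): whether that formula currently holds (hypothetical hook).
    Mirrors the two pre_sat rules: the base case for calls to the leaf, and
    the recursion through the initial state u0 of a called non-leaf RM."""
    for (phi, called) in calls.get((rm, state), []):
        if sat(phi):                      # the call's context is satisfied
            if called == leaf:
                return True               # base case: leaf RM is called
            if pre_sat("u0", called, calls, sat, leaf):
                return True               # recurse into the called RM
    return False
```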

