HIERARCHIES OF REWARD MACHINES

Abstract

Reward machines (RMs) are a recent formalism for representing the reward function of a reinforcement learning task through a finite-state machine whose edges encode landmarks of the task using high-level events. The structure of RMs enables the decomposition of a task into simpler and independently solvable subtasks that help tackle long-horizon and/or sparse-reward tasks. We propose a formalism for further abstracting the subtask structure by endowing an RM with the ability to call other RMs, thus composing a hierarchy of RMs (HRM). We exploit HRMs by treating each call to an RM as an independently solvable subtask using the options framework, and describe a curriculum-based method for learning HRMs from traces observed by the agent. Our experiments reveal that exploiting a handcrafted HRM leads to faster convergence than with a flat HRM, and that learning an HRM is feasible in cases where learning its equivalent flat representation is not.

1. INTRODUCTION

More than a decade ago, Dietterich et al. (2008) argued for the need to "learn at multiple time scales simultaneously, and with a rich structure of events and durations". Finite-state machines (FSMs) are a simple yet powerful formalism for abstractly representing temporal tasks in a structured manner. One of the most prominent types of FSMs used in reinforcement learning (RL; Sutton & Barto, 2018) are reward machines (RMs; Toro Icarte et al., 2018; 2022), where each edge is labeled with (i) a formula over a set of high-level events that captures a task's subgoal, and (ii) a reward for satisfying the formula. Hence, RMs fulfill the need for structuring events and durations. Hierarchical reinforcement learning (HRL; Barto & Mahadevan, 2003) frameworks, such as options (Sutton et al., 1999), have been applied over RMs to learn policies at two levels of abstraction: (i) select a formula (i.e., a subgoal) from a given RM state to complete the overall task, and (ii) select an action to satisfy the chosen formula (Toro Icarte et al., 2018; Furelos-Blanco et al., 2021). Thus, RMs also allow learning at multiple time scales simultaneously. The subtask decomposition powered by HRL eases the handling of long-horizon and sparse-reward tasks. Besides, RMs can act as an external memory in partially observable tasks by keeping track of the subgoals achieved so far and those still to be achieved.

In this work, we make the following contributions:

1. We enhance the abstraction power of RMs by defining hierarchies of RMs (HRMs), where constituent RMs can call other RMs (Section 3). We prove that any HRM can be transformed into an equivalent flat HRM that behaves exactly like the original, and we show that under certain conditions the equivalent flat HRM can have exponentially more states and edges.

2. We propose an HRL algorithm that exploits HRMs by treating each call as a subtask (Section 4). Learning policies in HRMs further fulfills the desiderata posed by Dietterich et al. since (i) there is an arbitrary number of time scales to learn across (not only two), and (ii) there is a richer range of increasingly abstract events and durations. Besides, hierarchies enable modularity and, hence, the reusability of RMs and their policies. Empirically, we show that leveraging a handcrafted HRM enables faster convergence than an equivalent flat HRM.

3. We introduce a curriculum-based method for learning HRMs from traces given a set of hierarchically composable tasks (Section 5). In line with the theory (Contribution 1), our experiments reveal that decomposing an RM into several is crucial to make its learning feasible (i.e., the flat HRM cannot be efficiently learned from scratch) since (i) the constituent RMs are simpler (i.e., they have fewer states and edges), and (ii) previously learned RMs can be used to efficiently explore the environment in the search for traces in more complex tasks.
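To make the reward-machine mechanics concrete, the following is a minimal illustrative sketch (not the paper's implementation): an RM is a finite-state machine whose edges carry (i) a formula over high-level events, here modeled as a simple predicate over a set of event labels, and (ii) a reward. The task, state names, and helper `run_to_completion` are hypothetical examples for illustration only.

```python
class RewardMachine:
    """A toy reward machine: states connected by edges labeled with a
    formula over high-level events and a reward for satisfying it."""

    def __init__(self, initial, final, edges):
        # edges: dict mapping state -> list of (formula, next_state, reward),
        # where formula is a predicate over the observed set of event labels
        self.initial = initial
        self.final = final
        self.edges = edges

    def step(self, state, events):
        """Advance on a set of observed high-level events; return
        (next_state, reward). The machine stays put if no edge fires."""
        for formula, next_state, reward in self.edges.get(state, []):
            if formula(events):
                return next_state, reward
        return state, 0.0


def run_to_completion(rm, trace):
    """Run an RM over a trace (a sequence of event sets); return the
    accumulated reward and whether the final state was reached."""
    state, total = rm.initial, 0.0
    for events in trace:
        state, reward = rm.step(state, events)
        total += reward
        if state == rm.final:
            return total, True
    return total, False


# Hypothetical "get coffee, then deliver it to the office" task:
# the subgoal formulas are membership tests over event labels.
coffee_rm = RewardMachine(
    initial="u0",
    final="u2",
    edges={
        "u0": [(lambda ev: "coffee" in ev, "u1", 0.0)],
        "u1": [(lambda ev: "office" in ev, "u2", 1.0)],
    },
)

state = coffee_rm.initial
state, r1 = coffee_rm.step(state, {"coffee"})   # first subgoal satisfied
state, r2 = coffee_rm.step(state, {"office"})   # final state reached, reward 1
```

In an HRM, an edge of one machine may instead call another machine such as `coffee_rm`; under the semantics sketched above, the calling machine would advance once the callee reaches its final state, which is what lets the options framework treat each call as an independently solvable subtask.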

