LOGICAL OPTIONS FRAMEWORK

Abstract

Learning composable policies for environments with complex rules and tasks is a challenging problem. We introduce a hierarchical reinforcement learning framework called the Logical Options Framework (LOF) that learns policies that are satisfying, optimal, and composable. LOF efficiently learns policies that satisfy tasks by representing each task as an automaton and integrating it into learning and planning. We state and prove conditions under which LOF learns satisfying, optimal policies. Lastly, we show how LOF's learned policies can be composed to satisfy unseen tasks with only 10-50 retraining steps. We evaluate LOF on four tasks in discrete and continuous domains.

1. INTRODUCTION

To operate in the real world, intelligent agents must be able to make long-term plans by reasoning over symbolic abstractions while also maintaining the ability to react to low-level stimuli in their environment (Zhang & Sridharan, 2020). Many environments obey rules that can be represented as logical formulae, e.g., the rules a car follows while driving, or a recipe a chef follows to cook a dish. Traditional motion and path planning techniques struggle to formulate plans over these kinds of long-horizon tasks, but hierarchical approaches such as hierarchical reinforcement learning (HRL) can solve lengthy tasks by planning over both the high-level rules and the low-level environment. However, solving these problems involves trade-offs among multiple desirable properties, which we identify as satisfaction, optimality, and composability (described below). Most of today's algorithms sacrifice at least one of these objectives. For example, Reward Machines (Icarte et al., 2018) are satisfying and optimal, but not composable; the options framework (Sutton et al., 1999) is composable and hierarchically optimal, but cannot satisfy specifications. We introduce a new approach called the Logical Options Framework, which builds upon the options framework and combines symbolic reasoning with low-level control to achieve satisfaction, optimality, and composability with as few compromises as possible. Furthermore, we show that our framework is compatible with a large variety of domains and planning algorithms, from discrete domains with value iteration to continuous domains with proximal policy optimization (PPO).

Satisfaction: An agent operating in an environment governed by rules must be able to satisfy the specified rules. Satisfaction is a concept from formal logic, in which the input to a logical formula causes the formula to evaluate to True. Logical formulae can encapsulate rules and tasks like the ones described in Fig. 1, such as "pick up the groceries" and "do not drive into a lake". In this paper, we state conditions under which our method is guaranteed to learn satisfying policies.

Optimality: Optimality requires that the agent maximize its expected cumulative reward for each episode. In general, satisfaction can be achieved by rewarding the agent for satisfying the rules of the environment. In hierarchical planning there are several types of optimality, including hierarchical optimality (optimal with respect to the hierarchy) and optimality (optimal with respect to everything). We prove in this paper that our method is hierarchically optimal and, under certain conditions, optimal.

Composability: Our method also has the property of composability: once it has learned to satisfy a task, the learned model can be rearranged to satisfy a large variety of related tasks. More specifically, the rules of an environment can be factored into liveness and safety properties, which we discuss in Sec. 3. The learned model can be adapted to satisfy any appropriate new liveness property. A shortcoming of many RL models is that they are not composable: trained to solve one specific task, they are incapable of handling even small variations in the task structure. However, the real world is a dynamic and unpredictable place, so the ability to automatically reason over as-yet-unseen tasks and rules is a crucial element of intelligence.

[Figure 1 panel captions: (a) "Go grocery shopping, pick up the kid, and go home, unless your partner calls telling you that they will pick up the kid, in which case just go grocery shopping and then go home. And don't drive into the lake." (b) The FSA representing the natural language instructions; the propositions are divided into "subgoal", "safety", and "event" propositions. (c) The low-level MDP and the corresponding policy that satisfies the instructions.]

The illustrations in Fig. 1 give an overview of our work. The environment is a world with a grocery store, your (hypothetical) kid, your house, and some lakes, in which you, the agent, are driving a car. The propositions are divided into "subgoals", representing events that can be achieved, such as going grocery shopping; "safety" propositions, representing events that must be avoided, such as driving into a lake; and "event" propositions, corresponding to events that you have no control over, such as receiving a phone call (Fig. 1b). In this environment, you must follow rules (Fig. 1a). These rules can be converted into a logical formula, and from there into a finite state automaton (FSA) (Fig. 1b). The Logical Options Framework learns an option for each subgoal (illustrated by the arrows in Fig. 1c) and a metapolicy for choosing among the options to reach the goal state of the FSA. After learning, the options can be recombined to fulfill other tasks.
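To make the FSA concrete, the following sketch encodes a task automaton like the one in Fig. 1b as a plain transition table and advances it on a sequence of achieved propositions. The state names, proposition names, and branching structure are illustrative assumptions, not the paper's exact automaton or implementation:

```python
# Illustrative encoding (not LOF's implementation) of a Fig.-1b-style task FSA.
# Each FSA state maps achieved propositions to successor states; propositions
# with no entry leave the state unchanged (a self-loop).
FSA = {
    "start":          {"phone_call": "shop_only", "groceries": "have_groceries"},
    "shop_only":      {"groceries": "go_home"},       # partner gets the kid
    "have_groceries": {"kid": "go_home", "phone_call": "go_home"},
    "go_home":        {"home": "goal"},
    "goal":           {},                              # accepting state
}

def run_fsa(events, state="start"):
    """Advance the FSA on a sequence of achieved subgoal/event propositions."""
    for p in events:
        state = FSA[state].get(p, state)  # unlisted propositions self-loop
    return state

# A satisfying trace: shop, pick up the kid, drive home.
assert run_fsa(["groceries", "kid", "home"]) == "goal"
# The phone-call branch: after the call, just shop and go home.
assert run_fsa(["phone_call", "groceries", "home"]) == "goal"
```

The metapolicy's job is then to pick, in each FSA state, the option whose subgoal triggers a transition toward the accepting state.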

1.1. CONTRIBUTIONS

We introduce the Logical Options Framework (LOF), which makes four contributions to the hierarchical reinforcement learning literature:
1. The definition of a hierarchical semi-Markov decision process (SMDP) that is the product of a logical FSA and a low-level environment MDP.
2. A planning algorithm for learning options and metapolicies for the SMDP that allows the options to be composed to solve new tasks with only 10-50 retraining steps.
3. Conditions, and proofs, for achieving satisfaction and optimality.
4. Experiments on a discrete domain and a continuous domain, on four tasks, demonstrating satisfaction, optimality, and composability.
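The product construction in contribution 1 can be sketched in a toy setting: the hierarchical SMDP's state is a pair (f, s) of an FSA state f and an environment state s; each option runs until its subgoal proposition becomes true, at which point the FSA transitions. All names below (a 1-D environment, greedy stand-ins for learned options, a fixed plan in place of a learned metapolicy) are simplifying assumptions for illustration:

```python
# Toy product SMDP: environment states are integers on a line, and two
# subgoal propositions "a" and "b" hold at fixed positions. The FSA requires
# achieving a and then b. None of this is LOF's actual API.
subgoal_pos = {"a": 2, "b": 5}
fsa_delta = {("q0", "a"): "q1", ("q1", "b"): "goal"}

def option_go_to(p, s):
    """Stand-in for a learned option: move greedily until proposition p holds."""
    while s != subgoal_pos[p]:
        s += 1 if s < subgoal_pos[p] else -1
    return s  # the option terminates when its subgoal is reached

def run_metapolicy(plan, f="q0", s=0):
    """Execute options in sequence; each termination triggers an FSA step."""
    for p in plan:
        s = option_go_to(p, s)            # low-level SMDP transition
        f = fsa_delta.get((f, p), f)      # high-level logical transition
    return f, s

assert run_metapolicy(["a", "b"]) == ("goal", 5)   # satisfies the task
assert run_metapolicy(["b", "a"])[0] != "goal"     # wrong order: not accepted
```

Composability shows up here as reordering the plan: the same two options satisfy any new liveness property over the subgoals a and b without relearning the options themselves.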

2. BACKGROUND

Linear Temporal Logic: We use linear temporal logic (LTL) to formally specify rules (Clarke et al., 2001). LTL formulae are used only indirectly in LOF, as they are converted into automata that the



Figure 1: Many parents face this task after school ends: who picks up the kid, and who gets groceries? The natural language instructions in (a) can be transformed into the FSA shown in (b). The pictorial symbols represent propositions, which are true or false depending on the state of the environment. The arrows in (c) represent subpolicies, and the colors of the arrows match the corresponding transitions in the FSA. The boxed phone at the beginning of some of the arrows represents that these subpolicies can occur only after the agent receives a phone call.

