CONSTRAINED HIERARCHICAL DEEP REINFORCEMENT LEARNING WITH DIFFERENTIABLE FORMAL SPECIFICATIONS

Abstract

Formal logic specifications are a useful tool for describing desired agent behavior and have been explored as a means to shape rewards in Deep Reinforcement Learning (DRL) across a variety of problems and domains. Prior reward-shaping work, however, has not considered making these specifications differentiable, which would yield a more informative signal about the objective via the specification gradient. This paper examines precisely such an approach, exploring a Lagrangian method that constrains policy updates using differentiable temporal logic specifications, which associate logic formulae with real-valued quantitative semantics. This constrained learning mechanism is then used in a hierarchical setting where a high-level, specification-guided neural network path planner works with a low-level control policy to navigate through planned waypoints. The effectiveness of our approach is demonstrated over four robot configurations with five different types of Signal Temporal Logic (STL) specifications. Our demo videos are collected at https://sites.google.com/view/schrl.

1. INTRODUCTION

Specifying tasks with precise and expressive temporal logic formal specifications has a long history (Pnueli, 1977; Kloetzer & Belta, 2008; Wongpiromsarn et al., 2012; Chaudhuri et al., 2021), but integrating these techniques into modern learning-based systems has been limited by the non-differentiability of the formulas used to construct these specifications. In the context of Deep Reinforcement Learning (DRL), a line of recent work (Li et al., 2017; Hasanbeig et al., 2019b; Jothimurugan et al., 2019; Icarte et al., 2022) tries to circumvent this difficulty by turning Linear Temporal Logic (LTL) specifications into reward functions used to train control policies for the specified tasks. The quantitative semantics introduced yields real-valued information about the task that can then be used by reinforcement learning agents via policy gradient methods. However, the sample complexity of such policy gradient approaches limits the scalability of these algorithms, especially when it comes to extracting reward functions from complex specifications (Yang et al., 2022). Moreover, these techniques do not consider how to effectively leverage the differentiability of the quantitative semantics associated with these specifications to yield a more accurate gradient than the policy gradient estimated from the LTL reward and samples. Interestingly, as we show in this paper, this differentiability property can indeed be leveraged to meaningfully constrain policy updates. Previous approaches (Schulman et al., 2015; 2017; Achiam et al., 2017) constrain policy updates using KL-divergence and safety surrogate functions; for example, Achiam et al. (2017) and Schulman et al. (2017) use Lagrangian methods for this purpose.
Based on the same Lagrangian methods, we consider how to constrain policy updates with differentiable formal specifications (Leung et al., 2020; 2022) equipped with rich quantitative semantics, expressed in the language of Signal Temporal Logic (STL) (Maler & Nickovic, 2004). This semantics gives us the ability to specify various tasks with logic formulas and realize them within a hierarchical reinforcement learning framework. Instead of burdening a single policy with satisfying formal specifications and achieving control tasks simultaneously (Li et al., 2017; Hasanbeig et al., 2019b; Jothimurugan et al., 2019), we choose to learn a hierarchical policy. Hierarchical policies have proven effective in DRL with complex tasks (Sutton et al., 1999; Nachum et al., 2018; Jothimurugan et al., 2021; Icarte et al., 2022). In contrast to previous DRL techniques integrated with LTL (Jothimurugan et al., 2021; Icarte et al., 2022), however, we replace multiple low-level options (Sutton et al., 1999) with a single goal-conditioned policy (Schaul et al., 2015; Nachum et al., 2018). A high-level planning policy, constrained by a formal specification, provides a sequence of goals to guide a low-level control policy to satisfy this specification. Additionally, because we wish for a learned policy to satisfy tasks as quickly as possible, the high-level and low-level policies are jointly trained for both the satisfaction rate and the number of steps required to complete the objective. Finally, we also show novel applications of a neural-ODE policy as a high-level policy and integrate neural network-based predicate functions as part of our specification framework. Our contributions are as follows. (1) We propose a programmable hierarchical reinforcement learning framework constrained by differentiable STL specifications, which avoids the sample complexity challenges of previous reward-shaping work and scales to benchmarks with high-dimensional environments.
(2) We show that jointly training the high-level and low-level policies in this hierarchical framework provides better performance than training these components individually. (3) We demonstrate that our framework can be easily extended with neural predicates for complex specifications, such as irregular geometric obstacles that would be difficult to specify using purely symbolic primitives.

Goal-Conditioned DRL Schaul et al. (2015) show that training a single policy conditioned on multiple goals using only one neural network is feasible and can act as a universal option model (Yao et al., 2014). Given a goal g and an agent observation o_t, the action a_t = π(o_t | g) is predicted by the goal-conditioned policy π. Intuitively, a_t leads the agent "closer" to the goal g. Iteratively calling the policy π(o_t | g) in a loop eventually leads the agent to reach the goal g.

Two-layer Hierarchical DRL Combining a high-level planning policy with a low-level control policy can often expand the range of problems solvable by DRL algorithms (Florensa et al., 2017; Shu et al., 2018). Given a high-level planning policy π_h : G^n → G mapping all historical goals to the next goal, a low-level policy π_l : O × G → A maps an observation conditioned on a goal to the action space A. At time step t, supposing that the low-level policy is pursuing the i-th goal, the action is computed as a_t = π_l(o_t | g_i), where g_i = π_h([g_0, . . . , g_{i-1}]). In this work, the low-level policy π_l is also called the control policy, and the high-level policy is identical to the planning policy.
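The two-layer interaction described above can be sketched as follows. This is a minimal toy stand-in for learned policies, not our method: the proportional controller, the fixed waypoint list, and the point-mass dynamics are all illustrative assumptions.

```python
import numpy as np

def low_level_policy(obs, goal, gain=0.5):
    """Goal-conditioned control policy pi_l(o_t | g): a proportional
    controller steering the agent toward the goal (toy stand-in for a
    learned neural policy)."""
    return gain * (goal - obs)

def high_level_policy(history):
    """Planning policy pi_h: maps all previously issued goals to the next
    goal. Here: a fixed, hypothetical waypoint sequence."""
    waypoints = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])]
    return waypoints[len(history)] if len(history) < len(waypoints) else None

def rollout(obs=np.zeros(2), tol=1e-2, max_steps=200):
    """Iterate pi_l toward each goal g_i = pi_h([g_0, ..., g_{i-1}])."""
    history, trajectory = [], [obs.copy()]
    goal = high_level_policy(history)
    for _ in range(max_steps):
        if goal is None:
            break
        obs = obs + low_level_policy(obs, goal)  # toy dynamics: o_{t+1} = o_t + a_t
        trajectory.append(obs.copy())
        if np.linalg.norm(obs - goal) < tol:     # goal reached: ask planner for next one
            history.append(goal)
            goal = high_level_policy(history)
    return history, trajectory

reached, traj = rollout()
```

In the actual framework both levels are neural networks trained jointly; the loop structure, however, is the same: the planner is queried only when the control policy reaches the current goal.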

Lagrangian Methods in Constrained DRL Lagrangian methods solve constrained optimization problems. For a real vector x, consider the equality-constrained problem: max_x f(x) s.t. h(x) = 0. This can be expressed as an unconstrained problem with a Lagrange multiplier λ. Let L(x, λ) = f(x) + λh(x); then (x*, λ*) = arg min_λ max_x L(x, λ), which can be solved by iteratively updating the primal variable x and the dual variable λ with gradients (Stooke et al., 2020). The multiplier λ acts as a "dynamic" penalty coefficient for the updates of x. Lagrangian methods are widely used in the policy gradient updates of many popular constrained DRL algorithms (Schulman et al., 2017; Achiam et al., 2017). Note that computing the gradient through the constraint function h(x) requires h to be differentiable. Since we want to constrain training with formal specifications, we introduce a differentiable formal specification language in the following sections.

TLTL Syntax and Operator Semantics The syntax of TLTL contains both first-order logic operators ∧ (and), ¬ (not), ∨ (or), ⇒ (implies), etc., and temporal operators ⃝ (next), ♢_[a,b] (eventually), □_[a,b] (globally), and U_[a,b] (until). The start time a and end time b "truncate" a path. For example, □_[a,b] ϕ qualifies a property that holds globally between times a and b. The syntax of TLTL is defined recursively via the following grammar:

ϕ := ⊤ | ⊥ | P | ¬ϕ | ϕ ∧ ψ | ϕ ∨ ψ | ϕ ⇒ ψ | ⃝ϕ | ♢_[a,b] ϕ | □_[a,b] ϕ | ϕ U_[a,b] ψ
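The primal-dual iteration behind the Lagrangian method can be illustrated on a toy scalar problem. This is a minimal sketch, not any specific constrained-DRL algorithm: the quadratic objective, linear constraint, step size, and iteration count are all illustrative assumptions.

```python
def lagrangian_saddle(f_grad, h, h_grad, x0=0.0, lam0=0.0, lr=0.1, iters=5000):
    """Primal-dual gradient iteration for max_x f(x) s.t. h(x) = 0, via
    L(x, lam) = f(x) + lam * h(x): ascend in x, descend in lam."""
    x, lam = x0, lam0
    for _ in range(iters):
        x = x + lr * (f_grad(x) + lam * h_grad(x))  # primal ascent on dL/dx
        lam = lam - lr * h(x)                       # dual descent on dL/dlam = h(x)
    return x, lam

# Toy problem: max -(x - 2)^2  subject to  x - 1 = 0  (constrained optimum x* = 1)
x_star, lam_star = lagrangian_saddle(
    f_grad=lambda x: -2.0 * (x - 2.0),
    h=lambda x: x - 1.0,
    h_grad=lambda x: 1.0,
)
```

The recovered multiplier λ* plays exactly the "dynamic penalty" role described above: it grows (in magnitude) until the constraint violation h(x) is driven to zero. In our setting, h is the robustness of a differentiable specification, so dL/dx can be computed by backpropagating through the formula itself.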
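The real-valued quantitative semantics of the temporal operators can be sketched as follows, with smooth soft-max/soft-min functions standing in for the non-differentiable max/min, in the spirit of Leung et al. (2020). The predicate, signal values, and temperature here are illustrative assumptions; in practice these operations would be implemented in an autodiff framework so that gradients flow through the formula.

```python
import numpy as np

def soft_max(vals, temp=10.0):
    """Smooth, differentiable stand-in for max (used by 'eventually')."""
    return np.log(np.sum(np.exp(temp * np.asarray(vals)))) / temp

def soft_min(vals, temp=10.0):
    """Smooth, differentiable stand-in for min (used by 'globally')."""
    return -soft_max(-np.asarray(vals), temp)

def eventually(rho, a, b):
    """Robustness of eventually_[a,b] phi: soft-max of phi's robustness
    over the time window [a, b]."""
    return soft_max(rho[a:b + 1])

def globally(rho, a, b):
    """Robustness of globally_[a,b] phi: soft-min over the window."""
    return soft_min(rho[a:b + 1])

# Toy predicate P: "signal is above 0", with robustness rho_P(t) = s(t) - 0
signal = np.array([-0.5, -0.2, 0.3, 0.8, 0.4])
rho_p = signal - 0.0
# eventually_[0,4] P is satisfied (positive robustness), since the signal
# rises above 0; globally_[0,4] P is violated (negative robustness).
```

A positive robustness value indicates satisfaction and a negative value indicates violation, and its magnitude quantifies the margin, which is what makes the specification usable as a differentiable constraint function h in the Lagrangian update above.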

