CONSTRAINED HIERARCHICAL DEEP REINFORCEMENT LEARNING WITH DIFFERENTIABLE FORMAL SPECIFICATIONS

Abstract

Formal logic specifications are a useful tool for describing desired agent behavior and have been explored as a means to shape rewards in Deep Reinforcement Learning (DRL) across a variety of problems and domains. Prior reward-shaping work, however, has not considered making these specifications differentiable, which would yield a more informative signal of the objective via the specification gradient. This paper examines precisely such an approach, exploring a Lagrangian method that constrains policy updates using differentiable temporal logic specifications that associate logic formulae with real-valued quantitative semantics. This constrained learning mechanism is then used in a hierarchical setting, where a high-level, specification-guided neural network path planner works with a low-level control policy to navigate through planned waypoints. The effectiveness of our approach is demonstrated over four robot configurations with five different types of Signal Temporal Logic (STL) specifications. Our demo videos are available at https://sites.google.com/view/schrl.

1. INTRODUCTION

Specifying tasks with precise and expressive temporal logic formal specifications has a long history (Pnueli, 1977; Kloetzer & Belta, 2008; Wongpiromsarn et al., 2012; Chaudhuri et al., 2021), but integrating these techniques into modern learning-based systems has been limited by the non-differentiability of the formulas used to construct these specifications. In the context of Deep Reinforcement Learning (DRL), a line of recent work (Li et al., 2017; Hasanbeig et al., 2019b; Jothimurugan et al., 2019; Icarte et al., 2022) tries to circumvent this difficulty by turning Linear Temporal Logic (LTL) specifications into reward functions used to train control policies for the specified tasks. The quantitative semantics these approaches introduce yields real-valued information about the task, which reinforcement learning agents can then exploit via policy gradient methods. However, the sample complexity of such policy gradient approaches limits the scalability of these algorithms, especially when extracting reward functions from complex specifications (Yang et al., 2022). Moreover, these techniques do not consider how to leverage the differentiability of the quantitative semantics associated with these specifications, which yields a more accurate gradient than the policy gradient estimated from the LTL reward and samples. As we show in this paper, this differentiability property can indeed be leveraged to meaningfully constrain policy updates. Previous approaches (Schulman et al., 2015; 2017; Achiam et al., 2017) constrain policy updates using KL-divergence and safety surrogate functions; for example, Achiam et al. (2017) and Schulman et al. (2017) use Lagrangian methods for this purpose.
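To make the differentiability point concrete, the sketch below shows the standard trick of smoothing the max in an "eventually" robustness with log-sum-exp, so that the robustness value admits gradients at every timestep. This is an illustrative toy, not the paper's implementation; the function names `smooth_max` and `eventually_robustness` and the parameters `eps` and `temp` are assumptions made for the example.

```python
import math

def smooth_max(xs, temp=10.0):
    """Differentiable log-sum-exp approximation of max(xs).

    Always >= max(xs), and within log(len(xs)) / temp of it, so larger
    temperatures trade smoothness for accuracy.
    """
    m = max(xs)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(temp * (x - m)) for x in xs)) / temp

def eventually_robustness(dists, eps=0.1, temp=10.0):
    """Smoothed STL robustness of "eventually (dist < eps)".

    The exact quantitative semantics is max_t (eps - dist_t): positive iff
    the goal region is reached at some step. Replacing the hard max with
    smooth_max makes the value differentiable in every dist_t, so a single
    backward pass attributes credit to all timesteps of the trajectory.
    """
    return smooth_max([eps - d for d in dists], temp)
```

A hard max would give a nonzero gradient only at the single best timestep; the smoothed version spreads gradient mass over near-optimal timesteps, which is what makes specification-gradient training informative in practice.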
Based on the same Lagrangian methods, we consider how to constrain policy updates with differentiable formal specifications (Leung et al., 2020; 2022) equipped with rich quantitative semantics, expressed in the language of Signal Temporal Logic (STL) (Maler & Nickovic, 2004). This semantics gives us the ability to specify various tasks with logic formulas and realize them within a hierarchical reinforcement learning framework. Instead of burdening a single policy with satisfying formal specifications and achieving control tasks simultaneously (Li et al., 2017; Hasanbeig et al., 2019b; Jothimurugan et al., 2019), we choose to learn a hierarchical policy. Hierarchical policies have proven effective in DRL with complex tasks.
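The Lagrangian mechanism can be illustrated in miniature on a scalar stand-in for the constrained problem: maximize a reward subject to a differentiable "robustness" constraint g(theta) >= 0, ascending the Lagrangian in the primal variable while running projected descent on the multiplier. Everything here is a hypothetical toy, assuming a quadratic reward and a linear constraint; it is not the paper's algorithm, which applies this scheme to neural policies and STL robustness.

```python
def grad(f, x, h=1e-5):
    # Central finite-difference gradient; a stand-in for autodiff.
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Toy constrained problem: maximize reward(theta) subject to g(theta) >= 0.
reward = lambda th: -(th - 2.0) ** 2   # unconstrained optimum at theta = 2
g = lambda th: 1.0 - th                # "robustness" constraint: theta <= 1

theta, lam = 0.0, 0.0
for _ in range(2000):
    lagrangian = lambda t: reward(t) + lam * g(t)
    theta += 0.01 * grad(lagrangian, theta)  # primal ascent on theta
    lam = max(0.0, lam - 0.01 * g(theta))    # projected dual descent on lambda
# theta approaches the constrained optimum 1.0, with lambda near 2.0.
```

Whenever the constraint is violated (g < 0), the multiplier grows and the reward gradient is increasingly overridden by the constraint gradient; once the constraint holds, the projection lets the multiplier relax. This is the same push-pull dynamic that, at scale, keeps policy updates inside the region where the STL specification is satisfied.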

