CONSTRAINED REINFORCEMENT LEARNING FOR SAFETY-CRITICAL TASKS VIA SCENARIO-BASED PROGRAMMING

Abstract

Deep reinforcement learning (DRL) has achieved groundbreaking successes in various applications, including robotics. A natural consequence is the adoption of this paradigm for safety-critical tasks, where human safety and expensive hardware can be involved. In this context, it is crucial to optimize the performance of DRL-based agents while providing guarantees about their behavior. This paper presents a novel technique for incorporating domain-expert knowledge into a constrained DRL training loop. Our technique exploits the scenario-based programming paradigm, which is designed to specify such knowledge in a simple and intuitive way. While our approach is general purpose, we validated it by performing experiments on a set of synthetic benchmark environments and on the popular robotic mapless navigation problem, both in simulation and on the physical platform. Our results demonstrate that using our approach to leverage expert knowledge dramatically improves the safety and performance of the agent.

1. INTRODUCTION

In recent years, deep neural networks (DNNs) have achieved state-of-the-art results in a large variety of tasks, including image recognition (Du, 2018), game playing (Mnih et al., 2013), protein folding (Jumper et al., 2021), and more. In particular, deep reinforcement learning (DRL) (Sutton & Barto, 2018) has emerged as a popular paradigm for training DNNs that perform complex tasks through continuous interaction with their environment. Indeed, DRL models have proven remarkably useful in robotic control tasks, such as navigation (Kulhánek et al., 2019) and manipulation (Nguyen & La, 2019; Corsi et al., 2021), where they often outperform classical algorithms (Zhu & Zhang, 2021). The success of DRL-based systems has naturally led to their integration as control policies in safety-critical tasks, such as autonomous driving (Sallab et al., 2017), surgical assistance (Pore et al., 2021), flight control (Koch et al., 2019), and more. Consequently, the learning community has been seeking to create DRL-based controllers that simultaneously demonstrate high performance and high reliability; i.e., are able to perform their primary tasks while adhering to some prescribed properties, such as safety and robustness. An emerging family of approaches for achieving these two goals, known as constrained DRL (Achiam et al., 2017), attempts to simultaneously optimize two functions: the reward, which encodes the main objective of the task; and the cost, which represents the safety constraints. Current state-of-the-art algorithms include IPO (Liu et al., 2020), SOS (Marchesini et al., 2021b), CPO (Achiam et al., 2017), and Lagrangian approaches (Ray et al., 2019).
Despite their success in some applications, these methods generally suffer from significant setbacks: (i) there is no uniform and human-readable way of defining the required safety constraints; (ii) it is unclear how to encode these constraints as a signal for the training algorithm; and (iii) there is no clear method for balancing cost and reward during training, and thus there is a risk of producing sub-optimal policies. In this paper, we present a novel approach for addressing these challenges by enabling users to encode constraints into the DRL training loop in a simple yet powerful way. Our approach generates policies that strictly adhere to these user-defined constraints without compromising performance. We achieve this by extending and integrating two approaches: the Lagrangian-PPO algorithm (Ray et al., 2019) for DRL training, and the scenario-based programming (SBP) framework (Damm & Harel, 2001; Harel et al., 2012b) for encoding user-defined constraints. Scenario-based programming is a software-engineering paradigm intended to allow engineers to create a complex system in a way that is aligned with how humans perceive that system. A scenario-based program is comprised of scenarios, each of which describes a single desirable (or undesirable) behavior of the system at hand; these scenarios are then run simultaneously, in order to produce cohesive system behavior. We show how such scenarios can be used to directly incorporate subject-matter-expert (SME) knowledge into the training process, thus forcing the resulting agent's behavior to abide by various safety, efficiency, and predictability requirements. To demonstrate the usefulness of our approach for safety-critical tasks, we used it to train a policy for robotic mapless navigation (Zhang et al., 2017; Tai et al., 2017) on the Robotis Turtlebot3 platform.
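To make the scenario idea concrete, the following Python sketch shows how a single "blocking" scenario could veto unsafe actions when scenarios are composed. The names (`KeepDistance`, `select_action`) and the distance threshold are illustrative, not taken from the paper's implementation:

```python
# Toy sketch of scenario composition in the spirit of scenario-based
# programming (Harel et al., 2012b): each scenario can block actions it
# deems unsafe, and the composer only permits unblocked actions.

class KeepDistance:
    """A hypothetical SME scenario: forbid driving forward near an obstacle."""

    def blocked(self, state):
        # Block the "forward" action whenever the frontal distance is unsafe.
        return {"forward"} if state["front_dist"] < 0.3 else set()

def select_action(candidate_actions, scenarios, state):
    """Return the first candidate action not blocked by any scenario."""
    blocked = set()
    for scenario in scenarios:
        blocked |= scenario.blocked(state)
    for action in candidate_actions:
        if action not in blocked:
            return action
    return None  # every candidate is blocked; the caller must fall back

# Example: with an obstacle 0.1m ahead, "forward" is vetoed and "left" is chosen.
```

In this toy composition, candidate actions would come from the learned policy, ordered by preference, and the scenarios act as a declarative safety filter on top of it.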
While common DRL-training techniques have been shown to give rise to high-performance policies for this task (Marchesini & Farinelli, 2020), these policies are often unsafe, inefficient, or unpredictable, thus dramatically limiting their potential deployment in real-world systems (Marchesini et al., 2021a; b). Our experiments demonstrate that, by using our novel approach and injecting subject-matter-expert knowledge into the training process, we are able to generate trustworthy policies that are both safe and high-performing. To fully assess the resulting behaviors, we performed a formal-verification analysis of various predefined safety properties, following methods such as those of Katz et al. (2017) and Liu et al. (2019); this analysis showed that our approach generates agents that are safe to deploy in any environment.

2. BACKGROUND

Deep Reinforcement Learning. Deep reinforcement learning (Li, 2017) is a specific paradigm for training deep neural networks (Goodfellow et al., 2016). In DRL, the training objective is to find a policy that maximizes the expected cumulative discounted reward $R_t = \mathbb{E}\left[\sum_t \gamma^t \cdot r_t\right]$, where $\gamma \in [0, 1]$ is the discount factor, a hyperparameter that controls the impact of past decisions on the total expected reward. The policy, denoted as $\pi_\theta$, is a probability distribution that depends on the parameters $\theta$ of the DNN, and maps an observed environment state $s$ to an action $a$. Proximal policy optimization (PPO) is a state-of-the-art DRL algorithm for producing $\pi_\theta$ (Schulman et al., 2017). A key characteristic of PPO is that it limits the gradient step size between two consecutive policy updates during training, to avoid changes that could drastically modify $\pi_\theta$ (Schulman et al., 2015). In mission-critical tasks, the concept of optimality often goes beyond the maximization of a reward, and also involves "hard" safety constraints that the agent must respect. A constrained Markov decision process (CMDP) is an alternative framework for sequential decision making, which includes an additional signal: the cost function, defined as $C : S \times A \to \mathbb{R}$, whose expected value must remain below a given threshold $d \in \mathbb{R}$. A CMDP can support multiple cost functions and thresholds, denoted by $\{C_k\}$ and $\{d_k\}$, respectively. The set of valid policies for a CMDP is defined as:
$$\Pi_C := \{\pi_\theta \in \Pi : \forall k,\ J_{C_k}(\pi_\theta) \le d_k\} \quad (1)$$
where $J_{C_k}(\pi_\theta)$ is the expected sum of the $k$-th cost function over the trajectory, and $d_k$ is the corresponding threshold. Intuitively, the objective is to find a policy that respects the constraints (i.e., is valid) and also maximizes the expected reward (i.e., is optimal). A natural way to encode constraints in a classical optimization problem is by using Lagrange multipliers.
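The validity condition of Eq. (1) can be sketched in a few lines of Python; here the trajectory data, discount factor, and thresholds are purely illustrative, and expectations are approximated by sums over a single sampled trajectory:

```python
# Minimal sketch of the CMDP validity check: a policy is valid when every
# (discounted) cost return J_{C_k} stays below its threshold d_k.

def discounted_sum(signal, gamma):
    """Compute sum_t gamma^t * signal_t for one trajectory."""
    return sum((gamma ** t) * x for t, x in enumerate(signal))

def is_valid(cost_trajectories, thresholds, gamma=0.99):
    """cost_trajectories: {k: [c_0, c_1, ...]}, one list per cost function k."""
    return all(
        discounted_sum(trajectory, gamma) <= thresholds[k]
        for k, trajectory in cost_trajectories.items()
    )
```

In practice the cost returns would be estimated from many rollouts rather than a single trajectory, but the per-constraint comparison against $d_k$ is exactly the condition defining $\Pi_C$.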
Specifically, in DRL, a possible approach is to transform the constrained problem into the corresponding dual unconstrained version (Liu et al., 2020; Achiam et al., 2017). The optimization problem can then be encoded as follows:
$$J(\theta) = \min_{\pi_\theta} \max_{\lambda \ge 0} L(\pi_\theta, \lambda) = \min_{\pi_\theta} \max_{\lambda \ge 0} \left[ J_R(\pi_\theta) - \sum_{k=1}^{K} \lambda_k \left( J_{C_k}(\pi_\theta) - d_k \right) \right] \quad (2)$$
Crucially, the optimization of the function $J(\theta)$ can be carried out by applying any policy-gradient algorithm; a common implementation is based on PPO (Ray et al., 2019).
Scenario-Based Programming. Scenario-based programming (SBP) (Damm & Harel, 2001; Harel & Marelly, 2003) is a paradigm designed to facilitate the development of reactive systems, by allowing engineers to program a system in a way that is close to how it is perceived by humans.
(The supplementary material includes the appendices. The code will be released upon publication.)
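As a rough illustration of the dual update underlying Eq. (2) in Lagrangian methods (Ray et al., 2019), each multiplier $\lambda_k$ can be adjusted by gradient ascent on its constraint violation $J_{C_k} - d_k$ and projected back to $\lambda_k \ge 0$. The function names, learning rate, and thresholds below are illustrative assumptions, not the paper's actual implementation:

```python
# Sketch of the projected dual-ascent step on the Lagrange multipliers:
# lambda_k <- max(0, lambda_k + lr * (J_{C_k} - d_k)).
# Multipliers grow while a constraint is violated and shrink toward zero
# once the policy satisfies it.

def update_multipliers(lams, cost_returns, thresholds, lr=0.05):
    """One projected gradient-ascent step per constraint k."""
    new_lams = {}
    for k, lam in lams.items():
        violation = cost_returns[k] - thresholds[k]  # J_{C_k}(pi) - d_k
        new_lams[k] = max(0.0, lam + lr * violation)
    return new_lams

def lagrangian_loss(reward_return, lams, cost_returns, thresholds):
    """L(pi, lambda) = J_R - sum_k lambda_k * (J_{C_k} - d_k), as in Eq. (2)."""
    penalty = sum(
        lam * (cost_returns[k] - thresholds[k]) for k, lam in lams.items()
    )
    return reward_return - penalty
```

The policy parameters $\theta$ would then be updated by any policy-gradient step (e.g., PPO) on this Lagrangian, alternating with the multiplier update above.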

