CONSTRAINED REINFORCEMENT LEARNING FOR SAFETY-CRITICAL TASKS VIA SCENARIO-BASED PROGRAMMING

Abstract

Deep reinforcement learning (DRL) has achieved groundbreaking successes in various applications, including robotics. A natural consequence is the adoption of this paradigm for safety-critical tasks, where human safety and expensive hardware can be involved. In this context, it is crucial to optimize the performance of DRL-based agents while providing guarantees about their behavior. This paper presents a novel technique for incorporating domain-expert knowledge into a constrained DRL training loop. Our technique exploits the scenario-based programming paradigm, designed to specify such knowledge in a simple and intuitive way. While our approach is general purpose, we validated it through experiments on a set of synthetic benchmark environments and on the popular robotic mapless navigation problem, both in simulation and on an actual robotic platform. Our results demonstrate that using our approach to leverage expert knowledge dramatically improves the safety and performance of the agent.

1. INTRODUCTION

In recent years, deep neural networks (DNNs) have achieved state-of-the-art results in a large variety of tasks, including image recognition (Du, 2018), game playing (Mnih et al., 2013), protein folding (Jumper et al., 2021), and more. In particular, deep reinforcement learning (DRL) (Sutton & Barto, 2018) has emerged as a popular paradigm for training DNNs that perform complex tasks through continuous interaction with their environment. Indeed, DRL models have proven remarkably useful in robotic control tasks, such as navigation (Kulhánek et al., 2019) and manipulation (Nguyen & La, 2019; Corsi et al., 2021), where they often outperform classical algorithms (Zhu & Zhang, 2021). The success of DRL-based systems has naturally led to their integration as control policies in safety-critical tasks, such as autonomous driving (Sallab et al., 2017), surgical assistance (Pore et al., 2021), flight control (Koch et al., 2019), and more. Consequently, the learning community has been seeking to create DRL-based controllers that simultaneously demonstrate high performance and high reliability; i.e., are able to perform their primary tasks while adhering to some prescribed properties, such as safety and robustness. An emerging family of approaches for achieving these two goals, known as constrained DRL (Achiam et al., 2017), attempts to simultaneously optimize two functions: the reward, which encodes the main objective of the task; and the cost, which represents the safety constraints. Current state-of-the-art algorithms include IPO (Liu et al., 2020), SOS (Marchesini et al., 2021b), CPO (Achiam et al., 2017), and Lagrangian approaches (Ray et al., 2019).
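To make the reward/cost trade-off concrete, the constrained DRL setting is commonly formalized as a constrained Markov decision process (CMDP). The following is the standard formulation (the symbols below, e.g., the cost threshold $d$ and multiplier $\lambda$, follow common convention and are not taken from this paper): the agent maximizes the expected discounted reward subject to a bound on the expected discounted cost,

```latex
\max_{\pi} \; J_R(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
J_C(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \, c(s_t, a_t)\right] \leq d .
```

Lagrangian approaches, such as the Lagrangian-PPO algorithm mentioned below, relax this constrained problem into an unconstrained min-max objective,

```latex
\min_{\lambda \geq 0} \; \max_{\pi} \; J_R(\pi) - \lambda \left( J_C(\pi) - d \right),
```

where the multiplier $\lambda$ is adapted during training: it grows when the policy violates the cost threshold, penalizing unsafe behavior, and shrinks toward zero when the constraint is satisfied.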
Despite their success in some applications, these methods generally suffer from significant setbacks: (i) there is no uniform, human-readable way of defining the required safety constraints; (ii) it is unclear how to encode these constraints as a signal for the training algorithm; and (iii) there is no clear method for balancing cost and reward during training, which risks producing sub-optimal policies. In this paper, we present a novel approach for addressing these challenges by enabling users to encode constraints into the DRL training loop in a simple yet powerful way. Our approach generates policies that strictly adhere to these user-defined constraints without compromising performance. We achieve this by extending and integrating two approaches: the Lagrangian-PPO algorithm (Ray et al., 2019) for DRL training, and scenario-based programming (SBP) (Damm & Harel, 2001;

