ENFORCING HARD CONSTRAINTS WITH SOFT BARRIERS: SAFE REINFORCEMENT LEARNING IN UNKNOWN STOCHASTIC ENVIRONMENTS

Abstract

It is quite challenging to ensure the safety of reinforcement learning (RL) agents in an unknown and stochastic environment under hard constraints that require the system state not to reach certain specified unsafe regions. Many popular safe RL methods, such as those based on the Constrained Markov Decision Process (CMDP) paradigm, encode safety violations in a cost function and try to constrain the expectation of cumulative cost under a threshold. However, it is often difficult to effectively capture and enforce hard reachability-based safety constraints indirectly through such constraints on safety violation cost. In this work, we leverage the notion of barrier function to explicitly encode the hard safety constraints, and, given that the environment is unknown, relax them to our design of generative-model-based soft barrier functions. Based on such soft barriers, we propose a safe RL approach that can jointly learn the environment and optimize the control policy, while effectively avoiding unsafe regions through safety probability optimization. Experiments on a set of examples demonstrate that our approach can effectively enforce hard safety constraints and significantly outperform CMDP-based baseline methods in terms of system safety rate measured via simulations.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton & Barto, 2018) has shown promising successes in learning complex policies for games (Silver et al., 2018), robots (Zhao et al., 2020), and recommender systems (Afsar et al., 2021), by maximizing a cumulative reward objective as the optimization goal. However, real-world safety-critical applications, such as autonomous cars and unmanned aerial vehicles (UAVs), still hesitate to adopt RL policies due to safety concerns. In particular, these applications often have hard safety constraints that require the system state not to reach certain specified unsafe regions, e.g., autonomous cars not deviating into adjacent lanes or UAVs not colliding with trees. It is very challenging to learn a policy via RL that can meet such hard safety constraints, especially when the environment is stochastic and unknown.

In the literature, the Constrained Markov Decision Process (CMDP) (Altman, 1999) is a popular paradigm for addressing RL safety. Common CMDP-based methods encode safety constraints through a cost function of safety violations, and reduce the policy search space to policies under which the expectation of cumulative discounted cost is less than a threshold. Various RL algorithms have been proposed to adaptively solve CMDP through the primal-dual approach for the Lagrangian relaxation of CMDP. However, it is often hard for CMDP-based methods to enforce reachability-based hard safety constraints (i.e., the system state never reaching unsafe regions) by setting indirect constraints on the expectation of cumulative cost. In particular, while reachability-based safety constraints are defined on the system state at each time point (i.e., each point on the trajectory), the CMDP constraints only bound the cumulative behavior. In other words, the cost penalty incurred when the system visits the unsafe regions at certain time points may be offset by low cost at other times.
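To make the contrast concrete, the two kinds of constraints can be written as follows. The notation here is illustrative (not taken verbatim from this paper): $r$ and $c$ denote the reward and safety-violation cost, $\gamma$ the discount factor, $d$ the cost threshold, and $\mathcal{S}_u$ the specified unsafe region.

```latex
% CMDP: constrain only the expected cumulative discounted cost
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\!\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\!\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\Big] \le d

% Reachability-based hard constraint: a condition on every state of the trajectory
s_t \notin \mathcal{S}_u \quad \text{for all } t \ge 0
```

The CMDP constraint aggregates cost over the whole trajectory, so a brief excursion into $\mathcal{S}_u$ can be averaged away by low-cost behavior elsewhere, whereas the hard constraint must hold pointwise at every time step.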
There is a recent CMDP approach that addresses hard safety constraints by using an indicator function to encode failure probability (Wagener et al., 2021), but it requires a safe backup policy for intervention, which is difficult to obtain in unknown environments. Safe exploration with hard safety constraints has been studied in (Wachi et al., 2018; Turchetta et al., 2016; Moldovan & Abbeel, 2012). However,

