ENFORCING HARD CONSTRAINTS WITH SOFT BARRIERS: SAFE REINFORCEMENT LEARNING IN UNKNOWN STOCHASTIC ENVIRONMENTS

Abstract

It is quite challenging to ensure the safety of reinforcement learning (RL) agents in an unknown and stochastic environment under hard constraints that require the system state not to reach certain specified unsafe regions. Many popular safe RL methods, such as those based on the Constrained Markov Decision Process (CMDP) paradigm, encode safety violations as a cost function and try to constrain the expectation of cumulative cost under a threshold. However, it is often difficult to effectively capture and enforce hard reachability-based safety constraints indirectly with such constraints on safety violation cost. In this work, we leverage the notion of barrier function to explicitly encode the hard safety constraints, and, given that the environment is unknown, relax them to our design of generative-model-based soft barrier functions. Based on such soft barriers, we propose a safe RL approach that can jointly learn the environment and optimize the control policy, while effectively avoiding the unsafe regions via safety probability optimization. Experiments on a set of examples demonstrate that our approach can effectively enforce hard safety constraints and significantly outperform CMDP-based baseline methods in system safe rate, as measured via simulations.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton & Barto, 2018) has shown promising successes in learning complex policies for games (Silver et al., 2018), robots (Zhao et al., 2020), and recommender systems (Afsar et al., 2021), by maximizing a cumulative reward objective as the optimization goal. However, real-world safety-critical applications, such as autonomous cars and unmanned aerial vehicles (UAVs), still hesitate to adopt RL policies due to safety concerns. In particular, these applications often have hard safety constraints that require the system state not to reach certain specified unsafe regions, e.g., autonomous cars not deviating into adjacent lanes or UAVs not colliding with trees. Learning a policy via RL that can meet such hard safety constraints is very challenging, especially when the environment is stochastic and unknown.

In the literature, the Constrained Markov Decision Process (CMDP) (Altman, 1999) is a popular paradigm for addressing RL safety. Common CMDP-based methods encode safety constraints through a cost function of safety violations, and reduce the policy search space to where the expectation of cumulative discounted cost is less than a threshold. Various RL algorithms have been proposed to adaptively solve CMDP through the primal-dual approach for the Lagrangian problem of CMDP. However, it is often hard for CMDP-based methods to enforce reachability-based hard safety constraints (i.e., the system state not reaching unsafe regions) by setting indirect constraints on the expectation of cumulative cost. In particular, while reachability-based safety constraints are defined on the system state at each time point (i.e., each point on the trajectory), the CMDP constraints only enforce the cumulative behavior. In other words, the cost penalty incurred when the system visits the unsafe regions at certain time points may be offset by low cost at other times.
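To make the contrast concrete, the two constraint types can be written side by side (notation ours, for illustration; the paper's formal definitions may differ). A CMDP constrains an expectation of discounted cumulative cost, whereas a reachability-based hard constraint restricts every state on the trajectory:

$$
\text{CMDP:}\quad \max_{\pi}\ \mathbb{E}\!\left[\sum_{t=0}^{\infty}\gamma^{t} r(s_t,a_t)\right] \quad \text{s.t.}\quad \mathbb{E}\!\left[\sum_{t=0}^{\infty}\gamma^{t} c(s_t,a_t)\right]\le d,
$$

$$
\text{Reachability:}\quad s_t \notin \mathcal{X}_u \ \ \forall t \ge 0, \qquad \text{or, probabilistically,}\quad \Pr\!\left(\exists\, t:\ s_t \in \mathcal{X}_u\right) \le \epsilon.
$$

The offsetting issue described above is visible here: a trajectory with $s_{t_0} \in \mathcal{X}_u$ at a single time $t_0$ can still satisfy the CMDP constraint if its costs elsewhere are low, yet it violates the reachability constraint.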
There is a recent CMDP approach that addresses hard safety constraints by using an indicator function to encode failure probability (Wagener et al., 2021), but it requires a safe backup policy for intervention, which is difficult to achieve in unknown environments. Safe exploration with hard safety constraints has been studied in (Wachi et al., 2018; Turchetta et al., 2016; Moldovan & Abbeel, 2012). However, these works focus on discrete state and action spaces where the hard safety constraints are defined as a set of unsafe state-action pairs that cannot be visited, which differs from our continuous control setting. On the other hand, current control-theoretic approaches for model-based safe RL often leverage formal methods to handle hard safety constraints, e.g., by establishing safety guarantees through barrier functions or control barrier functions (Luo & Ma, 2021), or by shielding mechanisms based on reachability analysis that check whether the system may enter the unsafe regions within a time horizon (Bastani et al., 2021). However, these approaches either require explicitly known system models for barrier or shield construction, or an initial safe policy to generate safe trajectory data in a deterministic environment. They cannot be applied to our unknown stochastic environments.
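For reference, a standard barrier-certificate formulation (notation ours, following the classical model-based setting, not this paper's soft relaxation) for known dynamics $\dot{x} = f(x)$, initial set $\mathcal{X}_0$, and unsafe set $\mathcal{X}_u$ requires a function $B$ with:

$$
B(x) \le 0 \ \ \forall x \in \mathcal{X}_0, \qquad B(x) > 0 \ \ \forall x \in \mathcal{X}_u, \qquad \nabla B(x)^{\top} f(x) \le 0 \ \ \forall x.
$$

Since $B$ is non-increasing along every trajectory and starts non-positive on $\mathcal{X}_0$, no trajectory from $\mathcal{X}_0$ can reach $\mathcal{X}_u$. Constructing such a $B$ requires the model $f$, which is exactly what is unavailable in our unknown-environment setting and what motivates the soft, learned relaxation below.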

[Figure 1 artwork: a robot asks "Will I collide?" CMDP's answer: "The expectation of cumulative cost looks good. So it should be fine?" Our answer: "You are bounded by a soft barrier function at each point on the trajectory, and are highly likely to be safe."]
Figure 1: An RL-based robot navigation example that shows the difference between our approach and CMDP-based ones in encoding the hard safety constraints.

To overcome the above challenges, we propose a safe RL framework that encodes the hard safety constraints via the learning of a generative-model-based soft barrier function. Specifically, we formulate and solve a novel bilevel optimization problem to learn the policy with joint soft barrier function learning, generative modeling, and reward optimization. The soft barrier function provides guidance for avoiding unsafe regions based on safety probability analysis and optimization. The generative model uses trajectory data from the environment-policy closed-loop system, represented as a stochastic differential equation (SDE), to learn the dynamics and stochasticity of the environment. We further optimize the policy by maximizing the total discounted reward of synthetic trajectories sampled from the generative model. This joint training framework is fully differentiable and can be efficiently solved via gradients. Compared to CMDP-based methods, our approach more directly encodes the hard safety constraints along each point of the agent trajectory through the soft barrier function, as shown in Figure 1. While, given the unknown stochastic environment, our approach cannot provide a hard barrier and hence no deterministic safety guarantee, experimental results demonstrate that in simulations it can significantly outperform the CMDP-based baselines in system safe rate.

The paper is organized as follows. Section 2 introduces related work. Section 3 presents our approach, including the bilevel optimization formulation; our safe RL algorithm with generative modeling, soft barrier function learning, and policy optimization to solve the formulation; and theoretical analysis of safety probability. Section 4 shows the experiments and Section 5 concludes the paper.
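The training loop described above can be sketched in miniature. The snippet below is an illustration only, not the paper's implementation: `drift`, `diffusion`, `policy`, and `soft_barrier` are hypothetical toy stand-ins for the learned SDE generative model, the control policy, and the soft barrier, on a one-dimensional state with unsafe region |x| > 1. It shows how a synthetic trajectory is sampled from the SDE model via Euler-Maruyama and scored with both a discounted reward and a barrier-violation penalty, the two signals the joint objective trades off.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical stand-ins for the learned components (illustration only) ---
def drift(x, u):
    """Learned drift term of the closed-loop SDE (toy linear model here)."""
    return -0.5 * x + u

def diffusion(x):
    """Learned diffusion (noise-scale) term (toy constant here)."""
    return 0.1

def policy(x, theta):
    """A simple linear feedback policy with scalar parameter theta."""
    return theta * x

def soft_barrier(x):
    """Soft barrier value: positive inside the toy unsafe region |x| > 1."""
    return abs(x) - 1.0

def rollout(theta, x0=0.2, horizon=50, dt=0.05, gamma=0.99):
    """Sample one synthetic trajectory from the generative SDE model via
    Euler-Maruyama, returning the discounted reward and the accumulated
    soft-barrier violation penalty along the trajectory."""
    x, reward, penalty = x0, 0.0, 0.0
    for t in range(horizon):
        u = policy(x, theta)
        x = x + drift(x, u) * dt + diffusion(x) * rng.normal() * np.sqrt(dt)
        reward += gamma**t * (-x * x)         # toy quadratic reward
        penalty += max(0.0, soft_barrier(x))  # hinge on barrier violation
    return reward, penalty

r, p = rollout(theta=-0.5)
```

In the full framework, both quantities would be differentiable in the policy and model parameters, so a gradient step can jointly increase reward and decrease the barrier penalty; here we only sample and score one trajectory.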

2. RELATED WORK

Safe RL by CMDP: CMDP-based methods encode safety violations as a cost function and set constraints on the expectation of cumulative discounted total cost. Primal-dual approaches have been widely adopted to solve the Lagrangian problem of constrained policy optimization, such as PDO (Chow et al., 2017), OPDOP (Ding et al., 2021), CPPO (Stooke et al., 2020), FOCOPS (Zhang et al., 2020), and CRPO (Xu et al., 2021). Other works leverage world model learning (As et al., 2021) or the Lyapunov function to solve the CMDP (Chow et al., 2018), or add a safety layer for the safety constraint (Dalal et al., 2018). However, the constraints in CMDP cannot directly encode the hard reachability-based safety properties, which hinders its application to many safety-critical systems. A recent CMDP-based work uses an indicator function to encode failure probability as a hard safety constraint, but it requires a safe backup policy for intervention (Wagener et al., 2021).

Model-based Safe RL by Formal Methods: Formal analysis and verification techniques have been proposed in model-based safe RL to enforce that the system does not reach unsafe regions. Some works develop shielding mechanisms with a backup policy based on reachability analysis (Shao et al., 2021; Li & Bastani, 2020; Bastani et al., 2021). Other works adopt (control) barrier functions or (control) Lyapunov functions for provable safety (Emam et al., 2021; Choi et al., 2020; Cheng et al., 2019; Wang et al., 2022; Ma et al., 2021; Luo & Ma, 2021; Berkenkamp et al., 2017; Taylor et al.,

