ENFORCING HARD CONSTRAINTS WITH SOFT BARRIERS: SAFE REINFORCEMENT LEARNING IN UNKNOWN STOCHASTIC ENVIRONMENTS

Abstract

It is quite challenging to ensure the safety of reinforcement learning (RL) agents in an unknown and stochastic environment under hard constraints that require the system state not to reach certain specified unsafe regions. Many popular safe RL methods such as those based on the Constrained Markov Decision Process (CMDP) paradigm formulate safety violations in a cost function and try to constrain the expectation of cumulative cost under a threshold. However, it is often difficult to effectively capture and enforce hard reachability-based safety constraints indirectly with such constraints on safety violation cost. In this work, we leverage the notion of barrier function to explicitly encode the hard safety constraints, and given that the environment is unknown, relax them to our design of generative-model-based soft barrier functions. Based on such soft barriers, we propose a safe RL approach that can jointly learn the environment and optimize the control policy, while effectively avoiding the unsafe regions with safety probability optimization. Experiments on a set of examples demonstrate that our approach can effectively enforce hard safety constraints and significantly outperform CMDP-based baseline methods in system safe rate measured via simulations.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton & Barto, 2018) has shown promising successes in learning complex policies for games (Silver et al., 2018), robots (Zhao et al., 2020), and recommender systems (Afsar et al., 2021), by maximizing a cumulative reward objective as the optimization goal. However, real-world safety-critical applications, such as autonomous cars and unmanned aerial vehicles (UAVs), still hesitate to adopt RL policies due to safety concerns. In particular, these applications often have hard safety constraints that require the system state not to reach certain specified unsafe regions, e.g., autonomous cars must not deviate into adjacent lanes and UAVs must not collide with trees. Learning a policy via RL that meets such hard safety constraints is very challenging, especially when the environment is stochastic and unknown.

In the literature, the Constrained Markov Decision Process (CMDP) (Altman, 1999) is a popular paradigm for addressing RL safety. Common CMDP-based methods encode safety constraints through a cost function of safety violations, and reduce the policy search space to where the expectation of cumulative discounted cost is less than a threshold. Various RL algorithms have been proposed to adaptively solve the CMDP through the primal-dual approach for its Lagrangian problem. However, it is often hard for CMDP-based methods to enforce reachability-based hard safety constraints (i.e., the system state not reaching unsafe regions) by setting indirect constraints on the expectation of cumulative cost. In particular, while reachability-based safety constraints are defined on the system state at each time point (i.e., each point on the trajectory), the CMDP constraints only enforce the cumulative behavior. In other words, the cost penalty for visiting the unsafe regions at a certain time point may be offset by low cost at other times.
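The offsetting effect can be illustrated with a minimal sketch that compares a cumulative-cost check with a per-step reachability check on one hand-made trajectory. The cost values, the threshold `d`, and both helper functions are illustrative assumptions and are not taken from any of the cited methods.

```python
# Illustrative only: the costs and the threshold are made-up numbers.

def cumulative_cost_ok(costs, gamma, d):
    """CMDP-style check: discounted cumulative cost stays at or below d."""
    total = sum((gamma ** t) * c for t, c in enumerate(costs))
    return total <= d

def reachability_ok(unsafe_flags):
    """Hard constraint: no state on the trajectory may be unsafe."""
    return not any(unsafe_flags)

# Assumed per-step cost: negative when safe, positive inside the unsafe region.
costs = [-1.0, -1.0, -1.0, 2.0, -1.0, -1.0]
unsafe_flags = [c > 0 for c in costs]

cmdp_satisfied = cumulative_cost_ok(costs, gamma=0.99, d=0.0)
hard_satisfied = reachability_ok(unsafe_flags)
print(cmdp_satisfied, hard_satisfied)
```

In this toy trajectory the single unsafe visit is outweighed by the negative costs of the other steps, so the cumulative check passes while the per-step check fails.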
There is a recent CMDP approach addressing hard safety constraints by using the indicator function for encoding failure probability (Wagener et al., 2021), but it requires a safe backup policy for intervention, which is difficult to obtain in unknown environments. Safe exploration with hard safety constraints has been studied in (Wachi et al., 2018; Turchetta et al., 2016; Moldovan & Abbeel, 2012). However, these works focus on discrete state and action spaces where the hard safety constraints are defined as a set of unsafe state-action pairs that cannot be visited, which differs from our continuous control setting. On the other hand, current control-theoretical approaches for model-based safe RL often try to leverage formal methods to handle hard safety constraints, e.g., by establishing safety guarantees through barrier functions or control barrier functions (Luo & Ma, 2021), or by shielding mechanisms based on reachability analysis to check whether the system may enter the unsafe regions within a time horizon (Bastani et al., 2021). However, these approaches require either explicit known system models for barrier or shielding construction, or an initial safe policy to generate safe trajectory data in a deterministic environment. They cannot be applied to our unknown stochastic environments.

Figure 1: An RL-based robot navigation example that shows the difference between our approach and CMDP-based ones in encoding the hard safety constraints. Panel text: the robot asks "Will I collide?"; CMDP: "The expectation of cumulative cost looks good. So should be fine?"; Ours: "You are bounded by a soft barrier function along each point on the trajectory, and are highly likely to be safe."

To overcome the above challenges, we propose a safe RL framework that encodes the hard safety constraints via the learning of a generative-model-based soft barrier function. Specifically, we formulate and solve a novel bi-level optimization problem to learn the policy with joint soft barrier function learning, generative modeling, and reward optimization. The soft barrier function provides guidance for avoiding unsafe regions based on safety probability analysis and optimization. The generative model accesses the trajectory data from the environment-policy closed-loop system and uses a stochastic differential equation (SDE) representation to learn the dynamics and stochasticity of the environment. We further optimize the policy by maximizing the total discounted reward of the synthetic trajectories sampled from the generative model. This joint training framework is fully differentiable and can be efficiently solved via gradients. Compared to CMDP-based methods, our approach more directly encodes the hard safety constraints along each point of the agent trajectory through the soft barrier function, as shown in Figure 1. Although, given the unknown stochastic environment, our approach cannot provide a hard barrier and hence offers no deterministic safety guarantee, experimental results demonstrate that in simulations it can significantly outperform the CMDP-based baselines in system safe rate.

The paper is organized as follows. Section 2 introduces related work. Section 3 presents our approach, including the bi-level optimization formulation, our safe RL algorithm with generative modeling, soft barrier function learning, and policy optimization to solve the formulation, and the theoretical analysis of safety probability. Section 4 shows the experiments and Section 5 concludes the paper.

2. RELATED WORK

Safe RL by CMDP: CMDP-based methods encode the safety violation as a cost function and set constraints on the expectation of cumulative discounted total cost. Primal-dual approaches have been widely adopted to solve the Lagrangian problem of constrained policy optimization, such as PDO (Chow et al., 2017), OPDOP (Ding et al., 2021), CPPO (Stooke et al., 2020), FOCOPS (Zhang et al., 2020), and CRPO (Xu et al., 2021). Other works leverage world model learning (As et al., 2021) or the Lyapunov function to solve the CMDP (Chow et al., 2018), or add a safety layer for the safety constraint (Dalal et al., 2018). However, the constraints in CMDP cannot directly encode the hard reachability-based safety properties, which hinders its application to many safety-critical systems. A recent CMDP-based work uses the indicator function for encoding failure probability as hard safety constraints, but it requires a safe backup policy for intervention (Wagener et al., 2021).

Model-based Safe RL by Formal Methods: Formal analysis and verification techniques have been proposed in model-based safe RL to keep the system from reaching unsafe regions. Some works develop shielding mechanisms with a backup policy based on reachability analysis (Shao et al., 2021; Li & Bastani, 2020; Bastani et al., 2021). Other works adopt (control) barrier functions or (control) Lyapunov functions for provable safety (Emam et al., 2021; Choi et al., 2020; Cheng et al., 2019; Wang et al., 2022; Ma et al., 2021; Luo & Ma, 2021; Berkenkamp et al., 2017; Taylor et al., 2020). Moreover, recent work (Yu et al., 2022) adopts reachability analysis with CMDP to compute safe feasible sets. However, these methods require known dynamics, a safe initial/backup policy, or human intervention, and thus do not apply to our setting.
Barrier Function for Safety: The barrier function was introduced as a safety certificate associated with the control policy for deterministic and stochastic systems (Prajna & Jadbabaie, 2004; Prajna et al., 2004). In classical control, finding a barrier function is time-consuming and requires a lot of manual effort, and a common idea is to relax the conditions of the barrier function into optimization formulations such as linear programming (Yang et al., 2016), quadratic programming (Ames et al., 2016), and sum-of-squares programming (Wang et al., 2022). However, these optimization-based approaches can hardly scale to high-dimensional systems. To this end, recent works have shown great promise in jointly training the barrier function and a safe policy with neural network representations for better scalability (Qin et al., 2021). Our approach leverages the paradigm of barrier functions, but develops the concept of a soft barrier to address unknown stochastic environments.

RL with Generative Model: Previous works on generative-model-based RL mainly focus on sample efficiency and policy optimization for the total expected return (Agarwal et al., 2020b; Li et al., 2020a; Tirinzoni et al., 2020). Some works (HasanzadeZonuzy et al., 2021; Maeda et al., 2021) address safe RL by CMDP with a generative model, but only handle tabular discrete state and action spaces. Besides policy optimization, the generative model in our framework also plays an important role in building a soft barrier function to facilitate the probabilistic safety analysis and optimization.

3. OUR APPROACH

In this section, we present our framework for safe RL in unknown stochastic environments that enforces hard safety constraints with soft barrier functions. In Section 3.1, we present our bi-level optimization formulation for the problem, which maximizes the total expected return while trying to avoid unsafe regions. Specifically, we encode the hard safety constraints with a novel generative-model-based soft barrier function in the lower problem and maximize the performance with generative model learning in the upper problem. We then present our safe RL algorithm to solve the bi-level optimization formulation, by jointly performing generative model learning (Section 3.2), soft barrier function learning (Section 3.3), and policy optimization (Section 3.4) via first-order gradients, as shown in Figure 2. We conduct a theoretical analysis of the safety probability of the learned policy in Section 3.5.

3.1. BILEVEL OPTIMIZATION PROBLEM FORMULATION FOR SAFE RL

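We assume that the environment can be abstracted as a finite-horizon continuous MDP $\mathcal{M} \sim (S, A, \mathcal{P}, r, \gamma, \pi)$, in which a deterministic policy $\pi_\theta$ selects the action at time $t$ as $a(t) = \pi_\theta(s(t))$, where $s(t)$ is a random variable at timestep $t$. The environment has several known spaces, i.e., the state space $S$, the initial space $S_0 \subset S$, and the unsafe space $S_u \subset S$. The RL objective is to maximize the total discounted expected return as

$$\max_{\pi_\theta} J := \mathbb{E}_{s(0) \in S_0,\; P(s' \mid s, a)} \left[ \sum_{t=0}^{T} \gamma^t\, r(s(t), a(t)) \right], \quad P \in \mathcal{P}.$$

To encode the hard safety constraint on $S_u$, we formulate a bi-level optimization problem for our framework as follows. We use $\hat{\ }$ to denote the elements related to the generative model.

Definition 1 (Bi-level Optimization Problem for Safe RL)

$$\max_{\theta, \alpha}\; J(\pi_\theta) - \lambda\, \eta^*(\theta, \alpha)^2 - L_g(\tau_\theta, \hat{\tau}_{\theta,\alpha}),$$

where $\eta^*(\theta, \alpha)$ is the optimal objective of a lower-level problem of the generative-model-based soft barrier function, with $\hat{s}$ as the state in the generative model:

$$\begin{aligned}
\min_{\beta}\;\; & \eta \\
\text{s.t.}\;\; & B_\beta(\hat{s}) \ge 0, && \forall \hat{s} \in S, \\
& B_\beta(\hat{s}) \ge 1, && \forall \hat{s} \in S_u, \\
& B_\beta(\hat{s}) \le \eta, && \forall \hat{s} \in S_0, \\
& \mathbb{E}\left[ B_\beta(\hat{s}(t+1)) \mid \hat{s}(t) \right] \le B_\beta(\hat{s}(t)),\;\; \hat{s}(t+1) = \hat{\mathcal{M}}_{\theta,\alpha}(\hat{s}(t)), && \forall t \in [0, T],\; \forall \hat{s} \in S \setminus S_u,
\end{aligned} \tag{1}$$

where $\theta$ is the parameter of policy $\pi$, and $\alpha$ is the parameter of the generative model $\hat{\mathcal{M}}_{\theta,\alpha} = (\hat{G}_\alpha, \hat{\Sigma}_\alpha)$, which is a stochastic differential equation (SDE) with $\hat{G}_\alpha$ as the drift function and $\hat{\Sigma}_\alpha$ as the diffusion function for the stochasticity, as shown later in Equation 2. $\lambda \ge 0$ is a penalty multiplier. We can compute the gradient from $\eta^*(\theta, \alpha)$ for $\pi_\theta$ through $\hat{\mathcal{M}}_{\theta,\alpha}$ with current auto-differentiation tools. This cannot be done through the real environment $\mathcal{M}$, as it is unknown. Therefore, the overall bi-level problem is end-to-end differentiable and can be solved efficiently. Figure 2 shows how the components in our framework interact with each other. The overall algorithm to solve the bi-level problem is shown in Algorithm 1.

Algorithm 1 (main loop)
1: For $k$ in $0, \dots, N$
2:   For $i$ in $0, \dots, M$
3:     Sample processes $\tau^i_\theta$ by policy $\pi_\theta$ with $\mathcal{M}$ and synthetic processes $\hat{\tau}^i_{\theta,\alpha}$ by $\pi_\theta$ with $\hat{\mathcal{M}}_{\theta,\alpha}$.
4:     Compute generative loss function $L_g$ with $\tau^i_\theta$ and $\hat{\tau}^i_{\theta,\alpha}$ as in Equation 3, $\alpha \leftarrow \alpha - \frac{\partial L_g}{\partial \alpha}$.
5:   Compute barrier function loss $L_B$ by sampling synthetic $\hat{\tau}^k_{\theta,\alpha}$ as in Equation 4.
6:   Compute total discounted reward $\hat{J}(\pi_\theta)$ by sampling synthetic $\hat{\tau}^k_{\theta,\alpha}$ as in Equation 5.
7:   $\theta \leftarrow \theta - \frac{\partial L_B}{\partial \theta} + \frac{\partial \hat{J}}{\partial \theta}$, $\beta \leftarrow \beta - \frac{\partial L_B}{\partial \beta}$.

Next, we introduce the details of each module.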

3.2. GENERATIVE MODELING

The role of the generative model in our framework is twofold: (1) Because the barrier function requires an environment model to encode the hard safety constraints, the generative model serves as a surrogate model to build this barrier function, where $\eta^*(\theta, \alpha)$ propagates the gradient to $\pi_\theta$ through $\hat{\mathcal{M}}_{\theta,\alpha}$ for improving system safety. (2) The generative model can generate synthetic processes (trajectories) $\hat{\tau}_{\theta,\alpha}$ to optimize the performance of the policy efficiently by gradient propagation. We learn the generative model $\hat{\mathcal{M}}_{\theta,\alpha}$ as a discrete-time SDE to capture the dynamics and stochasticity of the environment and serve as a base for the construction of the soft barrier function:

$$\hat{\mathcal{M}}_{\theta,\alpha}: \quad \hat{s}(t+1) = \hat{G}_\alpha\big(\hat{s}(t), \pi_\theta(\hat{s}(t))\big) + \hat{\Sigma}_\alpha(\hat{s}(t))\, W(t), \tag{2}$$

where $\hat{G}_\alpha: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is an unknown drift function, $\hat{\Sigma}_\alpha: \mathbb{R}^n \to \mathbb{R}^{n \times d}$ is an unknown $n \times d$ matrix that depends on $\hat{s}$, and $W(t) \in \mathbb{R}^d$ is the Brownian motion (also known as the Wiener process) with dimension $d$, encoding the stochasticity. When the environment is deterministic, we can simply set $\hat{\Sigma}_\alpha(\hat{s})$ to 0. We design the generative model to share the learning control policy with the real environment, as shown in Figure 2. For inference, the generative model starts from a sample $\hat{s}(0) \in S_0$ and rolls out with the drift function $\hat{G}_\alpha$, the diffusion function $\hat{\Sigma}_\alpha$, and the policy $\pi_\theta$. The computation graph therefore contains the learning policy, so auto-differentiation tools can obtain the gradient for the policy by back-propagating through the generative model.

Remark 1 We use fully-connected neural networks to encode such an SDE. Due to the continuity of the neural networks, this SDE specification requires the environment dynamics to be continuous and smooth. Therefore, our approach cannot handle hybrid dynamics with jump conditions such as the contact dynamics in MuJoCo and Safety Gym. Such an assumption is not uncommon, as learning discontinuous dynamics remains an open problem (Parmar et al., 2021; Pfrommer et al., 2021).
Training the generative model minimizes the following loss function:

$$\min_\alpha L_g(\tau_\theta, \hat{\tau}_{\theta,\alpha}) = \min_\alpha\; -\sum_{t=0}^{T} \log P\Big( s(t) \;\Big|\; \mathcal{N}\big(\hat{s}(t), \hat{\Sigma}_\alpha(\hat{s}(t))\big) \Big), \tag{3}$$

where $L_g$ is the maximum likelihood loss and $P\big( s(t) \mid \mathcal{N}(\hat{s}(t), \hat{\Sigma}_\alpha(\hat{s}(t))) \big)$ is the likelihood of the observed $s(t)$ under the normal distribution of the SDE representation. We use torchsde (Li et al., 2020b) to fit the data $\tau_\theta = \{s(0), s(1), \dots, s(T)\}$ to the generative model by updating its parameter $\alpha$, as shown in Lines 2 to 4 of Algorithm 1.
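As a rough illustration of Equations 2 and 3, the sketch below rolls out a scalar discrete-time SDE and evaluates the Gaussian negative log-likelihood of an observed trajectory against a synthetic one. It is a simplified stand-in, not the torchsde-based implementation: the scalar state, the hand-written linear `drift`, the constant `diffusion` (treated here as a standard deviation), the linear `policy`, and all numeric values are assumptions made for the example.

```python
import math
import random

def rollout(drift, diffusion, policy, s0, horizon, rng):
    """Roll out s(t+1) = drift(s, a) + diffusion(s) * w with w ~ N(0, 1)."""
    traj = [s0]
    for _ in range(horizon):
        s = traj[-1]
        w = rng.gauss(0.0, 1.0)
        traj.append(drift(s, policy(s)) + diffusion(s) * w)
    return traj

def gaussian_nll(observed, synthetic, diffusion):
    """-sum_t log N(s(t); s_hat(t), diffusion(s_hat(t))^2) for scalar states."""
    nll = 0.0
    for s, s_hat in zip(observed, synthetic):
        std = diffusion(s_hat)
        nll += 0.5 * math.log(2 * math.pi * std ** 2)
        nll += (s - s_hat) ** 2 / (2 * std ** 2)
    return nll

# Hypothetical closed-loop system used for both "environment" and "model".
rng = random.Random(0)
policy = lambda s: -0.5 * s
drift = lambda s, a: 0.9 * s + 0.1 * a
diffusion = lambda s: 0.1

observed = rollout(drift, diffusion, policy, s0=1.0, horizon=20, rng=rng)
synthetic = rollout(drift, diffusion, policy, s0=1.0, horizon=20, rng=rng)
loss = gaussian_nll(observed, synthetic, diffusion)
```

In the framework described above, the drift and diffusion would be neural networks and the loss would be minimized over $\alpha$ by gradient descent; this sketch only evaluates the loss for fixed functions.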

3.3. SOFT BARRIER FUNCTION LEARNING

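To encode the hard constraints, we introduce a novel generative-model-based soft barrier function.

Definition 2 (Safety Probability Lower Bound) A safety probability lower bound $1 - \eta$ of the entire trajectory (process) $\tau = \{s(0), s(1), \dots, s(T)\}$ is defined as

$$P\big( s(t) \notin S_u \mid s(0) \in S_0,\; \forall t \in [0, T] \big) \ge 1 - \eta.$$

Definition 3 (Barrier Function for SDE) Given a policy $\pi_\theta$, $B_\beta$ is a generative-model-based soft barrier function for the discrete-time SDE $\hat{\mathcal{M}}_{\theta,\alpha}$ as in Equation 2, if it is twice differentiable and satisfies the constraints of the lower problem in Equation 1.

Lemma 1 (Prajna et al., 2004) Let $B(\hat{s}(t))$ be a supermartingale of the process $\hat{s}(t)$ with $B(\hat{s}) \ge 0$, $\forall \hat{s} \in S$. Then for any $\hat{s}(0) \in S_0$ and $c > 0$,

$$P\Big( \sup_{t \ge 0} B(\hat{s}(t)) \ge c \;\Big|\; \hat{s}(0) \in S_0 \Big) \le \frac{B(\hat{s}(0))}{c}.$$

Theorem 1 With a barrier function as in Definition 3, the generative-model SDE with policy $\pi_\theta$ (Equation 2) has a safety probability lower bound $1 - \eta^*$, where $\eta^*$ is the optimal value of the lower problem of Equation 1, as

$$\forall t \in [0, T], \quad P\big( \hat{s}(t) \notin S_u \mid \hat{s}(0) \in S_0 \big) \ge 1 - \eta^*, \quad \hat{s}(t+1) = \hat{\mathcal{M}}_{\theta,\alpha}(\hat{s}(t)).$$

Proof: With the last condition of the constraints in the lower problem of Equation 1, we have

$$\mathbb{E}\left[ B(\hat{s}(t_2)) \mid \hat{s}(t_1) \right] \le B(\hat{s}(t_1)), \quad \forall\, T \ge t_2 \ge t_1 \ge 0,$$

which indicates that the barrier function $B(\hat{s})$ is a supermartingale. Because $B(\hat{s}) \ge 1$ on $S_u$, any trajectory that enters $S_u$ also reaches a barrier value of at least 1. Then, by Lemma 1 with $c = 1$ and the condition $B(\hat{s}) \le \eta^*$ on $S_0$, we have

$$\begin{aligned}
P\big( \hat{s}(t) \in S_u \text{ for some } t \in [0, T] \mid \hat{s}(0) \in S_0 \big)
&\le P\big( B(\hat{s}(t)) \ge 1 \text{ for some } t \in [0, T] \mid \hat{s}(0) \in S_0 \big) \\
&\le P\Big( \sup_{t \in [0, T]} B(\hat{s}(t)) \ge 1 \;\Big|\; \hat{s}(0) \in S_0 \Big) \\
&\le B(\hat{s}(0)) \le \eta^*.
\end{aligned}$$

Therefore, the safety probability lower bound is $1 - \eta^*$ and Theorem 1 holds. □

We further translate the constraints of the lower problem in Equation 1 into their sample means:

$$\begin{aligned}
\min_{\beta}\;\; & \eta \\
\text{s.t.}\;\; & \frac{1}{N} \sum_{i=1}^{N} B_\beta(\hat{s}^i(0)) \le \eta, && \hat{s}^i(0) \in S_0, \\
& \frac{1}{N} \sum_{i=1}^{N} B_\beta(\hat{s}^i_u) \ge 1, && \hat{s}^i_u \in S_u, \\
& \frac{1}{N} \sum_{i=1}^{N} B_\beta(\hat{s}^i) \ge 0, && \hat{s}^i \in S, \\
& \frac{1}{N} \sum_{i=1}^{N} B_\beta(\hat{s}^i(t+1)) \le B_\beta(\hat{s}^i(t)),\;\; \hat{s}^i(t+1) = \hat{\mathcal{M}}_{\theta,\alpha}(\hat{s}^i(t)), && \forall t \in [0, T],\; \forall \hat{s}^i \in S \setminus S_u.
\end{aligned}$$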
The third (non-negativity) condition is easy to satisfy by setting the output activation function of the barrier neural network to Sigmoid. The last condition makes $B$ a supermartingale, which is the key to deriving the lower bound of safety probability. In practice, we use a supervised-learning-based method to optimize this problem by minimizing the following loss function:

$$\min_{\theta, \beta} L_B = \frac{1}{N} \sum_{i=1}^{N} B_\beta(\hat{s}^i(0)) + \frac{1}{N} \sum_{i=1}^{N} \big( 1 - B_\beta(\hat{s}^i_u) \big) + \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{M} \sum_{j=1}^{M} B_\beta(\hat{s}^{i,j}(t+1)) - B_\beta(\hat{s}^i(t)) \right), \tag{4}$$

$$\hat{s}^{i,j}(t+1) = \hat{\mathcal{M}}_{\theta,\alpha}(\hat{s}^i(t)), \quad t \in [0, T], \quad \hat{s}^i(t) \in S \setminus S_u,$$

where $\hat{s}^{i,j}(t+1)$ is the next state of $\hat{s}^i(t)$ sampled from the generative model $\hat{\mathcal{M}}_{\theta,\alpha}$ with policy $\pi_\theta$. $L_B$ essentially reduces the barrier value on $S_0$ (whose maximum is $\eta^*(\theta, \alpha)$), pushes the barrier value on the unsafe space $S_u$ toward 1 with the Sigmoid output, and decreases the expectation of the barrier function along the trajectory. It is worth noting that $L_B$ cannot be approximated with the real environment $\mathcal{M}$, as we cannot sample the transition from an arbitrary intermediate state $s(t)$ to $s(t+1)$ in the space $S \setminus S_u$ to compute the third sample mean in Equation 4, whereas this is feasible and simple to do with $\hat{\mathcal{M}}_{\theta,\alpha}$ as in Equation 2. The barrier training can be terminated if the second and third sample means in Equation 4 are non-positive. The soft barrier training is shown as Line 5 in Algorithm 1.
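The three sample means in the barrier loss can be sketched as follows for a scalar state. This is only an evaluation of the loss terms for a fixed barrier, not the training procedure: the sigmoid-shaped `barrier`, the contracting one-step model `sample_next`, the unsafe region $s \ge 2$, and all sample values are hypothetical choices made for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def barrier_loss_terms(barrier, init_states, unsafe_states, safe_states,
                       sample_next, m):
    """Return the three sample means that are summed in the barrier loss."""
    # Term 1: barrier value on initial states (should be small).
    init_term = sum(barrier(s) for s in init_states) / len(init_states)
    # Term 2: gap to 1 on unsafe states (should be small).
    unsafe_term = sum(1.0 - barrier(s) for s in unsafe_states) / len(unsafe_states)
    # Term 3: expected one-step change of the barrier on safe states,
    # estimated with m sampled next states per state (should be <= 0).
    decrease_term = 0.0
    for s in safe_states:
        expected_next = sum(barrier(sample_next(s)) for _ in range(m)) / m
        decrease_term += expected_next - barrier(s)
    decrease_term /= len(safe_states)
    return init_term, unsafe_term, decrease_term

rng = random.Random(0)
barrier = lambda s: sigmoid(4.0 * (s - 1.5))                 # assumed barrier shape
sample_next = lambda s: 0.8 * s + 0.05 * rng.gauss(0.0, 1.0)  # assumed model

init_term, unsafe_term, decrease_term = barrier_loss_terms(
    barrier,
    init_states=[-0.1, 0.0, 0.1],
    unsafe_states=[2.0, 2.5, 3.0],
    safe_states=[0.0, 0.5, 1.0, 1.5, 1.9],
    sample_next=sample_next,
    m=200,
)
loss_b = init_term + unsafe_term + decrease_term
```

Because the assumed model contracts toward the origin, the third term comes out negative here, which matches the supermartingale condition that the loss is meant to encourage.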

3.4. POLICY OPTIMIZATION

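As stated before, we use the generative model to generate synthetic data $\hat{\tau}^i_{\theta,\alpha} = \{\hat{s}^i(0), \dots, \hat{s}^i(T)\}$ ($i \in [1, N]$) with policy $\pi_\theta$ to maximize the total expected return $\hat{J}(\pi_\theta)$ as:

$$\max_{\pi_\theta} \hat{J}(\pi_\theta) = \mathbb{E}_{\hat{s}(0),\, \hat{\mathcal{M}}_{\theta,\alpha}} \left[ \sum_{t=0}^{T} \gamma^t\, r\big(\hat{s}(t), \pi_\theta(\hat{s}(t))\big) \right], \quad \text{s.t. } \hat{s}(t+1) = \hat{\mathcal{M}}_{\theta,\alpha}(\hat{s}(t)),\; \forall t \in [0, T].$$

We use the sample mean over the synthetic trajectories as an estimate of the expectation:

$$\max_{\pi_\theta} \hat{J}(\pi_\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \gamma^t\, r\big(\hat{s}^i(t), \pi_\theta(\hat{s}^i(t))\big), \quad \text{s.t. } \hat{s}^i(t+1) = \hat{\mathcal{M}}_{\theta,\alpha}(\hat{s}^i(t)),\; \forall t \in [0, T]. \tag{5}$$

With policy $\pi_\theta$ in the forward computation graph of $\hat{\mathcal{M}}_{\theta,\alpha}$, we can directly obtain the backward gradient for $\pi_\theta$ from Equation 5. The policy optimization is shown as Line 6 in Algorithm 1.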
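The sample-mean estimate of the discounted return can be written in a few lines. The sketch below only evaluates the estimate from already-collected per-step rewards; the function names and the reward values are assumptions for illustration, and the gradient propagation through the generative model is not shown.

```python
def discounted_return(rewards, gamma):
    """sum_t gamma^t * r_t for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_objective(reward_trajectories, gamma):
    """Average discounted return over N synthetic trajectories."""
    n = len(reward_trajectories)
    return sum(discounted_return(r, gamma) for r in reward_trajectories) / n

# Two made-up reward sequences standing in for synthetic rollouts.
reward_trajectories = [[1.0, 1.0, 1.0], [0.0, 2.0, 0.0]]
j_hat = estimate_objective(reward_trajectories, gamma=0.5)
print(j_hat)  # (1.75 + 1.0) / 2 = 1.375
```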

3.5. THEORETICAL ANALYSIS OF SAFETY PROBABILITY BY SOFT BARRIER

For the final learned policy, we conduct a theoretical analysis of its safety probability (as defined in Definition 2), derived from the generative-model-based soft barrier function in our framework.

Lemma 2 (Theorem 21 in Agarwal et al., 2020a) Given $\delta \in (0, 1)$ and a learned deterministic policy $\pi_\theta(s)$, assume the environment-policy transition dynamics $P^*(s' \mid s) \in \mathcal{P}$ with the function class $|\mathcal{P}| < \infty$ ($s'$ represents the next state of $s$). Let the environment and policy generate a dataset of $n$ trajectories $D := \{(s_j(t), s_j(t+1))\}_{t=0}^{T}$ ($j = 1, \dots, n$), with $s(t) \sim D_t = (s_j(0 : t-1))$. Note that $D_t$ is a martingale depending on the previous examples. Let the generative model $\hat{\mathcal{M}}_{\theta,\alpha}$ maximize the likelihood of the dataset by its transition dynamics $\hat{P}$ via Equation 3. Then with probability at least $1 - \delta$, the expectation of the total variation distance between $P^*$ and $\hat{P}$ is bounded as:

$$\sum_{t=0}^{T} \mathbb{E}_{s \sim D_t}\, d_{TV}(P^*, \hat{P}) = \sum_{t=0}^{T} \mathbb{E}_{s \sim D_t} \left\| \hat{P}(s' \mid s) - P^*(s' \mid s) \right\|^2_{TV} \le \frac{2 \log(|\mathcal{P}| / \delta)}{n}. \tag{6}$$

Lemma 3 (proof provided in Appendix A) Given a random variable $X_n \ge 0$ on a probability space $\Omega$, if $\mathbb{E}_\Omega[X_n] \to 0$ as $n \to \infty$, then $P(X_n = 0) \to 1$.

Proposition 1 (Asymptotic Lower Bound of Safety Probability) Given the learned policy $\pi_\theta$, let the generative model fit $n$ sample trajectories $\tau^i_\theta$ ($i = 1, \dots, n$) from environment $\mathcal{M}$ with $\pi_\theta$ by Equation 3, learn the generative-model-based soft barrier function $B_\beta$ by Equation 4 with $\eta^*$, and assume that it formally satisfies the constraints in Equation 1. Then the real environment $\mathcal{M}$ with policy $\pi_\theta$ is safe with probability at least $1 - \eta^*$ when $n \to \infty$.
Proof of Proposition 1: Given $(S, \mathcal{B})$ as the measure spaces with $S$ as the state space and $\mathcal{B} = \{B : S \to \mathbb{R},\; \|B\|_\infty \le 1\}$, where $B$ is a generative-model-based soft barrier function with Sigmoid output, then according to the definition of total variation distance and Lemma 2, we have

$$\sum_{t=0}^{T} \mathbb{E}_{s \sim D_t}\, d_{TV}(P^*, \hat{P}) = \sum_{t=0}^{T} \mathbb{E}_{s \sim D_t}\, \frac{1}{2} \sup_{B \in \mathcal{B}} \left| \mathbb{E}_{P^*(s' \mid s)}[B(s')] - \mathbb{E}_{\hat{P}(s' \mid s)}[B(s')] \right| \le \frac{2 \log(|\mathcal{P}| / \delta)}{n}.$$

When $n \to \infty$, set $\delta = \frac{1}{n}$ and let $X_n = \frac{1}{2} \sup_{B \in \mathcal{B}} \left| \mathbb{E}_{P^*(s' \mid s)}[B(s')] - \mathbb{E}_{\hat{P}(s' \mid s)}[B(s')] \right|$, so that $\mathbb{E}[X_n] \to 0$. We know $X_n \ge 0$, with $X_n = 0$ when $P^* = \hat{P}$. Therefore, according to Lemma 3, $P(X_n = 0) \to 1$. We then assume that $D_t$ ($t \in [0, T]$) can uniformly cover the space $S$ as $n \to \infty$; thus the soft barrier becomes a true barrier function for the real environment and Proposition 1 holds. □

Remark 2 (Practical Safety Probability Lower Bound) In addition to the asymptotic safety probability, we propose a finite-sample practical safety probability lower bound. We first sample the generative model and the environment with the final learned policy to quantify their maximum distance per state as $\Delta = \max_{t \in [0, T],\; i = 1, \dots, N} |s^i(t) - \hat{s}^i(t)|$, and then enlarge the unsafe region with $\Delta$ by Minkowski sum as $S'_u = S_u \oplus \Delta$. Next, we retrain another generative-model-based soft barrier function $B$ with $S'_u$. Finally, we conservatively report $1 - \max_{\hat{s}^i_t \in \hat{\tau}^i,\; i = 1, \dots, N} B(\hat{s}^i_t)$ as the final lower bound of safety probability by the soft barrier function.

Remark 3 (During-learning Safety) The above asymptotic and practical safety bounds are derived for the final learned policy. It is possible that $1 - \eta^*$ is not a valid safety probability bound during learning, as there exists a modeling gap between the generative model and the real environment. However, we optimize $1 - \eta^*$ during learning to increase the chance of finding safer learned policies at the end, as demonstrated in our experiments below.
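The finite-sample procedure in Remark 2 can be sketched for a scalar state as below. The sketch skips the retraining step and plugs in a fixed, hypothetical sigmoid barrier in its place; the interval-shaped unsafe region, the paired trajectories, and all numbers are likewise assumptions for illustration.

```python
import math

def max_state_gap(real_trajs, synth_trajs):
    """Delta: largest per-step distance between paired real/synthetic states."""
    return max(abs(s - s_hat)
               for real, synth in zip(real_trajs, synth_trajs)
               for s, s_hat in zip(real, synth))

def inflate_interval(unsafe, delta):
    """Minkowski sum of a 1-D interval [lo, hi] with [-delta, delta]."""
    lo, hi = unsafe
    return (lo - delta, hi + delta)

def practical_safety_bound(barrier, synth_trajs):
    """Report 1 - max B over every state on the synthetic trajectories."""
    return 1.0 - max(barrier(s) for traj in synth_trajs for s in traj)

# Made-up paired rollouts from the environment and the generative model.
real_trajs = [[0.0, 0.10, 0.20], [0.0, 0.30, 0.25]]
synth_trajs = [[0.0, 0.15, 0.10], [0.0, 0.28, 0.30]]

delta = max_state_gap(real_trajs, synth_trajs)
inflated_unsafe = inflate_interval((2.0, 3.0), delta)

# Stand-in for the barrier retrained on the inflated unsafe set (assumed shape).
barrier = lambda s: 1.0 / (1.0 + math.exp(-4.0 * (s - 1.5)))
bound = practical_safety_bound(barrier, synth_trajs)
```

For vector-valued states, the absolute value would be replaced by a norm and the interval inflation by a set-valued Minkowski sum.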

4. EXPERIMENTAL RESULTS

Experiment Settings and Examples: We compare our approach with two state-of-the-art open-source CMDP-based methods, PPO-L (Ray et al., 2019) and FOCOPS (Zhang et al., 2020). For these two baselines, we design the cost function such that a state is safe if its cost is less than 0. It is worth noting that PPO-L has a stronger safety constraint than FOCOPS, as we implemented PPO-L with a constraint on the expectation of cost per state, $\mathbb{E}[c(s, a)] \le 0$, rather than on the cumulative cost as in FOCOPS, $\mathbb{E}\left[\sum_{t=0}^{T} c(s, a)\right] \le D'$. We mark the safety-oriented version of FOCOPS, which uses a conservative threshold $D'$, as FOCOPS*. We mainly compare the converged final policy of each method in system safe rate measured via simulations, which we call the empirical safe rate.

Table 1: Comparison of our approach with CMDP-based baselines PPO-L and FOCOPS*. $s_e$ is the safe rate obtained by simulating 500 random initial states from $S_0$. $1 - \eta$ is the practical lower bound of safety probability in our approach, computed as $1 - \max_{\hat{s}^i_t \in \hat{\tau}^i,\; i = 1, \dots, n} B(\hat{s}^i_t)$ and derived by Remark 2. Our approach achieves significantly higher $s_e$ than the baselines. It is observed that $1 - \eta$ is a lower bound of $s_e$.

Note that learning a safe control policy for high-dimensional systems is quite challenging. Current state-of-the-art works on certificate-based policy learning mainly focus on low-dimensional systems with fewer than 6 state dimensions (Luo & Ma, 2021; Lindemann et al., 2021; Chang et al., 2019; Berkenkamp et al., 2017). In this paper, we test our approach on 13-dimensional UAV and Rocket examples.


2-Dimensional SDE (Prajna et al., 2004) has the unknown dynamics $\mathcal{M}$ as $\dot{s}_1 = 0.8 s_2$, $ds_2 = (a - 0.3 s_1^3)\,dt + 0.2\,dW(t)$, where $W(t)$ is a Wiener process. The initial space is $S_0 = \{(s_1 + 2)^2 + s_2 \le 0.01\}$, and the unsafe space is $S_u = \{s_1 \in [-1, 0],\; s_2 \in [1.2, 1.7]\}$. The goal is to stabilize the system near $(0, 0)$.

Cartpole Balancing (Brockman et al., 2016) has a 4-dimensional vector $s = [x, \theta, \dot{x}, \dot{\theta}]$ as the system state, where $x$ is the position and $\theta$ is the angular error to the upright. The initial space is $S_0 = \{(x, \theta, \dot{x}, \dot{\theta}) \mid x \in [-0.167, 0.033],\; \theta \in [-0.6, -0.5],\; \dot{x} = -0.35,\; \dot{\theta} = 0.53\}$, and the unsafe space is $S_u = \{(x, \theta, \dot{x}, \dot{\theta}) \mid x \le -0.75\}$. The goal is to keep the cartpole balanced upright.

Powered Rocket Landing (Jin et al., 2021) has 6 DoF (degrees of freedom) with 13 system states and 3 action variables. The goal is to land the rocket close to the origin while avoiding an unsafe region. Its state vector is $s = [p\; v\; q\; \omega] \in \mathbb{R}^{13}$, where $p = (x, y, z) \in \mathbb{R}^3$ and $v = (v_x, v_y, v_z) \in \mathbb{R}^3$ represent the position and velocity of the rocket, respectively, $q \in \mathbb{R}^4$ is the unit quaternion for attitude, and $\omega \in \mathbb{R}^3$ is the angular velocity with respect to the inertial frame. The control input consists of three thrust forces, $u = [T_x, T_y, T_z] \in \mathbb{R}^3$. The initial space is $S_0 = \{p = (x, y, z) \mid (x - 10)^2 + (y + 8)^2 + (z - 5)^2 \le 0.01,\; v = 0,\; q = (0.73, 0, 0, 0.68),\; \omega = 0\}$, and the unsafe space is $S_u = \{p = (x, y, z) \mid (x - 5)^2 + y^2 \le 1,\; -2 \le z \le 5,\; \|v\|_1 \le 10,\; \|\omega\|_1 \le 10\}$.

UAV Maneuvering (Jin et al., 2021) is to maneuver a UAV close to the origin while avoiding an obstacle. The 6-DoF UAV has 13 system states and 4 action variables. Its state vector is $s = [p\; v\; q\; \omega] \in \mathbb{R}^{13}$, the same as in the Rocket example above. The control input $u = [T_1, T_2, T_3, T_4] \in \mathbb{R}^4$ corresponds to the four rotating propellers of the quadrotor. The initial space is $S_0 = \{p = (x, y, z) \mid (x + 8)^2 + (y + 6)^2 + (z - 9)^2 \le 0.01,\; v = 0,\; q = (1, 0, 0, 0),\; \omega = 0\}$, and the unsafe space is $S_u = \{p = (x, y, z) \mid (x + 4.5)^2 + (y + 4)^2 \le 1,\; -2 \le z \le 5\}$.
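To give a feel for how an empirical safe rate can be measured, the following sketch simulates the 2-dimensional SDE example with an Euler-Maruyama discretization and counts the runs that never enter $S_u$. Several pieces are assumptions and not part of the paper: the step size `dt`, the number of steps, the hand-picked linear feedback `policy` (standing in for a learned policy), and the small box around $(-2, 0)$ used to sample initial states.

```python
import math
import random

def simulate(policy, s0, dt, steps, rng):
    """Euler-Maruyama rollout of ds1 = 0.8*s2 dt, ds2 = (a - 0.3*s1^3) dt + 0.2 dW."""
    s1, s2 = s0
    traj = [(s1, s2)]
    for _ in range(steps):
        a = policy(s1, s2)
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment over dt
        s1, s2 = (s1 + 0.8 * s2 * dt,
                  s2 + (a - 0.3 * s1 ** 3) * dt + 0.2 * dw)
        traj.append((s1, s2))
    return traj

def is_unsafe(s1, s2):
    """Unsafe set of the 2-D example: s1 in [-1, 0] and s2 in [1.2, 1.7]."""
    return -1.0 <= s1 <= 0.0 and 1.2 <= s2 <= 1.7

def empirical_safe_rate(policy, n_runs, dt, steps, seed):
    """Fraction of simulated runs that never visit the unsafe set."""
    rng = random.Random(seed)
    safe_runs = 0
    for _ in range(n_runs):
        # Assumed initial-state sampling: a small box around (-2, 0).
        s0 = (-2.0 + rng.uniform(-0.05, 0.05), rng.uniform(-0.05, 0.05))
        traj = simulate(policy, s0, dt, steps, rng)
        if not any(is_unsafe(x1, x2) for x1, x2 in traj):
            safe_runs += 1
    return safe_runs / n_runs

# Hypothetical stabilizing feedback, used only to exercise the simulator.
policy = lambda s1, s2: -1.0 * s1 - 2.0 * s2
rate = empirical_safe_rate(policy, n_runs=50, dt=0.01, steps=1000, seed=0)
```

The experiments reported in this section use 500 simulations per example; the smaller run count here only keeps the sketch fast.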
Comparison and Effectiveness of Our Approach: Table 1 shows the comparison results in simulation-based system safe rate (based on 500 simulations for each example, with random initial states), safety probability, and performance. We can see that by directly enforcing hard safety constraints via soft barrier functions, our approach achieves a significantly higher system safe rate than the CMDP-based baselines. Our approach also provides a practical lower bound of safety probability, which the CMDP-based methods cannot provide. The CMDP-based baselines achieve better performance (total reward return) in some cases, but we view safety as the first priority for these systems and the focus of this work. Figure 3 shows the control trajectories of the learned policies from our approach and the baselines. The agent is always safe with our learned policy, while there exist unsafe cases with both PPO-L and FOCOPS*. Moreover, our generative model behaves very similarly to the real environment, which shows the usefulness of the generative modeling for constructing the soft barrier function and optimizing the policy.

Figure 3: Control trajectories of the learned policies from our approach and the baselines. "Gene" indicates the synthetic trajectory from the final learned generative model, which behaves very similarly to the real environment with the "Ours" policy, showing its effectiveness for barrier function construction. We can see that our approach learns safer policies than the baselines.

Limitations: As stated earlier, one key assumption of this work is the smoothness and continuity of the system behavior, which prevents its application to hybrid dynamics with jump conditions such as the contact dynamics in MuJoCo and Safety Gym. One possible solution is to learn an ensemble generative model as a hybrid system to deal with those discontinuous contact dynamics, and we plan to explore it in future work.
Another limitation of our framework is the computational cost of the generative model (e.g., it takes around 8 hours to learn a policy for the Cartpole example and 1 day for the UAV and Rocket examples). In future work, we plan to improve the efficiency of this part by exploring techniques such as Continuous Latent Process Flows (CLPF) (Deng et al., 2021).

5. CONCLUSION

We present a safe RL approach for unknown stochastic environments that enforces hard reachability-based safety constraints through generative-model-based soft barrier functions. Our approach formulates a novel bi-level optimization problem, and develops a safe RL algorithm that jointly performs generative model learning, soft barrier function learning, and policy optimization. Experiments demonstrate that our approach can significantly improve the empirical system safe rate over CMDP-based baselines and also provide a practical lower bound of safety probability.

Proof of Lemma 3: For any $m \in \mathbb{N}$, let $E_m = \{\omega \in \Omega : X_n(\omega) > \frac{1}{m}\}$. Since $X_n \ge 0$, we have:

$$\mathbb{E}_\Omega[X_n] = \int_\Omega X_n \, dP \ge \int_{E_m} X_n \, dP \ge \frac{1}{m} P(E_m).$$

Therefore, $P(E_m) \to 0$, and then:



In the finite-horizon continuous MDP $\mathcal{M} \sim (S, A, \mathcal{P}, r, \gamma, \pi)$, $S \subset \mathbb{R}^n$ represents the continuous state space, $A \subset \mathbb{R}^m$ indicates the continuous action space, and the function class $\mathcal{P} : S \times A \times S \to [0, 1]$ denotes the unknown continuous and smooth stochastic environment dynamics without jump conditions. The reward function $r(s, a) : S \times A \to \mathbb{R}$ is known and the discount factor is $\gamma \in [0, 1]$. A deterministic continuous NN-based policy $\pi_\theta : S \to A$ maps the state $s(t) \in S$ to an action $a(t) \in A$ at time $t$.

Algorithm 1 Safe RL with the Generative-model-based Soft Barrier Function
Input: Unknown environment $\mathcal{M}$, initial policy $\pi_\theta$, generative model $\hat{\mathcal{M}}_{\theta,\alpha}$, barrier network $B_\beta$.
Parameter: $[\theta, \alpha, \beta]$.
Output: Policy $\pi_\theta$ with $\hat{\mathcal{M}}_{\theta,\alpha}$-based soft barrier function $B_\beta$.

$\tau_\theta := \{s(0), s(1), \dots, s(T)\}$ and $\hat{\tau}_{\theta,\alpha} := \{\hat{s}(0), \hat{s}(1), \dots, \hat{s}(T)\}$ are the sampled realizations of stochastic processes (trajectories) from the environment and from the generative model under the policy $\pi_\theta$, respectively. $\beta$ is the parameter of the generative-model-based soft barrier function $B_\beta : \mathbb{R}^n \to \mathbb{R}_+$. We encode the hard safety constraint by the generative-model-based soft barrier function $B_\beta$ in the lower problem, which minimizes $\eta^*(\theta, \alpha)$ as the upper bound of the unsafe probability for $\hat{\mathcal{M}}_{\theta,\alpha}$ (Section 3.3). The upper problem aims to optimize the policy's expected return $J(\pi_\theta)$ and learn the generative model by the maximum likelihood loss $L_g(\tau_\theta, \hat{\tau}_{\theta,\alpha})$ between the processes $\tau_\theta$ and $\hat{\tau}_{\theta,\alpha}$, as shown later in Equation 3. Moreover, the upper problem penalizes $\eta^*(\theta, \alpha)$, which can back-propagate the gradient information through $\hat{\mathcal{M}}_{\theta,\alpha}$ to $\pi_\theta$ to push the agent to avoid $S_u$ in the environment MDP $\mathcal{M}$ as much as possible, provided that $\hat{\mathcal{M}}_{\theta,\alpha}$ behaves similarly to $\mathcal{M}$.

For the cumulative-cost constraint $\mathbb{E}\left[\sum_{t=0}^{T} c(s, a)\right] \le D'$ in FOCOPS, we conservatively set $D' = -60$ for the 2D and Cartpole examples below, and $D' = -200$ for the Rocket and UAV examples, to improve its safety.

Figure 5: Barrier function training and testing in the UAV example.

Figure 6: Barrier function training and testing in Rocket powered landing.

Figure 7: Barrier function training and testing in Cartpole balancing.

$$0 \le P\big(\omega \in \Omega : X_n(\omega) \ne 0\big) = P\Big( \bigcup_{m} E_m \Big) = \lim_{m \to \infty} P(E_m) \to 0,$$

$$P\big(\omega \in \Omega : X_n(\omega) \ne 0\big) \to 0 \implies P\big(\omega \in \Omega : X_n(\omega) = 0\big) \to 1.$$

A.2 ADDITIONAL EXPERIMENTAL RESULTS

The barrier function training and testing results for the Rocket powered landing and the Cartpole balancing examples are shown here in Figures 6 and 7.


