ON THE ROBUSTNESS OF SAFE REINFORCEMENT LEARNING UNDER OBSERVATIONAL PERTURBATIONS

Abstract

Safe reinforcement learning (RL) trains a policy to maximize the task reward while satisfying safety constraints. While prior works focus on performance optimality, we find that the optimal solutions of many safe RL problems are neither robust nor safe against carefully designed observational perturbations. We formally analyze the unique properties of designing effective observational adversarial attackers in the safe RL setting. We show that baseline adversarial attack techniques for standard RL tasks are not always effective for safe RL and propose two new approaches: one maximizes the cost and the other maximizes the reward. One interesting and counter-intuitive finding is that the maximum reward attack is strong, as it can both induce unsafe behaviors and make the attack stealthy by maintaining the reward. We further propose a robust training framework for safe RL and evaluate it via comprehensive experiments. This paper provides pioneering work on investigating the safety and robustness of RL under observational attacks for future safe RL studies. Code is available at: https://github.com/liuzuxin/safe-rl-robustness

1. INTRODUCTION

Despite the great success of deep reinforcement learning (RL) in recent years, it is still challenging to ensure safety when deploying learned policies in the real world. Safe RL tackles this problem by solving a constrained optimization that maximizes the task reward while satisfying safety constraints (Brunke et al., 2021), which has been shown to be effective in learning a safe policy in many tasks (Zhao et al., 2021; Liu et al., 2022; Sootla et al., 2022b). The success of recent safe RL approaches leverages the power of neural networks (Srinivasan et al., 2020; Thananjeyan et al., 2021). However, neural networks are known to be vulnerable to adversarial attacks: a small perturbation of the input data may lead to a large change in the output (Machado et al., 2021; Pitropakis et al., 2019), which raises a concern when deploying a neural network RL policy in safety-critical applications (Akhtar & Mian, 2018). While many recent safe RL methods with deep policies can achieve outstanding constraint satisfaction in noise-free simulation environments, their vulnerability under adversarial perturbations has not been studied in the safe RL setting. We consider observational perturbations that commonly exist in the physical world, such as unavoidable sensor errors and upstream perception inaccuracy (Zhang et al., 2020a). Several recent works on observationally robust RL have shown that deep RL agents can be attacked via sophisticated observation perturbations, drastically decreasing their rewards (Huang et al., 2017; Zhang et al., 2021). However, the robustness concepts and adversarial training methods of standard RL may not be suitable for safe RL because of an additional metric that characterizes the cost of constraint violations (Brunke et al., 2021). The cost should matter more than the reward, since any constraint violation could be fatal and unacceptable in the real world (Berkenkamp et al., 2017).
For example, consider the autonomous vehicle navigation task where the reward encourages reaching the goal as fast as possible and the safety constraint is to not collide with obstacles. Sacrificing some reward is not comparable to violating the constraint, because the latter may cause catastrophic consequences. However, we find little research formally studying robustness in the safe RL setting with adversarial observation perturbations, while we believe this should be an important aspect of the safe RL area, because a policy that is vulnerable under adversarial attacks cannot be regarded as truly safe in the physical world. We aim to address the following questions in this work: 1) How vulnerable is a learned RL agent under observational adversarial attacks? 2) How can we design effective attackers in the safe RL setting? 3) How can we obtain a robust policy that maintains safety even under worst-case perturbations? To answer them, we formally define the observationally robust safe RL problem and discuss how to evaluate the adversary and the robustness of a safe RL policy. We also propose two strong adversarial attacks that can induce the agent to perform unsafe behaviors and show that adversarial training can help improve the robustness of constraint satisfaction. We summarize the contributions as follows. 1. We formally analyze the policy vulnerability in safe RL under observational corruptions, investigate the observational-adversarial safe RL problem, and show that the optimal solutions of safe RL problems are vulnerable under observational adversarial attacks. 2. We find that existing adversarial attacks focusing on minimizing agent rewards do not always work, and propose two effective attack algorithms with theoretical justifications: one directly maximizes the cost, and the other maximizes the task reward to induce a tempting but risky policy. Surprisingly, the maximum reward attack is very strong in inducing unsafe behaviors, both in theory and practice.
We believe this property has been overlooked because maximizing reward is the optimization goal of standard RL, yet it leads to risky and stealthy attacks on safety constraints. 3. We propose an adversarial training algorithm with the proposed attackers and show the contraction properties of their Bellman operators. Extensive experiments on continuous control tasks show that our method is more robust against adversarial perturbations in terms of constraint satisfaction.

2. RELATED WORK

Safe RL. One type of approach utilizes domain knowledge of the target problem to improve the safety of an RL agent, such as designing a safety filter (Dalal et al., 2018; Yu et al., 2022), assuming a sophisticated system dynamics model (Liu et al., 2020; Luo & Ma, 2021; Chen et al., 2021), or incorporating expert interventions (Saunders et al., 2017; Alshiekh et al., 2018). The Constrained Markov Decision Process (CMDP) is a commonly used framework to model the safe RL problem, which can be solved via constrained optimization techniques (Garcıa & Fernández, 2015; Gu et al., 2022; Sootla et al., 2022a; Flet-Berliac & Basu, 2022). The Lagrangian-based method is a generic constrained optimization algorithm for solving a CMDP, which introduces additional Lagrange multipliers to penalize constraint violations (Bhatnagar & Lakshmanan, 2012; Chow et al., 2017; As et al., 2022). The multiplier can be optimized via gradient descent together with the policy parameters (Liang et al., 2018; Tessler et al., 2018), and can be easily incorporated into many existing RL methods (Ray et al., 2019). Another line of work approximates the non-convex constrained optimization problem with low-order Taylor expansions and then obtains the dual variable via convex optimization (Yu et al., 2019; Yang et al., 2020; Gu et al., 2021; Kim & Oh, 2022). Since the constrained optimization-based methods are more general, we focus our discussion of safe RL on them. Robust RL. The robustness definition in the RL context has many interpretations (Sun et al., 2021; Moos et al., 2022; Korkmaz, 2023), including robustness against action perturbations (Tessler et al., 2019), reward corruptions (Wang et al., 2020; Lin et al., 2020; Eysenbach & Levine, 2021), domain shift (Tobin et al., 2017; Muratore et al., 2018), and dynamics uncertainty (Pinto et al., 2017; Huang et al., 2022).
The most closely related works investigate the observational robustness of an RL agent under observational adversarial attacks (Zhang et al., 2020a; 2021; Liang et al., 2022; Korkmaz, 2022). It has been shown that neural network policies can be easily attacked by adversarial observation noise, leading to much lower rewards than the optimal policy (Huang et al., 2017; Kos & Song, 2017; Lin et al., 2017; Pattanaik et al., 2017). However, most robust RL approaches model the attack and defense with respect to the reward, while robustness regarding safety, i.e., constraint satisfaction for safe RL, has not been formally investigated.

3. OBSERVATIONAL ADVERSARIAL ATTACK FOR SAFE RL

3.1 MDP, CMDP, AND THE SAFE RL PROBLEM

An infinite-horizon Markov Decision Process (MDP) is defined by the tuple (S, A, P, r, γ, µ0), where S is the state space, A is the action space, P : S × A × S → [0, 1] is the transition kernel that specifies the transition probability p(s_{t+1}|s_t, a_t) from state s_t to s_{t+1} under action a_t, r : S × A × S → R is the reward function, γ ∈ [0, 1) is the discount factor, and µ0 : S → [0, 1] is the initial state distribution. We study safe RL under the Constrained MDP (CMDP) framework M := (S, A, P, r, c, γ, µ0), with an additional element c : S × A × S → [0, C_m] that characterizes the cost for violating the constraint, where C_m is the maximum cost (Altman, 1998). We denote a safe RL problem as M^κ_Π, where Π : S × A → [0, 1] is the policy class and κ ∈ [0, +∞) is the cost threshold. Let π(a|s) ∈ Π denote the policy and τ = {s_0, a_0, ...} denote the trajectory. We use the shorthand f_t = f(s_t, a_t, s_{t+1}), f ∈ {r, c}, for simplicity. The value function is V^π_f(µ0) = E_{τ∼π, s_0∼µ0}[Σ_{t=0}^∞ γ^t f_t], the expectation of the discounted return under the policy π and the initial state distribution µ0. We overload the notation V^π_f(s) = E_{τ∼π, s_0=s}[Σ_{t=0}^∞ γ^t f_t] to denote the value function with initial state s_0 = s, and denote Q^π_f(s, a) = E_{τ∼π, s_0=s, a_0=a}[Σ_{t=0}^∞ γ^t f_t] as the state-action value function under the policy π. The objective of M^κ_Π is to find the policy that maximizes the reward while limiting the cost under the threshold κ:

max_π V^π_r(µ0), s.t. V^π_c(µ0) ≤ κ. (1)

We then define feasibility, optimality, and temptation to better describe the properties of a safe RL problem M^κ_Π. A figure illustration of one example is shown in Fig. 1. Note that although the temptation concept naturally exists in many safe RL settings under the CMDP framework, we did not find formal descriptions or definitions of it in the literature. Definition 1. Feasibility.
The feasible policy class is the set of policies that satisfy the constraint with threshold κ: Π^κ_M := {π(a|s) : V^π_c(µ0) ≤ κ, π ∈ Π}. A feasible policy should satisfy π ∈ Π^κ_M. Definition 2. Optimality. A policy π* is optimal in the safe RL context if 1) it is feasible: π* ∈ Π^κ_M; and 2) no other feasible policy has a higher reward return: ∀π ∈ Π^κ_M, V^{π*}_r(µ0) ≥ V^π_r(µ0). We denote π* as the optimal policy. Note that optimality is defined w.r.t. the reward return within the feasible policy class Π^κ_M rather than the full policy space Π, which means that policies with a higher reward return than π* may exist in a safe RL problem due to the constraint. We formally define them as tempting policies because they are rewarding but unsafe: Definition 3. Temptation. We define the tempting policy class as the set of policies that have a higher reward return than the optimal policy: Π^T_M := {π(a|s) : V^π_r(µ0) > V^{π*}_r(µ0), π ∈ Π}. A tempting safe RL problem has a non-empty tempting policy class: Π^T_M ≠ ∅. We show that all tempting policies are infeasible (proved by contradiction in Appendix A.1): Lemma 1. The tempting policy class and the feasible policy class are disjoint: Π^T_M ∩ Π^κ_M = ∅. Namely, all tempting policies violate the constraint: ∀π ∈ Π^T_M, V^π_c(µ0) > κ. The existence of tempting policies is a unique feature, and one of the major challenges, of safe RL, since the agent needs to maximize the reward carefully to avoid being tempted. One can always tune the threshold κ to change the temptation status of a safe RL problem with the same CMDP. In this paper, we only consider solvable tempting safe RL problems, because otherwise a non-tempting safe RL problem M^κ_Π can be reduced to a standard RL problem: an optimal policy could be obtained by maximizing the reward without considering the constraint.
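The disjointness in Lemma 1 can be checked numerically on a toy problem. The sketch below is our own illustrative code (not from the paper): it classifies a finite set of policies with known reward and cost returns into the feasible, optimal, and tempting categories of Definitions 1-3.

```python
def classify(policies, kappa):
    """Classify a finite policy set. policies: name -> (V_r, V_c).
    Returns the feasible set, the optimal policy name, and the tempting set."""
    feasible = {n for n, (vr, vc) in policies.items() if vc <= kappa}
    optimal = max(feasible, key=lambda n: policies[n][0])  # Definition 2
    v_star = policies[optimal][0]
    # Definition 3: strictly higher reward return than the optimal policy
    tempting = {n for n, (vr, vc) in policies.items() if vr > v_star}
    # Lemma 1: tempting and feasible policies are disjoint by construction
    assert tempting.isdisjoint(feasible)
    return feasible, optimal, tempting
```

For instance, with κ = 1 and three policies whose (reward, cost) returns are (1, 0), (2, 0.5), and (3, 2), the third is tempting: it out-rewards the optimal feasible policy but violates the constraint.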

3.2. SAFE RL UNDER OBSERVATIONAL PERTURBATIONS

We introduce a deterministic observational adversary ν : S → S which corrupts the state observation of the agent. We denote the corrupted observation as s̃ := ν(s) and the corrupted policy as π∘ν := π(a|s̃) = π(a|ν(s)), as the state is first corrupted by ν and then used by the policy π. Note that the adversary does not modify the underlying CMDP or the true states in the environment, but only the input of the agent. This setting mimics realistic scenarios; for instance, the adversary could be noise from the sensing system or errors from the upstream perception system. Constraint satisfaction is the top priority in safe RL, since violating constraints in safety-critical applications can be unaffordable. In addition, the reward metric is usually used to measure the agent's performance on the task, so significantly reducing the task reward may warn the agent of the existence of attacks. As a result, a strong adversary in the safe RL setting aims to generate more constraint violations while maintaining high rewards to keep the attack stealthy. In contrast, existing adversaries for standard RL aim to reduce the overall reward. Concretely, we evaluate the adversary performance for safe RL from two perspectives: Definition 4. (Attack) Effectiveness J_E(ν, π) is defined as the increase in cost value under the adversary: J_E(ν, π) = V^{π∘ν}_c(µ0) − V^π_c(µ0). An adversary ν is effective if J_E(ν, π) > 0. The effectiveness metric measures an adversary's capability of attacking the safe RL agent into violating constraints. We additionally introduce another metric to characterize the adversary's stealthiness w.r.t. the task reward in the safe RL setting. Definition 5. (Reward) Stealthiness J_S(ν, π) is defined as the increase in reward value under the adversary: J_S(ν, π) = V^{π∘ν}_r(µ0) − V^π_r(µ0). An adversary ν is stealthy if J_S(ν, π) ≥ 0.
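Both metrics can be estimated from Monte-Carlo rollouts by comparing clean and attacked episodes. A minimal numpy sketch (the function names are illustrative, not from the paper's codebase):

```python
import numpy as np

def discounted_return(signal, gamma=0.99):
    """sum_t gamma^t f_t for one episode, f in {r, c}."""
    signal = np.asarray(signal, dtype=float)
    return float(np.dot(gamma ** np.arange(len(signal)), signal))

def attack_metrics(clean_episodes, attacked_episodes, gamma=0.99):
    """Monte-Carlo estimates of effectiveness J_E and stealthiness J_S.
    Each episode is a (rewards, costs) pair of per-step signals."""
    def value(episodes, idx):
        return float(np.mean([discounted_return(e[idx], gamma) for e in episodes]))
    j_e = value(attacked_episodes, 1) - value(clean_episodes, 1)  # cost increase
    j_s = value(attacked_episodes, 0) - value(clean_episodes, 0)  # reward increase
    return j_e, j_s
```

An adversary is effective when the estimated J_E is positive and stealthy when J_S is non-negative.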
Note that the stealthiness concept is widely used in supervised learning (Sharif et al., 2016; Pitropakis et al., 2019). It usually means that the adversarial attack should be covert to human eyes regarding the input data so that it can hardly be identified (Machado et al., 2021). While stealthiness regarding the perturbation range is naturally satisfied by the perturbation-set definition, we introduce another level of stealthiness in terms of the task reward in the safe RL task: in some situations, the agent might easily detect a dramatic reward drop, so a stealthier attack maintains the agent's task reward while increasing constraint violations; see Appendix B.1 for more discussion. In practice, the power of the adversary is usually restricted (Madry et al., 2017; Zhang et al., 2020a), such that the perturbed observation is limited to a pre-defined perturbation set B(s): ∀s ∈ S, ν(s) ∈ B(s). Following convention, we define the perturbation set B^ϵ_p(s) as the ℓ_p-ball around the original observation: ∀s′ ∈ B^ϵ_p(s), ∥s′ − s∥_p ≤ ϵ, where ϵ is the ball size.
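Restricting the adversary to B^ϵ_p(s) amounts to projecting any candidate perturbed observation back into the ball around the clean observation. A small numpy sketch for the common ℓ_∞ and ℓ_2 cases (illustrative, not the paper's implementation):

```python
import numpy as np

def project_linf(s_adv, s, eps):
    """Project a perturbed observation into the l_inf ball B_eps(s)."""
    return np.clip(s_adv, s - eps, s + eps)

def project_l2(s_adv, s, eps):
    """Project a perturbed observation into the l_2 ball of radius eps around s."""
    delta = s_adv - s
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta = delta * (eps / norm)  # rescale onto the sphere
    return s + delta
```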

3.3. VULNERABILITY OF AN OPTIMAL POLICY UNDER ADVERSARIAL ATTACKS

We aim to design strong adversaries that are effective in making the agent unsafe while remaining reward stealthy. Motivated by Lemma 1, we propose the Maximum Reward (MR) attacker that corrupts the observation by maximizing the reward value: ν_MR = argmax_ν V^{π∘ν}_r(µ0). Proposition 1. For an optimal policy π* ∈ Π, the MR attacker is guaranteed to be reward stealthy and effective, given a sufficiently large perturbation set B^ϵ_p(s) such that V^{π*∘ν_MR}_r > V^{π*}_r. The MR attacker is counter-intuitive because maximizing the reward is exactly the objective of standard RL. This phenomenon is worth highlighting, since we observe that the MR attacker effectively makes the optimal policy unsafe while remaining stealthy regarding the reward in the safe RL setting. The proof is given in Appendix A.1. If we enlarge the policy space from Π : S × A → [0, 1] to an augmented space Π̄ : S × A × O → [0, 1], where O = {0, 1} is the space of an indicator, we can further observe the following important property of the optimal policy: Lemma 2. The optimal policy π* ∈ Π̄ of a tempting safe RL problem satisfies: V^{π*}_c(µ0) = κ. The proof is given in Appendix A.2. The definition of the augmented policy space is commonly used in hierarchical RL and can be viewed as a subset of option-based RL (Riemer et al., 2018; Zhang & Whiteson, 2019). Note that Lemma 2 holds in expectation rather than for a single trajectory. It suggests that the optimal policy of a tempting safe RL problem is vulnerable because it lies on the safety boundary, which motivates us to propose the Maximum Cost (MC) attacker that corrupts the observation of a policy π by maximizing the cost value: ν_MC = argmax_ν V^{π∘ν}_c(µ0). The MC attacker is clearly effective w.r.t. the optimal policy with a large enough perturbation range, since we directly solve for the adversarial observation that maximizes constraint violations.
Therefore, as long as ν_MC leads to a policy with a higher cost return than π*, it is guaranteed to make the agent violate the constraint, based on Lemma 2. Practically, given a fixed policy π and its critics Q^π_f(s, a), f ∈ {r, c}, we obtain the corrupted observation s̃ of s from the MR and MC attackers by solving:

ν_MR(s) = argmax_{s̃∈B^ϵ_p(s)} E_{ã∼π(a|s̃)}[Q^π_r(s, ã)], ν_MC(s) = argmax_{s̃∈B^ϵ_p(s)} E_{ã∼π(a|s̃)}[Q^π_c(s, ã)]. (2)

Suppose the policy π and the critics Q are all parametrized by differentiable models such as neural networks; then we can back-propagate gradients through Q and π to solve for the adversarial observation s̃. This is similar to the policy optimization procedure in DDPG (Lillicrap et al., 2015), except that the optimization domain is the observation space B^ϵ_p(s) rather than the policy parameter space. The attacker implementation details can be found in Appendix C.1.
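The optimization in Eq. (2) can be sketched as projected gradient ascent on the observation within an ℓ_∞ ball. The paper back-propagates through neural critics; the toy stand-in below instead takes a generic differentiable score function (Q^π_r∘π for MR, Q^π_c∘π for MC) and estimates its gradient by finite differences, so it is an illustration of the procedure rather than the actual implementation:

```python
import numpy as np

def pgd_attack(s, score_fn, eps=0.1, steps=10, lr=0.05):
    """Projected gradient ascent on the observation within an l_inf ball.
    score_fn(s_tilde) plays the role of E_{a~pi(.|s_tilde)}[Q(s, a)];
    maximizing it gives the MC attack (cost critic) or MR attack (reward critic).
    The gradient is estimated by central finite differences in this sketch."""
    s = np.asarray(s, dtype=float)
    s_adv = s.copy()
    h = 1e-5
    for _ in range(steps):
        grad = np.zeros_like(s_adv)
        for i in range(len(s_adv)):
            e = np.zeros_like(s_adv)
            e[i] = h
            grad[i] = (score_fn(s_adv + e) - score_fn(s_adv - e)) / (2 * h)
        # ascent step followed by projection back into B_eps(s)
        s_adv = np.clip(s_adv + lr * np.sign(grad), s - eps, s + eps)
    return s_adv
```

With a differentiable policy and critic, the finite-difference loop would be replaced by a single autograd backward pass, exactly as in DDPG-style policy optimization.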

3.4. THEORETICAL ANALYSIS OF ADVERSARIAL ATTACKS

Theorem 1 (Existence of optimal and deterministic MC/MR attackers). A deterministic MC attacker ν_MC and a deterministic MR attacker ν_MR always exist, and there is no stochastic adversary ν′ such that V^{π∘ν′}_c(µ0) > V^{π∘ν_MC}_c(µ0) or V^{π∘ν′}_r(µ0) > V^{π∘ν_MR}_r(µ0). Theorem 1 provides the theoretical foundation for the Bellman operators that require optimal and deterministic adversaries in the next section. The proof is given in Appendix A.3. We can also obtain an upper bound on the constraint violation induced by perturbing a single state s. Denote S_c as the set of unsafe states that have non-zero cost: S_c := {s′ ∈ S : c(s, a, s′) > 0}, and p_s as the maximum probability of entering unsafe states from state s: p_s = max_a Σ_{s′∈S_c} p(s′|s, a). Theorem 2 (One-step perturbation cost value bound). Suppose the optimal policy is locally L-Lipschitz continuous at state s: D_TV[π(·|s′)∥π(·|s)] ≤ L∥s′ − s∥_p, and the perturbation set of the adversary ν(s) is an ℓ_p-ball B^ϵ_p(s). Let Ṽ^{π,ν}_c(s) = E_{a∼π(·|ν(s)), s′∼p(·|s,a)}[c(s, a, s′) + γV^π_c(s′)] denote the cost value when only state s is perturbed. The upper bound of Ṽ^{π,ν}_c(s) is given by:

Ṽ^{π,ν}_c(s) − V^π_c(s) ≤ 2Lϵ (p_s C_m + γC_m / (1 − γ)). (3)

Note that Ṽ^{π,ν}_c(s) ≠ V^π_c(ν(s)), because the next state s′ still transitions from the original state s, i.e., s′ ∼ p(·|s, a) instead of s′ ∼ p(·|ν(s), a). Theorem 2 indicates that the power of an adversary is controlled by the policy smoothness L and the perturbation range ϵ. In addition, the p_s term indicates that a safe policy should keep a safe distance from unsafe states to avoid being attacked. We further derive the upper bound of constraint violation when attacking entire episodes. Theorem 3 (Episodic bound).
Given a feasible policy π ∈ Π^κ_M, suppose L-Lipschitz continuity holds globally for π and the perturbation set is an ℓ_p-ball; then the following bound holds:

V^{π∘ν}_c(µ0) ≤ κ + 2LϵC_m (1/(1 − γ) + 4γLϵ/(1 − γ)²) (max_s p_s + γ/(1 − γ)). (4)

See the proofs of Theorems 2 and 3 in Appendix A.4 and A.5. We can observe that the maximum cost value under perturbations is bounded by the Lipschitzness of the policy and the maximum perturbation range ϵ. The bound is tight in the sense that when ϵ → 0 (no attack) or L → 0 (a constant policy π(·|s) for all states), the RHS of Eq. (3) becomes 0 and the RHS of Eq. (4) becomes κ, which means the attack is ineffective.

4.1. ADVERSARIAL TRAINING AGAINST OBSERVATIONAL PERTURBATIONS

To defend against observational attacks, we propose an adversarial training method for safe RL. We directly optimize the policy on the corrupted sampled trajectories τ̃ = {s_0, ã_0, s_1, ã_1, ...}, where ã_t ∼ π(a|ν(s_t)). We can compactly represent the adversarial safe RL objective under ν as:

π* = argmax_π V^{π∘ν}_r(µ0), s.t. V^{π∘ν}_c(µ0) ≤ κ, ∀ν. (5)

The adversarial training objective (5) can be solved by many policy-based safe RL methods, such as the primal-dual approach, and we show that the Bellman operator for evaluating the policy performance under a deterministic adversary is a contraction (see Appendix A.6 for the proof). Theorem 4 (Bellman contraction). Define the Bellman policy operator T^π : R^{|S|} → R^{|S|} as:

(T^π V^{π∘ν}_f)(s) = Σ_{a∈A} π(a|ν(s)) Σ_{s′∈S} p(s′|s, a) [f(s, a, s′) + γV^{π∘ν}_f(s′)], f ∈ {r, c}. (6)

The Bellman equation can be written as V^{π∘ν}_f(s) = (T^π V^{π∘ν}_f)(s). In addition, the operator T^π is a contraction under the sup-norm ∥·∥_∞ and has a fixed point. Theorem 4 shows that we can accurately evaluate the task performance (reward return) and the safety performance (cost return) of a policy under one fixed deterministic adversary, similar to solving a standard CMDP. The Bellman contraction property provides the theoretical justification for adversarial training, i.e., training a safe RL agent on observationally perturbed sampled trajectories. The key part is then selecting proper adversaries during learning, such that the trained policy is robust and safe against any other attacker. We can easily show that performing adversarial training with the MC or the MR attacker makes the agent robust against the most effective or the most reward-stealthy perturbations, respectively (see Appendix A.6 for details). Remark 1. Suppose a policy π′ trained under the MC attacker satisfies: V^{π′∘ν_MC}_c(µ0) ≤ κ; then π′∘ν is guaranteed to be feasible under any B^ϵ_p-bounded adversarial perturbation.
Similarly, suppose a policy π′ trained under the MR attacker satisfies: V^{π′∘ν_MR}_c(µ0) ≤ κ; then π′∘ν is guaranteed to be non-tempting under any B^ϵ_p-bounded adversarial perturbation. Remark 1 indicates that by solving the adversarial constrained optimization problem under the MC attacker, all feasible solutions will be safe under any bounded adversarial perturbation. It also shows a nice property for training a robust policy: while the max operation over the reward in the safe RL objective may lead the policy into the tempting policy class, adversarial training with the MR attacker naturally keeps the trained policy at a safe distance from the tempting policy class. Practically, we observe that both MC and MR attackers can increase robustness and safety via adversarial training, and in principle they can be plugged into any on-policy safe RL algorithm. We leave the robust training framework for off-policy safe RL methods as future work. We particularly adopt the primal-dual methods (Ray et al., 2019; Stooke et al., 2020) that are widely used in the safe RL literature as the learner; the adversarial training objective in Eq. (5) can then be converted to a min-max form by using the Lagrange multiplier λ:

(π*, λ*) = argmin_{λ≥0} max_{π∈Π} V^{π∘ν}_r(µ0) − λ(V^{π∘ν}_c(µ0) − κ).

Iteratively solving the inner maximization (primal update) via any policy optimization method and the outer minimization (dual update) via gradient descent yields the Lagrangian algorithm. Under proper learning rates and bounded noise assumptions, the iterates (π_n, λ_n) converge to a fixed point (a local minimum) almost surely (Tessler et al., 2018; Paternain et al., 2019). Based on the previous theoretical analysis, we adopt MC or MR as the adversary when sampling trajectories. The scheduler also trains the reward and cost Q-value functions needed by the MR and MC attackers, because many on-policy algorithms such as PPO do not use them.
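The min-max structure above can be illustrated with the dual step alone. The sketch below is our own simplification (the actual learner is PID PPO-Lagrangian): it shows the penalized objective of the inner maximization and the projected gradient step on the multiplier λ.

```python
def lagrangian(v_r, v_c, lam, kappa):
    """Objective of the inner maximization: V_r - lambda * (V_c - kappa)."""
    return v_r - lam * (v_c - kappa)

def dual_update(lam, v_c, kappa, lr=0.01):
    """Projected gradient step on the multiplier (outer minimization):
    lambda grows when the cost return exceeds the threshold kappa,
    shrinks otherwise, and is clipped at zero."""
    return max(0.0, lam + lr * (v_c - kappa))
```

Intuitively, while the attacked policy violates the constraint (V_c > κ), λ rises and the penalty on cost dominates the objective; once the constraint is satisfied, λ decays back toward zero.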
In addition, the scheduler can adapt the power of the adversary to the learning progress, since a strong adversary at the beginning may prevent the learner from exploring the environment and thus corrupt training. We gradually increase the perturbation range ϵ along with the training epochs to adjust the adversary perturbation set B^ϵ_p, so that the agent is not too conservative in the early stage of training. A similar idea is used in the adversarial training (Salimans et al., 2016; Arjovsky & Bottou, 2017; Gowal et al., 2018) and curriculum learning literature (Dennis et al., 2020; Portelas et al., 2020). See more implementation details in Appendix C.3.
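One possible form of the ϵ curriculum is a linear ramp after a no-attack warmup; the exact schedule used in the paper is in Appendix C.3, so the shape and parameters below are our assumptions for illustration:

```python
def epsilon_schedule(epoch, total_epochs, eps_max, warmup_frac=0.2):
    """Perturbation range curriculum (illustrative): no attack during a
    warmup period, then a linear ramp from 0 to eps_max, then constant."""
    warmup = int(total_epochs * warmup_frac)
    if epoch < warmup:
        return 0.0
    frac = (epoch - warmup) / max(1, total_epochs - warmup)
    return min(eps_max, eps_max * frac)
```

The scheduler would call this once per epoch and pass the resulting ϵ to the MC/MR attacker when sampling trajectories.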

5. EXPERIMENT

In this section, we aim to answer the questions raised in Sec. 1. To this end, we adopt robot locomotion continuous control tasks that are easy to interpret, motivated by safety, and used in many previous works (Achiam et al., 2017; Chow et al., 2019; Zhang et al., 2020b). The simulation environments are from a publicly available benchmark (Gronauer, 2022). We consider two tasks and train multiple different robots (Car, Drone, Ant) for each task. Run task: agents are rewarded for running fast between two safety boundaries and incur costs for violating constraints if they cross the boundaries or exceed an agent-specific velocity threshold. The tempting policies can violate the velocity constraint to obtain more reward. Circle task: agents are rewarded for running in a circle in a clockwise direction but are constrained to stay within a safe region that is smaller than the radius of the target circle. The tempting policies in this task leave the safe region to gain more reward. We name each task in the Robot-Task format, for instance, Car-Run. More detailed descriptions and video demos are available on our anonymous project website. In addition, we use the PID PPO-Lagrangian (abbreviated as PPOL) method (Stooke et al., 2020) as the base safe RL algorithm to fairly compare different robust training approaches, while the proposed adversarial training can easily be used in other on-policy safe RL methods as well. The detailed hyperparameters of the adversaries and safe RL algorithms can be found in Appendix C.

5.1. ADVERSARIAL ATTACKER COMPARISON

We first demonstrate the vulnerability of optimal safe RL policies without adversarial training and compare the performance of different adversaries. All the adversaries are restricted to the same ℓ_∞-norm perturbation set B^ϵ_∞. We adopt three adversary baselines, including one improved version. Random attacker baseline. A simple baseline that samples the corrupted observation uniformly at random within the perturbation set. Maximum Action Difference (MAD) attacker baseline. The MAD attacker (Zhang et al., 2020a) is designed for standard RL tasks and is shown to be effective in decreasing a trained RL agent's reward return. The adversarial observation is obtained by maximizing the KL-divergence between the corrupted and original policies: ν_MAD(s) = argmax_{s̃∈B^ϵ_p(s)} D_KL[π(a|s̃)∥π(a|s)]. Adaptive MAD (AMAD) attacker. Since the vanilla MAD attacker is not designed for safe RL, we further improve it into an adaptive version as a stronger baseline. The motivation comes from Lemma 2: the optimal policy will be close to the constraint boundary, where risk is high. To better understand this property, we introduce the discounted future state distribution d^π(s) (Kakade, 2003), which allows us to rewrite the result in Lemma 2 as (see Appendix C.6 for the derivation and implementation details): (1/(1−γ)) ∫_{s∈S} d^{π*}(s) ∫_{a∈A} π*(a|s) ∫_{s′∈S} p(s′|s, a) c(s, a, s′) ds′ da ds = κ. We can see that performing the MAD attack on the optimal policy π* in low-risk regions with small p(s′|s, a)c(s, a, s′) values may not be effective. Therefore, AMAD perturbs the observation only when the agent is within high-risk regions, determined by the cost value function and a threshold ξ, to achieve more effective attacks: ν_AMAD(s) := ν_MAD(s) if V^π_c(s) ≥ ξ, and s otherwise. Experiment setting. We evaluate the performance of the three baselines above and our MC and MR adversaries by attacking well-trained PPO-Lagrangian policies in different tasks.
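The AMAD gating rule described above reduces to a simple case split. A minimal sketch (with v_c and mad_fn as hypothetical callables standing in for the learned cost value function and the MAD inner optimizer):

```python
def amad_attack(s, v_c, mad_fn, xi):
    """Attack only in high-risk regions: apply the MAD perturbation when the
    cost value V_c(s) reaches the threshold xi, otherwise leave s clean."""
    return mad_fn(s) if v_c(s) >= xi else s
```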
The trained policies achieve nearly zero constraint violation cost without observational perturbations. We keep the trained model weights and environment seeds fixed for all attackers to ensure fair comparisons. Experiment result. Fig. 2 shows the attack results of the 5 adversaries on PPOL-vanilla. Each column corresponds to an environment; the first row shows the episode reward and the second row the episode cost of constraint violations. We can see that the vanilla safe RL policies are vulnerable, since the safety performance deteriorates (cost increases) significantly even with a small adversarial perturbation range ϵ. Generally, the cost increases as ϵ increases, except for the MAD attacker. Although MAD reduces the agent's reward quite well, it fails to effectively increase the cost, because the reward decrease may keep the agent away from high-risk regions; it is even worse than the random attacker in the Car-Circle task. The improved AMAD attacker is a stronger baseline than MAD, as it only attacks in high-risk regions and thus has a higher chance of pushing the agent into unsafe regions and inducing more constraint violations. More comparisons between MAD and AMAD can be found in Appendix C.9. Our proposed MC and MR attackers outperform all baseline attackers (Random, MAD, and AMAD) in effectiveness, increasing the cost by a large margin in most tasks. Surprisingly, the MR attacker can achieve even higher costs than MC and is more stealthy, as it maintains or increases the reward, which validates our theoretical analysis and the existence of tempting policies. Results. The evaluation results of different trained policies under adversarial attacks are shown in Table 1, where Natural denotes the performance without noise. We train each algorithm with 5 random seeds and evaluate each trained policy for 50 episodes under each attacker to obtain the values.
The training and testing perturbation ranges ϵ are the same. We use gray shadows to highlight the two safest agents with the smallest cost values, excluding failed agents whose rewards are less than 30% of the PPOL-vanilla method; we mark failed agents with ⋆. Due to the page limit, we leave the evaluation results under the random and MAD attackers to Appendix C.9. Analysis. Although most baselines achieve near-zero natural cost, their safety performance is vulnerable under the strong MC and MR attackers, which are more effective than AMAD in inducing unsafe behaviors. Generalization to other safe RL methods. We also conduct experiments with other types of base safe RL algorithms, including another on-policy method, FOCOPS (Zhang et al., 2020b), one off-policy method, SAC-Lagrangian (Yang et al., 2021), and one policy-gradient-free off-policy method, CVPO (Liu et al., 2022). Due to the page limit, we leave the results and detailed discussions to Appendix C.9. In summary, all the vanilla safe RL methods suffer from the vulnerability issue: though they are safe in noise-free environments, they are no longer safe under strong attacks, which validates the necessity of studying the observational robustness of safe RL agents. In addition, adversarial training helps improve robustness and makes the FOCOPS agent much safer under attacks. Therefore, the problem formulations, methods, results, and analysis generalize to different safe RL approaches, hopefully attracting more attention in the safe RL community to the inherent connection between safety and robustness.

6. CONCLUSION

We study the observational robustness of constraint satisfaction in safe RL and show that the optimal policy of a tempting problem can be vulnerable. We propose two effective attackers to induce unsafe behaviors. An interesting and surprising finding is that the maximum reward attack is as effective as directly maximizing the cost while remaining stealthy. We further propose an adversarial training method to increase robustness and safety performance, and extensive experiments show that the proposed method outperforms robust training techniques designed for standard RL settings. One limitation of this work is that the adversarial training pipeline could be expensive for real-world RL applications, because it requires attacking the behavior agents when collecting data. In addition, adversarial training might be unstable for high-dimensional and complex problems. Nevertheless, our results reveal a previously unrecognized problem in safe RL, and we hope this work encourages other researchers to study safety from the robustness perspective, as both safety and robustness are important ingredients for real-world deployment.

A.2 PROOF OF LEMMA 2 -OPTIMAL POLICY'S COST VALUE

Lemma 2 states that in the augmented policy space $\bar\Pi$, the optimal policy $\pi^*$ of a tempting safe RL problem satisfies $V_c^{\pi^*}(\mu_0) = \kappa$. It is clear that the tempting policy space and the original policy space are subsets of the augmented policy space: $\Pi_{TM} \subset \Pi \subset \bar\Pi$. We prove Lemma 2 by contradiction.

Proof. Suppose the optimal policy $\pi^*(a|s,o)$ in the augmented policy space for a tempting safe RL problem has $V_c^{\pi^*}(\mu_0) < \kappa$, and its option update function is $\pi_o^*$. Denote by $\pi' \in \Pi_{TM}$ a tempting policy. By Lemma 1, we know that $V_c^{\pi'}(\mu_0) > \kappa$ and $V_r^{\pi'}(\mu_0) > V_r^{\pi^*}(\mu_0)$. We can then compute a weight $\alpha$:
$$\alpha = \frac{\kappa - V_c^{\pi^*}(\mu_0)}{V_c^{\pi'}(\mu_0) - V_c^{\pi^*}(\mu_0)}, \tag{8}$$
so that
$$\alpha V_c^{\pi'}(\mu_0) + (1-\alpha)\, V_c^{\pi^*}(\mu_0) = \kappa. \tag{9}$$
Now consider the augmented space $\bar\Pi$. Since $\Pi \subseteq \bar\Pi$ and $\pi^*, \pi' \in \bar\Pi$, we can define another policy $\bar\pi$ as the trajectory-wise mixture of $\pi^*$ and $\pi'$:
$$\bar\pi(a_t|s_t,o_t) = \begin{cases}\pi'(a_t|s_t), & \text{if } o_t = 1,\\ \pi^*(a_t|s_t,u_t), & \text{if } o_t = 0,\end{cases} \tag{10}$$
with $o_{t+1}=o_t$, $o_0 \sim \mathrm{Bernoulli}(\alpha)$, and the update of $u$ following the definition of $\pi_o^*$. Therefore, a trajectory of $\bar\pi$ is sampled from $\pi'$ with probability $\alpha$ and from $\pi^*$ with probability $1-\alpha$:
$$\tau \sim \bar\pi := \begin{cases}\tau\sim\pi', & \text{with probability } \alpha,\\ \tau\sim\pi^*, & \text{with probability } 1-\alpha.\end{cases} \tag{11}$$
Then we can conclude that $\bar\pi$ is also feasible:
$$V_c^{\bar\pi}(\mu_0) = \mathbb{E}_{\tau\sim\bar\pi}\Big[\sum_{t=0}^\infty \gamma^t c_t\Big] = \alpha\,\mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^\infty \gamma^t c_t\Big] + (1-\alpha)\,\mathbb{E}_{\tau\sim\pi^*}\Big[\sum_{t=0}^\infty \gamma^t c_t\Big] \tag{12}$$
$$= \alpha V_c^{\pi'}(\mu_0) + (1-\alpha)\, V_c^{\pi^*}(\mu_0) = \kappa. \tag{13}$$
In addition, $\bar\pi$ has a higher reward return than the optimal policy $\pi^*$:
$$V_r^{\bar\pi}(\mu_0) = \mathbb{E}_{\tau\sim\bar\pi}\Big[\sum_{t=0}^\infty \gamma^t r_t\Big] = \alpha\,\mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^\infty \gamma^t r_t\Big] + (1-\alpha)\,\mathbb{E}_{\tau\sim\pi^*}\Big[\sum_{t=0}^\infty \gamma^t r_t\Big] \tag{14}$$
$$= \alpha V_r^{\pi'}(\mu_0) + (1-\alpha)\, V_r^{\pi^*}(\mu_0) \tag{15}$$
$$> \alpha V_r^{\pi^*}(\mu_0) + (1-\alpha)\, V_r^{\pi^*}(\mu_0) = V_r^{\pi^*}(\mu_0), \tag{16}$$
where the inequality comes from the definition of a tempting policy.
Since $\bar\pi$ is both feasible and has strictly higher reward return than the policy $\pi^*$, we know that $\pi^*$ is not optimal, which contradicts our assumption. Therefore, the optimal policy $\pi^*$ must satisfy $V_c^{\pi^*}(\mu_0) = \kappa$.

Remark 2. The cost value function $V_c^{\pi^*}(\mu_0) = \mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty \gamma^t c_t]$ is an expectation over sampled trajectories (an expectation over episodes) rather than within a single trajectory, because for a single sampled trajectory $\tau\sim\pi$, the discounted cost $\sum_{t=0}^\infty \gamma^t c_t$ does not necessarily satisfy the constraint.

Remark 3. The proof also indicates that the range of the metric function $\mathcal{V} := \{(V_r^\pi(\mu_0), V_c^\pi(\mu_0))\}$ (shown as the blue circle in Fig. 1) is convex when we extend $\Pi$ to a linear mixture of $\Pi$, i.e., let $O=\{1,2,3,\dots\}$ and $\bar\Pi: S\times A\times O \to [0,1]$. Consider $\alpha=[\alpha_1,\alpha_2,\dots]$ with $\alpha_i\geq 0$, $\sum_{i} \alpha_i = 1$, and $\pi=[\pi_1,\pi_2,\dots]$. We can construct a policy $\bar\pi \in \bar\Pi$, $\bar\pi = \langle\alpha,\pi\rangle$:
$$\bar\pi(a_t|s_t,o_t) = \pi_i(a_t|s_t), \quad \text{if } o_t = i, \tag{17}$$
with $o_{t+1}=o_t$ and $\Pr(o_0=i)=\alpha_i$. Then we have $\tau\sim\langle\alpha,\pi\rangle := \tau\sim\pi_i$ with probability $\alpha_i$, $i=1,2,\dots$. Similar to the proof above, $V_f^{\langle\alpha,\pi\rangle}(\mu_0) = \langle\alpha, V_f^{\pi}(\mu_0)\rangle$, $f\in\{r,c\}$, where $V_f^{\pi}(\mu_0) = [V_f^{\pi_1}(\mu_0), V_f^{\pi_2}(\mu_0),\dots]$. Consider any $(v_{r1},v_{c1}),(v_{r2},v_{c2})\in\mathcal{V}$ corresponding to policy mixtures $\langle\alpha,\pi\rangle$ and $\langle\beta,\pi\rangle$ respectively; then for any $t\in[0,1]$, the new mixture $\langle t\alpha+(1-t)\beta, \pi\rangle \in \bar\Pi$ satisfies $V_f^{\langle t\alpha+(1-t)\beta,\pi\rangle}(\mu_0) = t\, v_{f1} + (1-t)\, v_{f2} \in \mathcal{V}$. Therefore, $\mathcal{V}$ is a convex set.

Remark 4. The enlarged policy space guarantees the validity of Lemma 2 but is not always indispensable.
In most environments, with the original policy space $\Pi$, the metric function space $\mathcal{V} := \{(V_r^\pi(\mu_0), V_c^\pi(\mu_0)) \mid \pi\in\Pi\}$ is a connected set (i.e., if there are $\pi_1,\pi_2\in\Pi$ with $V_c^{\pi_1}(\mu_0) < \kappa < V_c^{\pi_2}(\mu_0)$, then there exists a $\pi\in\Pi$ such that $V_c^\pi(\mu_0)=\kappa$), and we can obtain the optimal policy exactly on the constraint boundary without expanding the policy space.

A.3 PROOF OF THEOREM 1 -EXISTENCE OF OPTIMAL DETERMINISTIC MC/MR ADVERSARY

Existence. Given a fixed policy $\pi$, we first introduce two adversary MDPs, $\hat M_r = (S, \hat A, P, \hat R_r, \gamma)$ for the reward-maximization adversary and $\hat M_c = (S, \hat A, P, \hat R_c, \gamma)$ for the cost-maximization adversary, to prove the existence of the optimal adversary. In the adversary MDPs, the adversary acts as the agent and chooses a perturbed state as its action (i.e., $\hat a = \tilde s$) to maximize the cumulative reward $\hat R$. Therefore, in the adversary MDPs, the action space is $\hat A = S$, and $\nu(\cdot|s)$ denotes the adversary's policy distribution. The adversary reward is defined as
$$\hat R_f(s, \hat a, s') = \begin{cases} \dfrac{\sum_a \pi(a|\hat a)\,p(s'|s,a)\,f(s,a,s')}{\sum_a \pi(a|\hat a)\,p(s'|s,a)}, & \hat a \in B_p^\epsilon(s), \\[6pt] -C, & \hat a \notin B_p^\epsilon(s), \end{cases} \qquad f\in\{r,c\},$$
where $\hat a = \tilde s \sim \nu(\cdot|s)$ and $C$ is a constant. Therefore, with a sufficiently large $C$, we can guarantee that the optimal adversary $\nu^*$ will not choose a perturbed state $\hat a$ outside the $\ell_p$-ball of the given state $s$, i.e., $\nu^*(\hat a|s) = 0,\ \forall \hat a \notin B_p^\epsilon(s)$. According to the properties of MDPs (Sutton et al., 1998), $\hat M_r$ and $\hat M_c$ have corresponding optimal policies $\nu_r^*, \nu_c^*$, which are deterministic, assigning unit probability mass to the optimal action $\hat a$ for each state. Next, we prove that $\nu_r^* = \nu_{MR}$ and $\nu_c^* = \nu_{MC}$. Consider the value function in $\hat M_f$, $f\in\{r,c\}$, for an adversary $\nu \in \mathcal{N} := \{\nu \mid \nu(\hat a|s) = 0,\ \forall \hat a\notin B_p^\epsilon(s)\}$. Denoting the adversary MDP transition by $\hat p(s'|s,\hat a) = \sum_a \pi(a|\hat a)\, p(s'|s,a)$, we have
$$\hat V_f^\nu(s) = \mathbb{E}_{\hat a\sim\nu(\cdot|s),\, s'\sim \hat p(\cdot|s,\hat a)}\big[\hat R_f(s,\hat a,s') + \gamma \hat V_f^\nu(s')\big] \tag{22}$$
$$= \sum_{\hat a} \nu(\hat a|s) \sum_{s'} \hat p(s'|s,\hat a)\,\big[\hat R_f(s,\hat a,s') + \gamma \hat V_f^\nu(s')\big] \tag{23}$$
$$= \sum_{\hat a} \nu(\hat a|s) \sum_{s'} \Big(\sum_a \pi(a|\hat a)\,p(s'|s,a)\Big)\left[\frac{\sum_a \pi(a|\hat a)\,p(s'|s,a)\,f(s,a,s')}{\sum_a \pi(a|\hat a)\,p(s'|s,a)} + \gamma \hat V_f^\nu(s')\right] \tag{24}$$
$$= \sum_{\hat a} \nu(\hat a|s) \sum_{s'} \sum_a \pi(a|\hat a)\,p(s'|s,a)\,\big[f(s,a,s') + \gamma \hat V_f^\nu(s')\big] \tag{25}$$
$$= \sum_{s'} \sum_a p(s'|s,a)\,\pi(a|\nu(s))\,\big[f(s,a,s') + \gamma \hat V_f^\nu(s')\big], \tag{26}$$
where the last equality holds for a deterministic adversary $\nu$. Recall the value function of the original safe RL problem:
$$V_f^{\pi\circ\nu}(s) = \sum_{s'} \sum_a p(s'|s,a)\,\pi(a|\nu(s))\,\big[f(s,a,s') + \gamma V_f^{\pi\circ\nu}(s')\big]. \tag{27}$$
Therefore, $V_f^{\pi\circ\nu}(s) = \hat V_f^\nu(s)$ for $\nu\in\mathcal{N}$. Note that in the adversary MDPs, $\nu_f^* \in \mathcal{N}$ and
$$\nu_f^* = \arg\max_\nu \mathbb{E}_{a\sim\pi(\cdot|\nu(s)),\, s'\sim p(\cdot|s,a)}\big[f(s,a,s') + \gamma \hat V_f^\nu(s')\big]. \tag{28}$$
We also know that $\nu_f^*$ is deterministic, so
$$\nu_f^*(s) = \arg\max_\nu \mathbb{E}_{a\sim\pi(\cdot|\nu(s)),\, s'\sim p(\cdot|s,a)}\big[f(s,a,s') + \gamma \hat V_f^\nu(s')\big] \tag{29}$$
$$= \arg\max_\nu \mathbb{E}_{a\sim\pi(\cdot|\nu(s)),\, s'\sim p(\cdot|s,a)}\big[f(s,a,s') + \gamma V_f^{\pi\circ\nu}(s')\big] \tag{30}$$
$$= \arg\max_\nu V_f^{\pi\circ\nu}(s). \tag{31}$$
Therefore, $\nu_r^* = \nu_{MR}$ and $\nu_c^* = \nu_{MC}$.

Optimality. We prove the optimality by contradiction. By definition, $\forall s_0 \in S$,
$$V_c^{\pi\circ\nu'}(s_0) \leq V_c^{\pi\circ\nu_{MC}}(s_0). \tag{32}$$
Suppose there exists $\nu'$ such that $V_c^{\pi\circ\nu'}(\mu_0) > V_c^{\pi\circ\nu_{MC}}(\mu_0)$; then there also exists $s_0 \in S$ such that $V_c^{\pi\circ\nu'}(s_0) > V_c^{\pi\circ\nu_{MC}}(s_0)$, which contradicts Eq. (32). Similarly, we can prove that the property holds for $\nu_{MR}$ by replacing $V_c^{\pi\circ\nu}$ with $V_r^{\pi\circ\nu}$. Therefore, no other adversary achieves higher attack effectiveness than $\nu_{MC}$ or higher reward stealthiness than $\nu_{MR}$.

A.4 PROOF OF THEOREM 2 -ONE-STEP ATTACK COST BOUND

We have
$$\tilde V_c^{\pi,\nu}(s) = \mathbb{E}_{a\sim\pi(\cdot|\nu(s)),\, s'\sim p(\cdot|s,a)}\big[c(s,a,s') + \gamma V_c^\pi(s')\big]. \tag{33}$$
By the Bellman equation,
$$V_c^\pi(s) = \mathbb{E}_{a\sim\pi(\cdot|s),\, s'\sim p(\cdot|s,a)}\big[c(s,a,s') + \gamma V_c^\pi(s')\big]. \tag{34}$$
For simplicity, denote $p_{sa}^{s'} = p(s'|s,a)$; we have
$$\big|\tilde V_c^{\pi,\nu}(s) - V_c^\pi(s)\big| \leq 2\, D_{TV}\big[\pi(\cdot|\nu(s))\,\|\,\pi(\cdot|s)\big]\,\max_{a\in A}\Big(\sum_{s'\in S_c} p_{sa}^{s'}\, c(s,a,s') + \sum_{s'\in S} p_{sa}^{s'}\, \gamma V_c^\pi(s')\Big) \tag{37}$$
$$\leq 2L\,\|\nu(s)-s\|_p\,\max_{a\in A}\Big(\sum_{s'\in S_c} p_{sa}^{s'}\, C_m + \sum_{s'\in S} p_{sa}^{s'}\,\frac{\gamma C_m}{1-\gamma}\Big) \tag{38}$$
$$\leq 2L\epsilon\Big(p_s\, C_m + \frac{\gamma C_m}{1-\gamma}\Big). \tag{39}$$

A.5 PROOF OF THEOREM 3 - EPISODIC ATTACK COST BOUND

According to Corollary 2 in CPO (Achiam et al., 2017),
$$V_c^{\pi\circ\nu}(\mu_0) - V_c^\pi(\mu_0) \leq \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^\pi,\, a\sim\pi\circ\nu}\Big[A_c^\pi(s,a) + \frac{2\gamma\delta_c^{\pi\circ\nu}}{1-\gamma}\, D_{TV}\big[\pi'(\cdot|s)\,\|\,\pi(\cdot|s)\big]\Big], \tag{40}$$
where $\delta_c^{\pi\circ\nu} = \max_s |\mathbb{E}_{a\sim\pi\circ\nu} A_c^\pi(s,a)|$ and $A_c^\pi(s,a) = \mathbb{E}_{s'\sim p(\cdot|s,a)}[c(s,a,s') + \gamma V_c^\pi(s') - V_c^\pi(s)]$ denotes the cost advantage function. Note that
$$\mathbb{E}_{a\sim\pi\circ\nu} A_c^\pi(s,a) = \mathbb{E}_{a\sim\pi\circ\nu}\big[\mathbb{E}_{s'\sim p(\cdot|s,a)}[c(s,a,s') + \gamma V_c^\pi(s') - V_c^\pi(s)]\big] \tag{41}$$
$$= \mathbb{E}_{a\sim\pi\circ\nu,\, s'\sim p(\cdot|s,a)}\big[c(s,a,s') + \gamma V_c^\pi(s')\big] - V_c^\pi(s) \tag{42}$$
$$= \tilde V_c^{\pi,\nu}(s) - V_c^\pi(s). \tag{43}$$
By Theorem 2,
$$\delta_c^{\pi\circ\nu} = \max_s \big|\mathbb{E}_{a\sim\pi\circ\nu} A_c^\pi(s,a)\big| \leq \max_s\, 2L\epsilon\Big(p_s\, C_m + \frac{\gamma C_m}{1-\gamma}\Big) = 2L\epsilon C_m\Big(\max_s p_s + \frac{\gamma}{1-\gamma}\Big). \tag{44-46}$$
Therefore, we can derive
$$V_c^{\pi\circ\nu}(\mu_0) - V_c^\pi(\mu_0) \leq \frac{1}{1-\gamma}\max_s \big|\mathbb{E}_{a\sim\pi\circ\nu}A_c^\pi(s,a)\big| + \frac{2\gamma\delta_c^{\pi\circ\nu}}{(1-\gamma)^2}\, D_{TV}\big[\pi'(\cdot|s)\,\|\,\pi(\cdot|s)\big] \tag{47}$$
$$= \Big(\frac{1}{1-\gamma} + \frac{2\gamma D_{TV}}{(1-\gamma)^2}\Big)\,\delta_c^{\pi\circ\nu} \tag{48}$$
$$\leq 2L\epsilon C_m\Big(\frac{1}{1-\gamma} + \frac{4\gamma L\epsilon}{(1-\gamma)^2}\Big)\Big(\max_s p_s + \frac{\gamma}{1-\gamma}\Big). \tag{49}$$
Note that $\pi$ is a feasible policy, i.e., $V_c^\pi(\mu_0) \leq \kappa$. Therefore,
$$V_c^{\pi\circ\nu}(\mu_0) \leq \kappa + 2L\epsilon C_m\Big(\frac{1}{1-\gamma} + \frac{4\gamma L\epsilon}{(1-\gamma)^2}\Big)\Big(\max_s p_s + \frac{\gamma}{1-\gamma}\Big). \tag{50}$$

A.6 PROOF OF THEOREM 4 AND PROPOSITION 1 - BELLMAN CONTRACTION

Recall Theorem 4: the Bellman policy operator $\mathcal{T}^\pi$ is a contraction under the sup-norm $\|\cdot\|_\infty$ and converges to its fixed point. The Bellman policy operator is defined as:
$$(\mathcal{T}^\pi V_f^{\pi\circ\nu})(s) = \sum_{a\in A}\pi(a|\nu(s))\sum_{s'\in S} p(s'|s,a)\big[f(s,a,s') + \gamma V_f^{\pi\circ\nu}(s')\big], \quad f\in\{r,c\}. \tag{51}$$
The proof is as follows:

Proof.
Denote $f_{sa}^{s'} = f(s,a,s')$, $f\in\{r,c\}$, and $p_{sa}^{s'} = p(s'|s,a)$ for simplicity. We have:
$$\big|(\mathcal{T}^\pi U_f^{\pi\circ\nu})(s) - (\mathcal{T}^\pi V_f^{\pi\circ\nu})(s)\big| = \Big|\sum_{a\in A}\pi(a|\nu(s))\sum_{s'\in S} p_{sa}^{s'}\big[f_{sa}^{s'} + \gamma U_f^{\pi\circ\nu}(s')\big] - \sum_{a\in A}\pi(a|\nu(s))\sum_{s'\in S} p_{sa}^{s'}\big[f_{sa}^{s'} + \gamma V_f^{\pi\circ\nu}(s')\big]\Big| \tag{52-53}$$
$$= \gamma\,\Big|\sum_{a\in A}\pi(a|\nu(s))\sum_{s'\in S} p_{sa}^{s'}\big[U_f^{\pi\circ\nu}(s') - V_f^{\pi\circ\nu}(s')\big]\Big| \tag{54}$$
$$\leq \gamma \max_{s'\in S}\big|U_f^{\pi\circ\nu}(s') - V_f^{\pi\circ\nu}(s')\big| \tag{55}$$
$$= \gamma\,\big\|U_f^{\pi\circ\nu} - V_f^{\pi\circ\nu}\big\|_\infty. \tag{56}$$
Since the above holds for any state $s$, we have:
$$\big\|\mathcal{T}^\pi U_f^{\pi\circ\nu} - \mathcal{T}^\pi V_f^{\pi\circ\nu}\big\|_\infty \leq \gamma\,\big\|U_f^{\pi\circ\nu} - V_f^{\pi\circ\nu}\big\|_\infty. \tag{57}$$
Then, based on the Contraction Mapping Theorem (Meir & Keeler, 1969), we know that $\mathcal{T}^\pi$ has a unique fixed point $V_f^*(s)$, $f\in\{r,c\}$, such that $V_f^*(s) = (\mathcal{T}^\pi V_f^*)(s)$.

With the proof of the Bellman contraction, we show why adversarial training can succeed under observational attacks. Since the Bellman operator is a contraction for both the reward and the cost under adversarial attacks, we can accurately evaluate the performance of the corrupted policy in the policy evaluation phase. This is a crucial and strong guarantee for the success of adversarial training, because we cannot improve the policy without well-estimated values.

Proposition 1 states that if a trained policy $\pi'$ under the MC attacker satisfies $V_c^{\pi'\circ\nu_{MC}}(\mu_0) \leq \kappa$, then $\pi'\circ\nu$ is guaranteed to be feasible under any $B_p^\epsilon$-bounded adversarial perturbations. Similarly, if a trained policy $\pi'$ under the MR attacker satisfies $V_c^{\pi'\circ\nu_{MR}}(\mu_0) \leq \kappa$, then $\pi'\circ\nu$ is guaranteed to be non-tempting under any $B_p^\epsilon$-bounded adversarial perturbations. Before proving it, we first give the following definitions and lemmas.

Definition 6. Define the Bellman adversary effectiveness operator $\mathcal{T}_c^*: \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ as:
$$(\mathcal{T}_c^* V_c^{\pi\circ\nu})(s) = \max_{\tilde s\in B_p^\epsilon(s)} \sum_{a\in A}\pi(a|\tilde s)\sum_{s'\in S} p(s'|s,a)\big[c(s,a,s') + \gamma V_c^{\pi\circ\nu}(s')\big]. \tag{59}$$

Definition 7.
Define the Bellman adversary reward stealthiness operator $\mathcal{T}_r^*: \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ as:
$$(\mathcal{T}_r^* V_r^{\pi\circ\nu})(s) = \max_{\tilde s\in B_p^\epsilon(s)} \sum_{a\in A}\pi(a|\tilde s)\sum_{s'\in S} p(s'|s,a)\big[r(s,a,s') + \gamma V_r^{\pi\circ\nu}(s')\big]. \tag{60}$$
Recall that $B_p^\epsilon(s)$ is the $\ell_p$ ball that constrains the perturbation range. The two definitions correspond to computing the values of the most effective and the most reward-stealthy attackers, analogous to the Bellman optimality operator in the literature. We then show their contraction properties via the following lemma:

Lemma 3. The Bellman operators $\mathcal{T}_c^*, \mathcal{T}_r^*$ are contractions under the sup-norm $\|\cdot\|_\infty$ and converge to their respective fixed points. The fixed point of $\mathcal{T}_c^*$ is $V_c^{\pi\circ\nu_{MC}} = \mathcal{T}_c^* V_c^{\pi\circ\nu_{MC}}$, and the fixed point of $\mathcal{T}_r^*$ is $V_r^{\pi\circ\nu_{MR}} = \mathcal{T}_r^* V_r^{\pi\circ\nu_{MR}}$.

To finish the proof of Lemma 3, we introduce another lemma:

Lemma 4. Suppose $\max_x h(x) \geq \max_x g(x)$ and denote $x_h^* = \arg\max_x h(x)$. Then:
$$\big|\max_x h(x) - \max_x g(x)\big| = \max_x h(x) - \max_x g(x) = h(x_h^*) - \max_x g(x) \leq h(x_h^*) - g(x_h^*) \leq \max_x |h(x) - g(x)|.$$

We then prove the Bellman contraction properties of Lemma 3:

Proof.
$$\big|(\mathcal{T}_f^* V_f^{\pi\circ\nu_1})(s) - (\mathcal{T}_f^* V_f^{\pi\circ\nu_2})(s)\big| = \Big|\max_{\tilde s\in B_p^\epsilon(s)}\sum_{a\in A}\pi(a|\tilde s)\sum_{s'\in S} p_{sa}^{s'}\big[f_{sa}^{s'} + \gamma V_f^{\pi\circ\nu_1}(s')\big] - \max_{\tilde s\in B_p^\epsilon(s)}\sum_{a\in A}\pi(a|\tilde s)\sum_{s'\in S} p_{sa}^{s'}\big[f_{sa}^{s'} + \gamma V_f^{\pi\circ\nu_2}(s')\big]\Big| \tag{61-62}$$
$$\leq \gamma \max_{\tilde s\in B_p^\epsilon(s)}\Big|\sum_{a\in A}\pi(a|\tilde s)\sum_{s'\in S} p_{sa}^{s'}\big[V_f^{\pi\circ\nu_1}(s') - V_f^{\pi\circ\nu_2}(s')\big]\Big| \tag{63}$$
$$= \gamma\,\Big|\sum_{a\in A}\pi(a|\tilde s^*)\sum_{s'\in S} p_{sa}^{s'}\big[V_f^{\pi\circ\nu_1}(s') - V_f^{\pi\circ\nu_2}(s')\big]\Big| \tag{64}$$
$$\leq \gamma \max_{s'\in S}\big|V_f^{\pi\circ\nu_1}(s') - V_f^{\pi\circ\nu_2}(s')\big| \tag{65}$$
$$= \gamma\,\big\|V_f^{\pi\circ\nu_1} - V_f^{\pi\circ\nu_2}\big\|_\infty, \tag{66}$$
where inequality (63) comes from Lemma 4, and $\tilde s^*$ in Eq. (64) denotes the corresponding argmax.
Since the above holds for any state $s$, we can also conclude that:
$$\big\|\mathcal{T}_f^* V_f^{\pi\circ\nu_1} - \mathcal{T}_f^* V_f^{\pi\circ\nu_2}\big\|_\infty \leq \gamma\,\big\|V_f^{\pi\circ\nu_1} - V_f^{\pi\circ\nu_2}\big\|_\infty. \tag{67}$$
After proving the contraction, we prove that the value functions of the MC and MR adversaries, $V_c^{\pi\circ\nu_{MC}}(s)$ and $V_r^{\pi\circ\nu_{MR}}(s)$, are the fixed points of $\mathcal{T}_c^*$ and $\mathcal{T}_r^*$, respectively:

Proof. Recall that the MC and MR adversaries are:
$$\nu_{MC}(s) = \arg\max_{\tilde s\in B_p^\epsilon(s)} \mathbb{E}_{\tilde a\sim\pi(\cdot|\tilde s)}\big[Q_c^\pi(s,\tilde a)\big], \qquad \nu_{MR}(s) = \arg\max_{\tilde s\in B_p^\epsilon(s)} \mathbb{E}_{\tilde a\sim\pi(\cdot|\tilde s)}\big[Q_r^\pi(s,\tilde a)\big].$$
Based on the value function definition, we have:
$$V_c^{\pi\circ\nu_{MC}}(s) = \mathbb{E}_{\tau\sim\pi\circ\nu_{MC},\, s_0=s}\Big[\sum_{t=0}^\infty \gamma^t c_t\Big] = \mathbb{E}_{\tau\sim\pi\circ\nu_{MC},\, s_0=s}\Big[c_0 + \gamma\sum_{t=1}^\infty \gamma^{t-1} c_t\Big]$$
$$= \sum_{a\in A}\pi(a|\nu_{MC}(s))\sum_{s'\in S} p_{sa}^{s'}\Big(c(s,a,s') + \gamma\,\mathbb{E}_{\tau\sim\pi\circ\nu_{MC},\, s_1=s'}\Big[\sum_{t=1}^\infty \gamma^{t-1} c_t\Big]\Big)$$
$$= \sum_{a\in A}\pi(a|\nu_{MC}(s))\sum_{s'\in S} p_{sa}^{s'}\big[c(s,a,s') + \gamma V_c^{\pi\circ\nu_{MC}}(s')\big]$$
$$= \max_{\tilde s\in B_p^\epsilon(s)}\sum_{a\in A}\pi(a|\tilde s)\sum_{s'\in S} p_{sa}^{s'}\big[c(s,a,s') + \gamma V_c^{\pi\circ\nu_{MC}}(s')\big] \tag{71}$$
$$= (\mathcal{T}_c^* V_c^{\pi\circ\nu_{MC}})(s),$$
where Eq. (71) follows from the MC attacker definition. Therefore, the cost value function of the MC attacker, $V_c^{\pi\circ\nu_{MC}}$, is the fixed point of the Bellman adversary effectiveness operator $\mathcal{T}_c^*$. By the same procedure (replacing $\nu_{MC}, \mathcal{T}_c^*$ with $\nu_{MR}, \mathcal{T}_r^*$), we can prove that the reward value function of the MR attacker, $V_r^{\pi\circ\nu_{MR}}$, is the fixed point of the Bellman adversary stealthiness operator $\mathcal{T}_r^*$.

With Lemma 3 and the proof above, we can easily obtain the conclusions in Remark 1: if the trained policy is safe under the MC or the MR attacker, then it is guaranteed to be feasible or non-tempting, respectively, under any $B_p^\epsilon(s)$-bounded adversarial perturbations, since no other attacker can achieve higher cost or reward returns. This provides theoretical guarantees for the safety of adversarial training under the MC and MR attackers: the adversarially trained agents are guaranteed to be safe or non-tempting under any bounded adversarial perturbations.
We believe the above theoretical guarantees are crucial for the success of our adversarially trained agents, because our ablation studies show that adversarial training cannot achieve the desired performance with other attackers.

B REMARKS

B.1 REMARKS ON THE SAFE RL SETTING, STEALTHINESS, AND ASSUMPTIONS

Safe RL setting regarding the reward and the cost. We consider safe RL problems with separate task rewards and constraint violation costs, i.e., independent reward and cost functions. Combining the cost and reward into a single scalar metric, which can be viewed as manually selecting Lagrange multipliers, may work in simple problems. However, it lacks interpretability, since it is hard to explain what a single scalar value means, and it requires good domain knowledge of the problem: the weights between costs and rewards must be carefully balanced, which is difficult when the task reward already contains many objectives and factors. On the other hand, separating the costs from the rewards makes it easy to monitor safety performance and task performance respectively, which is more interpretable and applicable across different cost constraint thresholds.

Determining the temptation status of a safe RL problem. According to Def. 1-3, the absence of tempting policies indicates a non-tempting safe RL problem, where the optimal policy achieves the highest reward while satisfying the constraint. However, for the safe deployment problem that only cares about safety after training, the absence of tempting policies means that the cost signal is unnecessary for training, because one can simply focus on maximizing the reward. As long as the most rewarding policies are found, the safety requirement is automatically satisfied, and thus many standard RL algorithms can solve the problem. Since safe RL methods are not required in this setting, non-tempting tasks are usually not discussed in safe RL papers and are also not the focus of this paper. From another perspective, since a safe RL problem is specified by the cost threshold $\kappa$, one can tune the threshold to change the temptation status. For instance, if $\kappa > \max_{s,a,s'} c(s,a,s')/(1-\gamma)$, an upper bound of the cumulative cost, then the problem is guaranteed to be non-tempting because all policies satisfy the constraint, and thus we can use standard RL methods to solve it.
Independently estimated reward and cost value functions assumption. Similar to most existing safe RL algorithms, such as PPO-Lagrangian (Ray et al., 2019; Stooke et al., 2020), CPO (Achiam et al., 2017), FOCOPS (Zhang et al., 2020b), and CVPO (Liu et al., 2022), we consider policy-based (or actor-critic-based) safe RL in this work. There are two phases in this type of approach: policy evaluation and policy improvement. In the policy evaluation phase, the reward and cost value functions $V_r^\pi, V_c^\pi$ are evaluated separately. At this stage, the Bellman operators for the reward and cost values are independent; therefore, they are contractions (Theorem 4) and converge to their fixed points separately. This is a common treatment in safe RL for training the policy: first evaluate the reward and cost values independently via Bellman equations, and then optimize the policy based on the learned value estimates. Our theoretical analysis of robustness is developed under this setting.

(Reward) Stealthy attack for safe RL. As discussed in Sec. 3.2, the stealthiness concept in supervised learning refers to the requirement that an adversarial attack be covert to avoid easy detection. While we use the perturbation set $B_p^\epsilon$ to ensure stealthiness regarding the observation corruption, another level of stealthiness, regarding the task reward performance, is interesting and worth discussing. In some real-world applications, task-related metrics (such as velocity, acceleration, and goal distance) are usually easy to monitor from sensors. However, safety metrics can be sparse and hard to monitor until constraints are broken, such as colliding with obstacles or entering hazardous states, which are determined by binary indicator signals.
Therefore, a dramatic drop in task-related metrics (reward) may be easily detected by the agent, while constraint violation signals can be hard to detect until catastrophic failures occur. An unstealthy attack in this scenario may decrease the reward substantially and prevent the agent from finishing the task, which can warn the agent that it is under attack and thus lead to a failed attack. On the contrary, a stealthy attack maintains the agent's task reward so that the agent is unaware of the attack based on "good" task metrics, while successfully inducing constraint violations. In other words, a stealthy attack should corrupt the policy to be tempted, since all tempting policies are high-rewarding yet unsafe.

Stealthiness definition of the attacks.

There is an alternative definition of stealthiness that considers any change in the reward, regardless of whether it increases or decreases. This two-sided stealthiness is stricter than the one-sided, lower-bound definition used in this paper. However, in a practical system design, people usually set a threshold on the lower bound of the task performance to determine whether the system functions properly, rather than specifying an upper bound, because it is tricky to decide what level of performance is suspiciously good. For instance, an autonomous vehicle that fails to reach the destination within a certain amount of time may be flagged as abnormal, while one that reaches the goal faster may not, since it is hard to specify a threshold for overly good performance. Therefore, increasing the reward by some amount may not attract the same attention as decreasing it by the same amount. In addition, finding a stealthy and effective attacker with minimal reward change would be a much harder problem under the two-sided definition, since the candidate solutions are far fewer and the optimization problem is harder to formulate. We believe this is an interesting point worth investigating in the future, but we focus on the one-sided definition of stealthiness in this work.

B.2 REMARKS OF THE FAILURE OF SA-PPOL(MC/MR) BASELINES

The detailed algorithm of SA-PPOL (Zhang et al., 2020a) can be found in Appendix C.5. The basic idea can be summarized via the following equation:
$$\ell_\nu(s) = -D_{KL}\big[\pi(\cdot|s)\,\|\,\pi_\theta(\cdot|\nu(s))\big], \tag{73}$$
which aims to minimize the divergence between the policy distributions under the original and the corrupted states. Note that we only optimize (compute gradients for) $\pi_\theta(\cdot|\nu(s))$ rather than $\pi(\cdot|s)$, since we view $\pi(\cdot|s)$ as the "ground-truth" target action distribution. Adding the above KL regularizer to the original PPOL loss yields the SA-PPOL algorithm. We observe that the original SA-PPOL, which uses the MAD attacker as the adversary, learns well in most tasks, though it is not safe under strong attacks. However, SA-PPOL with the MR or MC adversary often fails to learn a meaningful policy in many tasks, especially with the MR attacker. The reason is the following: the MR attacker seeks high-rewarding adversarial states, while the KL loss forces the policy distribution at these high-rewarding adversarial states to match the policy distribution at the original, relatively lower-reward states. As a result, training can fail due to the wrong policy optimization direction and prohibited exploration of high-rewarding states. Since the MC attacker can also lead to high-rewarding adversarial states due to the existence of tempting policies, we may also observe failed training with the MC attacker.
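As a minimal sketch of how the regularizer in Eq. (73) can be implemented, the snippet below computes the closed-form KL term for a diagonal Gaussian policy head, treating the clean-state distribution as a fixed target (in an autodiff framework, one would detach or stop-gradient that branch). The function name and the (mean, std) interface are illustrative assumptions, not the paper's code.

```python
import numpy as np

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """Closed-form KL(N(mu_p, diag std_p^2) || N(mu_q, diag std_q^2)),
    summed over action dimensions. In SA-PPOL, (mu_p, std_p) comes from the
    clean state s (treated as a constant target) and (mu_q, std_q) from the
    perturbed state nu(s); only the latter receives gradients."""
    var_p, var_q = std_p ** 2, std_q ** 2
    kl = np.log(std_q / std_p) + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q) - 0.5
    return float(np.sum(kl))
```

The KL is zero when the two distributions coincide and grows as the perturbed-state policy drifts away, which is exactly the quantity the regularizer penalizes.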

C IMPLEMENTATION DETAILS

C.1 MC AND MR ATTACKERS IMPLEMENTATION

We use the gradient of the state-action value function $Q(s,a)$ to adversarially update the state over $K$ steps ($Q = Q_r^\pi$ for MR and $Q = Q_c^\pi$ for MC):
$$s^{k+1} = \mathrm{Proj}\big[s^k - \eta \nabla_{s^k} Q(s^0, \pi(s^k))\big], \quad k = 0,\dots,K-1, \tag{74}$$
where $\mathrm{Proj}[\cdot]$ is the projection onto $B_p^\epsilon(s^0)$, $\eta$ is the learning rate, and $s^0$ is the state under attack. Since the Q-value function and the policy are parametrized by neural networks, we can backpropagate the gradient from $Q_c$ or $Q_r$ to $s^k$ via $\pi(\tilde a|s^k)$, which can be solved efficiently by optimizers such as ADAM. The procedure is related to the Projected Gradient Descent (PGD) attack and to deterministic policy gradient methods such as DDPG and TD3 in the literature, except that the optimization variables are the state perturbations rather than the policy parameters. Note that we use the gradient of $Q(s^0, \pi(s^k))$ rather than $Q(s^k, \pi(s^k))$ to make the optimization more stable, since the Q function may not generalize well to unseen states in practice. This technique for computing adversarial attacks is widely used in the standard RL literature and has been shown to be successful, e.g., in (Zhang et al., 2020a). The implementation of the MC and MR attackers is shown in Algorithm 2. Empirically, this gradient-based method converges within a few iterations and under 10 ms, as shown in Fig. 3, which greatly improves adversarial training efficiency.

Algorithm 2: MC and MR attacker
Input: a policy π under attack, the corresponding Q network, initial state s⁰, attack steps K, attacker learning rate η, perturbation range ϵ, and two early-stopping thresholds ϵ_Q and ϵ_s
Output: an adversarial state s
1: for k = 1 to K do
2:   g_k = ∇_{s^{k-1}} Q(s⁰, π(s^{k-1}))
3:   s^k ← Proj[s^{k-1} - η g_k]
4:   Compute δQ = |Q(s⁰, π(s^k)) - Q(s⁰, π(s^{k-1}))| and δs = |s^k - s^{k-1}|
5:   if δQ < ϵ_Q and δs < ϵ_s then
6:     break (early stopping)
7:   end if
8: end for
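A minimal NumPy sketch of the projected-gradient loop in Algorithm 2 is given below. The callable `q_grad` is a hypothetical stand-in for backpropagation through the critic and policy networks, and the projection is shown for the ℓ∞ ball; the update ascends Q, matching the attacker's objective of maximizing the cost (MC) or reward (MR) value.

```python
import numpy as np

def mc_mr_attack(q_grad, s0, eps, eta=0.05, steps=10, tol_s=1e-8):
    """Projected-gradient sketch of the MC/MR attacker (Algorithm 2).

    q_grad(s0, s): gradient of Q(s0, pi(s)) w.r.t. the perturbed state s
    (a hypothetical callable; obtained by autodiff in practice).
    Each iterate is projected back onto the l_inf ball B_eps(s0), with
    early stopping when the state update becomes tiny (the delta_s check)."""
    s = np.asarray(s0, dtype=float).copy()
    for _ in range(steps):
        s_new = s + eta * q_grad(s0, s)              # ascend Q to maximize cost/reward
        s_new = np.clip(s_new, s0 - eps, s0 + eps)   # Proj onto B_eps(s0)
        if np.max(np.abs(s_new - s)) < tol_s:        # early stopping
            s = s_new
            break
        s = s_new
    return s
```

With `q_grad` derived from $Q_c$ this yields the MC attacker, and with $Q_r$ the MR attacker; the δQ early-stopping check of Algorithm 2 can be added analogously.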

C.2 PPO-LAGRANGIAN ALGORITHM

The objective of PPO (clipped) has the form (Schulman et al., 2017):
$$\ell_{ppo} = \min\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}\, A^{\pi_{\theta_k}}(s,a),\ \mathrm{clip}\Big(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)},\ 1-\epsilon,\ 1+\epsilon\Big)\, A^{\pi_{\theta_k}}(s,a)\right).$$
During adversarial training, each iteration also updates the critics $Q_c$ and $Q_r$ by Eq. (79) and Eq. (80), Polyak-averages the target networks by Eq. (81), updates the adversary based on $Q_c$ and $Q_r$ using Algorithm 2, and linearly increases the current perturbation range until it reaches the maximum ϵ.
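The clipped surrogate above, combined with a Lagrangian penalty on the cost advantage, can be sketched as follows. The specific combination $(A_r - \lambda A_c)/(1+\lambda)$ is one common PPO-Lagrangian variant and an assumption here, not necessarily the exact form used in the paper.

```python
import numpy as np

def ppo_lag_loss(logp_new, logp_old, adv_r, adv_c, lam, clip_eps=0.2):
    """Clipped PPO surrogate on a Lagrangian-penalized advantage (a sketch).

    logp_new/logp_old: per-sample log pi_theta(a|s) and log pi_theta_k(a|s).
    adv_r/adv_c: reward and cost advantage estimates.
    lam: current Lagrange multiplier. Returns a scalar loss to minimize."""
    ratio = np.exp(logp_new - logp_old)                  # pi_theta / pi_theta_k
    adv = (adv_r - lam * adv_c) / (1.0 + lam)            # penalized advantage
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -float(np.mean(np.minimum(unclipped, clipped)))  # negate for minimization
```

With λ = 0 this reduces to the standard clipped PPO objective; as λ grows, the cost advantage dominates and the update pushes the policy toward constraint satisfaction.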

C.4 MAD ATTACKER IMPLEMENTATION

The full algorithm of the MAD attacker is presented in Algorithm 4. We use the same SGLD optimizer as in Zhang et al. (2020a) to maximize the KL divergence. The objective of the MAD attacker is defined as:
$$\ell_{MAD}(s) = -D_{KL}\big[\pi(\cdot|s^0)\,\|\,\pi_\theta(\cdot|s)\big].$$
Note that we backpropagate the gradient to the policy parameters θ from the corrupted state s instead of the original state s⁰. The full algorithm is shown below:

Algorithm 4: MAD attacker
Input: a policy π under attack, the corresponding Q(s, a) network, initial state s⁰, attack steps K, attacker learning rate η, the (inverse) temperature parameter β for SGLD, and two early-stopping thresholds ϵ_Q and ϵ_s
Output: an adversarial state s
1: for k = 1 to K do
2:   Sample υ ∼ N(0, 1)
3:   g_k = ∇ℓ_MAD(s^{k-1}) + √(2/(βη)) υ
4:   s^k ← Proj[s^{k-1} - η g_k]
5:   Compute δQ = |Q(s⁰, π(s^k)) - Q(s⁰, π(s^{k-1}))| and δs = |s^k - s^{k-1}|
6:   if δQ < ϵ_Q and δs < ϵ_s then break (early stopping)
7: end for

The SA-PPO-Lagrangian actor update proceeds as follows: for optimization steps m = 1, ..., M, compute the KL robustness regularizer $\ell_{KL} = D_{KL}(\pi(s)\,\|\,\pi_\theta(s))$ with no gradient through $\pi(s)$; compute the PPO-Lag loss $\ell_{ppol}(s, \pi_\theta, r, c)$ by Eq. (76); combine them with a weight β as $\ell = \ell_{ppol}(s, \pi_\theta, r, c) + \beta\,\ell_{KL}$; and update the actor θ ← θ − α∇_θ ℓ. The critics are then updated based on the samples {(s, a, s', r, c)}_N. The SA-PPO-Lagrangian algorithm thus adds a KL robustness regularizer to robustify the trained policy. Choosing different adversaries ν yields different baseline algorithms: the original SA-PPOL (Zhang et al., 2020a) adopts the MAD attacker, while we conduct ablation studies using the MR and MC attackers, which yields the SA-PPOL(MR) and SA-PPOL(MC) baselines respectively.
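A NumPy sketch of the SGLD loop in Algorithm 4 is shown below. The callable `kl_grad` is a hypothetical stand-in for the autodiff gradient of ℓ_MAD, and the noise scale √(2/(βη)) follows the standard SGLD update.

```python
import numpy as np

def mad_attack(kl_grad, s0, eps, eta=0.05, steps=10, beta=1e7, seed=0):
    """SGLD sketch of the MAD attacker: descend the negated-KL objective
    l_MAD(s) = -KL[pi(.|s0) || pi_theta(.|s)] with Langevin noise, projecting
    each iterate back onto the l_inf ball B_eps(s0)."""
    rng = np.random.default_rng(seed)
    s = np.asarray(s0, dtype=float).copy()
    for _ in range(steps):
        noise = np.sqrt(2.0 / (beta * eta)) * rng.standard_normal(s.shape)
        g = kl_grad(s0, s) + noise                    # stochastic gradient with SGLD noise
        s = np.clip(s - eta * g, s0 - eps, s0 + eps)  # Proj onto B_eps(s0)
    return s
```

The early-stopping checks of Algorithm 4 (δQ and δs thresholds) are omitted here for brevity but slot into the loop exactly as in the MC/MR sketch.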

C.6 IMPROVED ADAPTIVE MAD (AMAD) ATTACKER BASELINE

To motivate the design of the AMAD baseline, denote $P^\pi(s'|s) = \int p(s'|s,a)\,\pi(a|s)\,da$ as the state transition kernel and $p_t^\pi(s) = p(s_t = s|\pi)$ as the probability of visiting state $s$ at time $t$ under policy $\pi$, where $p_t^\pi(s') = \int P^\pi(s'|s)\,p_{t-1}^\pi(s)\,ds$. The discounted future state distribution $d^\pi(s)$ is then defined as (Kakade, 2003):
$$d^\pi(s) = (1-\gamma)\sum_{t=0}^\infty \gamma^t\, p_t^\pi(s),$$
which allows us to represent the value functions compactly:
$$V_f^\pi(\mu_0) = \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^\pi,\, a\sim\pi,\, s'\sim p}\big[f(s,a,s')\big] = \frac{1}{1-\gamma}\int_{S} d^\pi(s)\int_{A}\pi(a|s)\int_{S} p(s'|s,a)\,f(s,a,s')\,ds'\,da\,ds, \quad f\in\{r,c\}.$$
Based on Lemma 2, the optimal policy $\pi^*$ in a tempting safe RL setting satisfies:
$$\frac{1}{1-\gamma}\int_{S} d^{\pi^*}(s)\int_{A}\pi^*(a|s)\int_{S} p(s'|s,a)\,c(s,a,s')\,ds'\,da\,ds = \kappa.$$
We can see that performing the MAD attack in low-risk regions, where the $p(s'|s,a)\,c(s,a,s')$ values are small, may not be effective: the agent may not even be close to the safety boundary. On the other hand, perturbing $\pi$ where $p(s'|s,a)\,c(s,a,s')$ is large has a higher chance of causing constraint violations. Therefore, we improve MAD into the Adaptive MAD attacker, which only attacks the agent in high-risk regions (determined by the cost value function and a threshold ξ). The implementation of AMAD is shown in Algorithm 6. Given a batch of states $\{s\}_N$, we compute the cost values $\{V_c^\pi(s)\}_N$ and sort them in ascending order. We then select a certain percentile of $\{V_c^\pi(s)\}_N$ as the threshold ξ and attack only the states whose cost value exceeds ξ.

We use the Bullet safety gym (Gronauer, 2022) environments for this set of experiments. In the Circle tasks, the goal is for an agent to move along the circumference of a circle while remaining within a safety region smaller than the radius of the circle.
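The threshold selection described above can be sketched as follows; the function names and the per-state gating interface are illustrative, not the paper's code.

```python
import numpy as np

def amad_threshold(cost_values, attack_fraction=0.5):
    """Pick the AMAD threshold xi as a percentile of a batch of cost values
    {V_c(s)}_N, so that only the `attack_fraction` highest-risk states
    are attacked."""
    return float(np.quantile(cost_values, 1.0 - attack_fraction))

def amad_step(attack_fn, s, v_c, xi):
    """Apply the underlying MAD attack only in the high-risk region V_c(s) >= xi;
    otherwise return the observation unperturbed."""
    return attack_fn(s) if v_c >= xi else s
```

Varying `attack_fraction` reproduces the attacking-fraction sweep used in the AMAD ablations.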
The reward and cost functions are defined as:
$$r(s) = \frac{-y\, v_x + x\, v_y}{1 + \big|\sqrt{x^2+y^2} - r\big|} + r_{robot}(s), \qquad c(s) = \mathbf{1}(|x| > x_{lim}),$$
where $(x, y)$ is the position of the agent on the plane, $v_x, v_y$ are the velocities of the agent along the $x$ and $y$ directions, $r$ is the radius of the circle, $x_{lim}$ specifies the range of the safety region, and $r_{robot}(s)$ is a robot-specific reward. For example, an ant robot gains reward if its feet do not collide with each other. In the Run tasks, the goal is for an agent to move as far as possible within the safety region while respecting a speed limit. The reward and cost functions are defined as:
$$r(s) = \sqrt{(x_{t-1}-g_x)^2 + (y_{t-1}-g_y)^2} - \sqrt{(x_t-g_x)^2 + (y_t-g_y)^2} + r_{robot}(s),$$
$$c(s) = \mathbf{1}(|y| > y_{lim}) + \mathbf{1}\Big(\sqrt{v_x^2 + v_y^2} > v_{lim}\Big),$$
where $v_{lim}$ is the speed limit and $(g_x, g_y)$ is the position of a fictitious target. The reward is the difference between the distance to the target at the previous timestep and at the current timestep.
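The two task definitions above translate directly into code. A small sketch is given below (not the Bullet safety gym source; the robot-specific term defaults to zero):

```python
import math

def circle_reward_cost(x, y, vx, vy, radius, x_lim, r_robot=0.0):
    """Circle task: reward for circulating along the circle of radius `radius`,
    unit cost when leaving the safety region |x| <= x_lim."""
    reward = (-y * vx + x * vy) / (1.0 + abs(math.hypot(x, y) - radius)) + r_robot
    cost = 1.0 if abs(x) > x_lim else 0.0
    return reward, cost

def run_reward_cost(prev_pos, pos, goal, vx, vy, y_lim, v_lim, r_robot=0.0):
    """Run task: reward is the decrease in distance to the fictitious target;
    cost fires on leaving |y| <= y_lim or exceeding the speed limit v_lim."""
    d_prev = math.hypot(prev_pos[0] - goal[0], prev_pos[1] - goal[1])
    d_now = math.hypot(pos[0] - goal[0], pos[1] - goal[1])
    reward = d_prev - d_now + r_robot
    cost = (1.0 if abs(pos[1]) > y_lim else 0.0) \
         + (1.0 if math.hypot(vx, vy) > v_lim else 0.0)
    return reward, cost
```

Note how the Circle reward peaks when the agent moves tangentially on the circle itself (the denominator equals 1 there), while the cost is a pure indicator, illustrating the separation of task and safety signals discussed in Appendix B.1.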

C.8 HYPER-PARAMETERS

In all experiments, we use Gaussian policies whose mean vectors are the outputs of neural networks and whose variances are separate learnable parameters. For the Car-Run experiment, the policy networks and Q networks consist of two hidden layers of size (128, 128). For the other experiments, they have two hidden layers of size (256, 256). In both cases, the ReLU activation function is used. We use a discount factor of γ = 0.995, a GAE-λ of λ_GAE = 0.97 for estimating the advantages, a KL-divergence step size of δ_KL = 0.01, and a clipping coefficient of 0.02. The PID parameters for the Lagrange multiplier are K_p = 0.1, K_I = 0.003, and K_D = 0.001. The learning rate of the adversarial attackers (MAD, AMAD, MC, and MR) is 0.05. The number of optimization steps is 60 for MAD and AMAD and 200 for the MC and MR attackers. The threshold ξ for AMAD is 0.1. The complete hyperparameters used in the experiments are shown in Table 2. We choose a larger perturbation range for the Car robot tasks because they are simpler and easier to train.

The experiments for the minimum reward attack on our method are shown in Table 3. We can see that the minimum reward attack has no effect on the cost, which remains below the constraint violation threshold. Besides, we adopted one SOTA attack method (MAD) from standard RL as a baseline and improved it (AMAD) for the safe RL setting; the results, however, demonstrate that they do not perform well. Hence, attack methods and robust training methods that perform well in standard RL settings do not necessarily perform well in the safe RL setting.

Table 3: Evaluation results under the Minimum Reward attacker. Each value is reported as the mean and the difference from the natural performance over 50 episodes and 5 seeds.

The experiment results of FOCOPS (Zhang et al., 2020b) are shown in Table 4.
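The PID-controlled Lagrange multiplier mentioned above (K_p = 0.1, K_I = 0.003, K_D = 0.001) can be sketched as follows. This follows the general PID-Lagrangian recipe of Stooke et al. (2020); the clamping and derivative details are assumptions rather than the paper's exact implementation.

```python
class PIDLagrangian:
    """PID update of the Lagrange multiplier from episode costs (a sketch)."""

    def __init__(self, kp=0.1, ki=0.003, kd=0.001, cost_limit=25.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0      # running integral of constraint violation
        self.prev_cost = 0.0     # previous episode cost, for the derivative term
        self.lam = 0.0

    def update(self, ep_cost):
        err = ep_cost - self.cost_limit                 # > 0 when constraint violated
        self.integral = max(self.integral + err, 0.0)   # anti-windup clamp
        deriv = max(ep_cost - self.prev_cost, 0.0)      # penalize rising cost only
        self.prev_cost = ep_cost
        self.lam = max(self.kp * err + self.ki * self.integral + self.kd * deriv, 0.0)
        return self.lam
```

The proportional term reacts to the current violation, the integral accumulates persistent violations, and the derivative damps oscillations of the multiplier, which is the motivation for PID control over a plain gradient-ascent multiplier update.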
We trained FOCOPS without adversarial attackers (FOCOPS-vanilla) and with our adversarial training methods (FOCOPS(MC) and FOCOPS(MR)) under the MC and MR attackers, respectively. We can see that the vanilla method is safe in noise-free environments; however, it is no longer safe under the proposed adversarial attacks. In addition, adversarial training helps to improve robustness and makes the FOCOPS agents much safer under strong attacks, which means that our adversarial training method generalizes to different safe RL methods. We evaluate the performance of the MAD and AMAD adversaries by attacking well-trained PPO-Lagrangian policies, keeping the policies' model weights fixed for all attackers. The comparison is in Fig. 4. We vary the attacking fraction (determined by ξ) to thoroughly study the effectiveness of the AMAD attacker. We can see that the AMAD attacker is more effective because the cost increases significantly as the perturbation grows, while the reward is well maintained. This validates our hypothesis that attacking the agent in high-risk regions is more effective and stealthy. The results are shown in Table 5. The last column shows the average rewards and costs over all 5 attackers (Random, MAD, AMAD, MC, MR). Our agent (ADV-PPOL) with adversarial training is robust against all 5 attackers and achieves the lowest cost. We can also see that the AMAD attacker is more effective than MAD, since the cost under the AMAD attacker is higher than under the MAD attacker. Table 5: Evaluation results of natural performance (no attack) and under the Random and MAD attackers. The average column shows the average rewards and costs over all 5 attackers (Random, MAD, AMAD, MC, and MR). Our methods are ADV-PPOL(MC/MR). Each value is reported as mean ± standard deviation over 50 episodes and 5 seeds.
We shade the two lowest-cost agents under each attacker column and break ties based on rewards, excluding failing agents (whose natural rewards are less than 30% of PPOL-vanilla's). We mark failing agents with ⋆. The experimental results of CVPO (Liu et al., 2022) are shown in Table 8. We can see that the vanilla version is not robust against adversarial attackers, since the cost is much larger after being attacked. Based on the conducted experiments with SAC-Lagrangian, FOCOPS, and CVPO, we conclude that the vanilla versions all suffer from vulnerability issues: though they are safe in noise-free environments, they are no longer safe under strong MC and MR attacks, which validates that our proposed methods and theories apply to the general safe RL setting.






Figure 1: Illustration of definitions via a mapping from the policy space to the metric plane $\Pi \to \mathbb{R}^2$, where the x-axis is the reward return and the y-axis is the cost return. A point on the metric plane denotes the corresponding policies, i.e., the point $(v_r, v_c)$ represents the policies $\{\pi \in \Pi \mid V^\pi_r(\mu_0) = v_r,\ V^\pi_c(\mu_0) = v_c\}$. The blue and green circles denote the policy spaces of two safe RL problems.

The meta adversarial training algorithm is shown in Algo. 1.

Algorithm 1: Adversarial safe RL training meta algorithm
Input: safe RL learner, adversary scheduler
Output: observationally robust policy $\pi$
1: Initialize policy $\pi \in \Pi$ and adversary $\nu: \mathcal{S} \to \mathcal{S}$
2: for each training epoch $n = 1, \dots, N$ do
3:   Rollout trajectories: $\tau = \{s_0, \tilde{a}_0, \dots\}_T$, $\tilde{a}_t \sim \pi(a \mid \nu(s_t))$
4:   Run safe RL learner: $\pi \leftarrow \mathrm{learner}(\tau, \Pi)$
5:   Update adversary: $\nu \leftarrow \mathrm{scheduler}(\tau, \pi, n)$
6: end for
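The meta algorithm can be sketched in plain Python as below. The `learner`, `scheduler`, and environment interfaces are placeholders for any safe RL learner and adversary scheduler, not the paper's actual code:

```python
def adversarial_safe_rl(env, policy, adversary, learner, scheduler,
                        n_epochs, horizon):
    """Meta adversarial training loop: the agent always acts on
    adversarially perturbed observations, while the environment
    itself evolves on the true state."""
    for n in range(n_epochs):
        traj, s = [], env.reset()
        for _ in range(horizon):
            s_adv = adversary(s)               # observation perturbation nu(s)
            a = policy(s_adv)                  # a_t ~ pi(a | nu(s_t))
            s_next, r, c, done = env.step(a)   # true-state dynamics
            traj.append((s, s_adv, a, r, c))
            s = s_next
            if done:
                break
        policy = learner(traj, policy)          # safe RL policy update
        adversary = scheduler(traj, policy, n)  # adversary schedule update
    return policy
```

The key design point is that the perturbation only corrupts what the policy observes; rewards, costs, and transitions are still generated by the unperturbed state.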

Figure 2: Reward and cost curves of all 5 attackers evaluated on well-trained vanilla PPO-Lagrangian models w.r.t. the perturbation range ϵ. The curves are averaged over 50 episodes and 5 seeds, where the solid lines are the mean and the shadowed areas are the standard deviation. The dashed line is the cost without perturbations.

Based on the above definitions, we can also derive the transition function and reward function of the new MDP (Zhang et al., 2020a):
$$\hat{p}(s' \mid s, \hat{a}) = \sum_a \pi(a \mid \hat{a})\, p(s' \mid s, a), \tag{20}$$
$$\hat{R}_f(s, \hat{a}, s') = \frac{\sum_a \pi(a \mid \hat{a})\, p(s' \mid s, a)\, f(s, a, s')}{\sum_a \pi(a \mid \hat{a})\, p(s' \mid s, a)}.$$
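For a small discrete MDP, Eq. (20) and the induced reward can be checked numerically. The tensor layout `P[s, a, s']` and the function name are our conventions for this sketch:

```python
import numpy as np

def perturbed_mdp(P, pi_a_given_ahat, F):
    """Marginalize the agent's policy out of a perturbed input a_hat:
    p_hat(s'|s, a_hat) = sum_a pi(a|a_hat) p(s'|s, a), and the induced
    value is the matching importance-weighted average of f(s, a, s')."""
    # P[s, a, s'] transition tensor, F[s, a, s'] reward/cost values,
    # pi_a_given_ahat[a] = pi(a | a_hat) for one fixed a_hat.
    p_hat = np.einsum('a,sat->st', pi_a_given_ahat, P)
    num = np.einsum('a,sat,sat->st', pi_a_given_ahat, P, F)
    r_hat = num / np.maximum(p_hat, 1e-12)  # avoid division by zero
    return p_hat, r_hat
```

By construction, each row of `p_hat` still sums to one, and a constant `f` is preserved by the weighted average.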

$\sum_{s,a} \big(c(s, a, s') + \gamma V^\pi_c(s')\big)$. (36) By definition, $D_{TV}[\pi(\cdot \mid \nu(s)) \,\|\, \pi(\cdot \mid s)] = \sum_{a \in \mathcal{A}} |\pi(a \mid \nu(s)) - \pi(a \mid s)|$, and $c(s, a, s') = 0$ for $s' \in \mathcal{S}_c$. Therefore, we have

Algorithm 4: Robust PPO-Lagrangian
Input: rollouts $T$, policy optimization steps $M$, PPO-Lag loss function $\ell_{ppol}(s, \pi_\theta, r, c)$, adversary function $\nu(s)$, policy parameter $\theta$, critic parameters $\phi_r$ and $\phi_c$, target critic parameters $\phi'_r$ and $\phi'_c$
1: Initialize policy parameters and critic parameters
2: for each training iteration do
3:   Rollout $T$ trajectories by $\pi_\theta \circ \nu$ from the environment: $\{(\nu(s), a, \nu(s'), r, c)\}_N$
7:   Compute the PPO-Lag loss $\ell_{ppol}(s, \pi_\theta, r, c)$ by Eq. (76)
8:   Update actor: $\theta \leftarrow \theta - \alpha \nabla_\theta \ell_{ppol}$ based on samples $\{(s, a, s', r, c)\}_N$
11:  Update adversary scheduler
12: end for

SA-PPO-LAGRANGIAN BASELINE

Algorithm 5: SA-PPO-Lagrangian
Input: rollouts $T$, policy optimization steps $M$, PPO-Lag loss function $\ell_{ppol}(s, \pi_\theta, r, c)$, adversary function $\nu(s)$
Output: policy $\pi_\theta$
1: Initialize policy parameters and critic parameters
2: for each training iteration do
3:   Rollout $T$ trajectories by $\pi_\theta$ from the environment: $\{(s, a, s', r, c)\}_N$

Figure 4: Reward and cost under the MAD and AMAD attackers.

The proposed adversarial training methods (ADV-PPOL) consistently outperform the baselines in safety, achieving the lowest costs while maintaining high rewards in most tasks. The comparison with PPOL-random indicates that the MC and MR attackers are essential ingredients of adversarial training. Although SA-PPOL agents maintain reward very well, they are not safe in terms of constraint satisfaction under adversarial perturbations in most environments. Evaluation results of natural performance (no attack) and under 3 attackers. Our methods are ADV-PPOL(MC/MR). Each value is reported as mean ± standard deviation over 50 episodes and 5 seeds. We shade the two lowest-cost agents under each attacker column and break ties based on rewards, excluding failing agents (whose natural rewards are less than 30% of PPOL-vanilla's). We mark failing agents with ⋆.

Algorithm 6: AMAD attacker
Input: a batch of states $\{s\}_N$, threshold $\xi$, a policy $\pi$ under attack, the corresponding $Q(s, a)$ network, initial state $s_0$, attack steps $K$, attacker learning rate $\eta$, the (inverse) temperature parameter $\beta$ for SGLD, two thresholds $\epsilon_Q$ and $\epsilon_s$ for early stopping
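A simplified sketch of the AMAD idea — perturb only states whose cost value lies in the top ξ-fraction — is given below. It uses plain projected gradient ascent with an MSE surrogate on the policy output; the actual attacker uses SGLD noise, a divergence objective on action distributions, and the early-stopping thresholds listed above, so treat this as an assumption-laden approximation:

```python
import torch

def amad_attack(states, policy, q_cost_value, xi, eps, steps=10, lr=0.05):
    """AMAD sketch: MAD-style perturbation applied only to the
    high-risk states (top-xi fraction by cost value estimate)."""
    with torch.no_grad():
        risk = q_cost_value(states)                    # per-state risk score
        cutoff = torch.quantile(risk, 1.0 - xi)        # top-xi threshold
        mask = (risk >= cutoff).float().unsqueeze(-1)  # attack high-risk only
        clean = policy(states)                         # reference outputs
    # random init so the divergence gradient is nonzero at the start
    delta = torch.empty_like(states).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        div = torch.nn.functional.mse_loss(policy(states + delta), clean)
        grad = torch.autograd.grad(div, delta)[0]
        with torch.no_grad():
            delta += lr * grad.sign()   # ascend the divergence
            delta.clamp_(-eps, eps)     # project onto the L_inf ball
    return states + mask * delta.detach()
```

Low-risk states pass through unperturbed, which is what makes the attack both stealthy and sample-efficient.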

Table 2: Hyperparameters for all the environments (Car-Run, Drone-Run, Ant-Run, Car-Circle, Drone-Circle, Ant-Circle).

All the experiments are performed on a server with an AMD EPYC 7713 64-core processor. For each experiment, we use 4 CPUs to train each agent, implemented in PyTorch; the training time varies from 4 hours (Car-Run) to 7 days (Ant-Circle). Video demos are available at: https://sites.google.com/view/robustsaferl/home

Table 4: Evaluation results of FOCOPS under natural performance (no attack) and under the MAD, MC, and MR attackers. Each value is reported as mean ± standard deviation over 50 episodes and 5 seeds.

Table 8: Evaluation results of CVPO under natural performance (no attack) and under the MAD, MC, and MR attackers. Each value is reported as mean ± standard deviation over 50 episodes and 5 seeds.

ACKNOWLEDGMENTS

We gratefully acknowledge support from the National Science Foundation under grant CAREER CNS-2047454.


Lemma 1 indicates that all tempting policies are infeasible: $\forall \pi \in \Pi^T_M$, $V^\pi_c(\mu_0) > \kappa$. We prove it by contradiction.

Proof. Suppose that for a tempting safe RL problem $\mathcal{M}^\kappa_\Pi$ there exists a tempting policy $\pi'$ that satisfies the constraint: $V^{\pi'}_c(\mu_0) \le \kappa$. Denote the optimal policy as $\pi^*$; then, based on the definition of a tempting policy, we have $V^{\pi'}_r(\mu_0) > V^{\pi^*}_r(\mu_0)$. Based on the definition of optimality, we know that for any other feasible policy $\pi \in \Pi^\kappa_M$ we have $V^{\pi^*}_r(\mu_0) \ge V^\pi_r(\mu_0)$. Since $\pi'$ is feasible yet achieves a higher reward return than $\pi^*$, it would be the optimal policy for $\mathcal{M}^\kappa_\Pi$, which contradicts the definition of a tempting policy (strictly higher reward than the optimal policy). Therefore, no tempting policy satisfies the constraint.

Proposition 1 suggests that as long as the MR attacker obtains a policy with a higher reward return than the optimal policy $\pi^*$, given a large enough perturbation set $B^p_\epsilon(s)$, it is guaranteed to be reward-stealthy and effective.

Proof. Stealthiness is naturally satisfied by the definition. Effectiveness is guaranteed by Lemma 1: since the corrupted policy $\pi^* \circ \nu_{MR}$ achieves a higher reward than the optimal policy, it lies within the tempting policy class; by Lemma 1 it therefore violates the constraint, and thus the MR attacker is effective.

The Lagrangian multiplier $\lambda$ is computed by applying feedback control to $V^\pi_c$ and is determined by the gains $K_P$, $K_I$, and $K_D$, which need to be fine-tuned.

C.3 ADVERSARIAL TRAINING FULL ALGORITHM

Due to the page limit, we omit some implementation details in the main content. We present the full algorithm and some implementation tricks in this section. Unless otherwise stated, the critics and policies are assumed to be parameterized by neural networks (NN), though we believe other parameterization forms should also work well. Denoting $\alpha_c$ as the critics' learning rate, the critics are updated by gradient steps of size $\alpha_c$. Note that the original PPO-Lagrangian algorithm is an on-policy algorithm, which does not require a reward critic and a cost critic to train the policy. We learn the critics because the MC and MR attackers require them, and they are an essential module for adversarial training.

Polyak averaging for the target networks. Polyak averaging is specified by a weight parameter $\rho \in (0, 1)$ and updates the target parameters with $\phi' \leftarrow \rho \phi' + (1 - \rho) \phi$. These critic training tricks are widely adopted in many off-policy RL algorithms, such as SAC, DDPG, and TD3. We observe that the critics trained with these implementation tricks work well in practice. Then we present the full Robust PPO-Lagrangian algorithm.

Published as a conference paper at ICLR 2023

The experimental results of the maximum-entropy method SAC-Lagrangian are shown in Table 6. We evaluated the effect of different entropy regularizers $\alpha$ on the robustness against observational perturbations. Although the trained agents can achieve almost zero constraint violations in noise-free environments, they suffer from vulnerability issues under the proposed MC and MR attacks. Increasing the entropy cannot make the agent more robust against adversarial attacks.

Linearly combined MC and MR attacker. The experimental results of trained safe RL policies under the mixture of MC and MR attackers are shown in Figure 5, with detailed results in Table 7. The mixed attacker is computed as the linear combination of the MC and MR objectives, namely $w \times MC + (1 - w) \times MR$, where $w \in [0, 1]$ is the weight.
Our agent (ADV-PPOL) with adversarial training is robust against the mixture attacker. However, there is no obvious trend showing which weight yields the strongest attack. In addition, we believe the practical performance depends heavily on the quality of the learned reward and cost Q functions: if the reward Q function is more robust and accurate than the cost Q function, then giving a larger weight to the reward Q should achieve better attack results, and vice versa.
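The mixed objective can be sketched as a plain projected-gradient attacker over the observation. The critic interfaces and the signed-gradient update are our assumptions for this sketch (the actual MC/MR attackers use the learned $Q_c$ and $Q_r$ with SGLD and more optimization steps):

```python
import torch

def mixed_mc_mr_attack(s, policy, q_reward, q_cost, w, eps, steps=20, lr=0.05):
    """Linearly combined attacker w*MC + (1-w)*MR: perturb the
    observation so the policy's resulting action maximizes
    w*Q_c + (1-w)*Q_r, both evaluated at the TRUE state s."""
    delta = torch.zeros_like(s, requires_grad=True)
    for _ in range(steps):
        a = policy(s + delta)                 # action on corrupted observation
        obj = (w * q_cost(s, a) + (1.0 - w) * q_reward(s, a)).sum()
        grad = torch.autograd.grad(obj, delta)[0]
        with torch.no_grad():
            delta += lr * grad.sign()         # ascend the attack objective
            delta.clamp_(-eps, eps)           # project onto the L_inf ball
    return (s + delta).detach()
```

Setting w = 1 recovers a pure MC (maximum-cost) attack and w = 0 a pure MR (maximum-reward) attack.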

