ON THE ROBUSTNESS OF SAFE REINFORCEMENT LEARNING UNDER OBSERVATIONAL PERTURBATIONS

Abstract

Safe reinforcement learning (RL) trains a policy to maximize the task reward while satisfying safety constraints. While prior works focus on performance optimality, we find that the optimal solutions of many safe RL problems are not robust and safe against carefully designed observational perturbations. We formally analyze the unique properties of designing effective observational adversarial attackers in the safe RL setting. We show that baseline adversarial attack techniques for standard RL tasks are not always effective for safe RL, and propose two new approaches: one maximizes the cost and the other maximizes the reward. One interesting and counter-intuitive finding is that the maximum-reward attack is strong, as it can both induce unsafe behaviors and keep the attack stealthy by maintaining the reward. We further propose a robust training framework for safe RL and evaluate it via comprehensive experiments. This paper provides pioneering work on investigating the safety and robustness of RL under observational attacks for future safe RL studies. Code is available at: https://github.com/liuzuxin/safe-rl-robustness

1. INTRODUCTION

Despite the great success of deep reinforcement learning (RL) in recent years, it is still challenging to ensure safety when deploying learned policies in the real world. Safe RL tackles this problem by solving a constrained optimization that maximizes the task reward while satisfying safety constraints (Brunke et al., 2021), which has been shown to be effective in learning safe policies in many tasks (Zhao et al., 2021; Liu et al., 2022; Sootla et al., 2022b). The success of recent safe RL approaches leverages the power of neural networks (Srinivasan et al., 2020; Thananjeyan et al., 2021). However, neural networks are known to be vulnerable to adversarial attacks: a small perturbation of the input data may lead to a large variation of the output (Machado et al., 2021; Pitropakis et al., 2019), which raises a concern when deploying a neural network RL policy in safety-critical applications (Akhtar & Mian, 2018). While many recent safe RL methods with deep policies achieve outstanding constraint satisfaction in noise-free simulation environments, their vulnerability under adversarial perturbations has not been studied in the safe RL setting. We consider observational perturbations that commonly exist in the physical world, such as unavoidable sensor errors and upstream perception inaccuracies (Zhang et al., 2020a). Several recent works on observationally robust RL have shown that deep RL agents can be attacked via sophisticated observation perturbations that drastically decrease their rewards (Huang et al., 2017; Zhang et al., 2021). However, the robustness concepts and adversarial training methods of standard RL settings may not be suitable for safe RL, because of an additional metric that characterizes the cost of constraint violations (Brunke et al., 2021). The cost should matter more than the reward, since any constraint violation could be fatal and unacceptable in the real world (Berkenkamp et al., 2017).
For example, consider an autonomous vehicle navigation task where the reward encourages reaching the goal as fast as possible and the safety constraint is to not collide with obstacles. Sacrificing some reward is not comparable to violating the constraint, because the latter may cause catastrophic consequences. However, we find little research formally studying robustness in the safe RL setting under adversarial observation perturbations, while we believe this should be an important aspect of the safe RL area: a policy that is vulnerable under adversarial attacks cannot be regarded as truly safe in the physical world. We aim to address the following questions in this work: 1) How vulnerable is a learned RL agent under observational adversarial attacks? 2) How can we design effective attackers in the safe RL setting? 3) How can we obtain a robust policy that maintains safety even under worst-case perturbations? To answer them, we formally define the observationally robust safe RL problem and discuss how to evaluate the adversary and the robustness of a safe RL policy. We also propose two strong adversarial attacks that can induce the agent to perform unsafe behaviors, and show that adversarial training can help improve the robustness of constraint satisfaction. We summarize the contributions as follows.

1. We formally analyze the policy vulnerability in safe RL under observational corruptions, investigate the observational-adversarial safe RL problem, and show that the optimal solutions of safe RL problems are vulnerable under observational adversarial attacks.

2. We find that existing adversarial attacks focusing on minimizing agent rewards do not always work, and propose two effective attack algorithms with theoretical justifications: one directly maximizes the cost, and the other maximizes the task reward to induce a tempting but risky policy. Surprisingly, the maximum-reward attack is very strong in inducing unsafe behaviors, both in theory and in practice. We believe this property has been overlooked because maximizing reward is the optimization goal of standard RL, yet it leads to risky and stealthy attacks on safety constraints.

3. We propose an adversarial training algorithm with the proposed attackers and show contraction properties of their Bellman operators. Extensive experiments on continuous control tasks show that our method is more robust against adversarial perturbations in terms of constraint satisfaction.
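To make the maximum-cost (MC) attack idea above concrete, the following is a minimal, hypothetical sketch (not the paper's exact algorithm): a projected gradient-ascent attack that perturbs the observation within an L-infinity ball of radius eps to maximize a cost critic. The critic `Qc` and the finite-difference gradient are stand-ins for a learned cost Q-function and its analytic gradient.

```python
import numpy as np

def num_grad(f, x, h=1e-5):
    """Finite-difference gradient of scalar f at x (illustrative stand-in
    for backpropagation through a learned cost critic)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def mc_attack(obs, Qc, eps=0.1, steps=10, lr=0.05):
    """Hypothetical maximum-cost attack: ascend Qc within the eps-ball."""
    x = obs.copy()
    for _ in range(steps):
        x = x + lr * np.sign(num_grad(Qc, x))   # ascend the cost critic
        x = np.clip(x, obs - eps, obs + eps)    # project back into the ball
    return x
```

Under this sketch, the maximum-reward (MR) attack follows the same recipe with a reward critic in place of `Qc`, which is what makes it stealthy: the perturbed agent still appears to earn high reward.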

2. RELATED WORK

Safe RL. One type of approach utilizes domain knowledge of the target problem to improve the safety of an RL agent, such as designing a safety filter (Dalal et al., 2018; Yu et al., 2022), assuming a sophisticated system dynamics model (Liu et al., 2020; Luo & Ma, 2021; Chen et al., 2021), or incorporating expert interventions (Saunders et al., 2017; Alshiekh et al., 2018). The Constrained Markov Decision Process (CMDP) is a commonly used framework to model the safe RL problem, which can be solved via constrained optimization techniques (Garcıa & Fernández, 2015; Gu et al., 2022; Sootla et al., 2022a; Flet-Berliac & Basu, 2022). The Lagrangian-based method is a generic constrained optimization algorithm for solving CMDPs, which introduces additional Lagrange multipliers to penalize constraint violations (Bhatnagar & Lakshmanan, 2012; Chow et al., 2017; As et al., 2022). The multiplier can be optimized via gradient descent together with the policy parameters (Liang et al., 2018; Tessler et al., 2018), and can be easily incorporated into many existing RL methods (Ray et al., 2019). Another line of work approximates the non-convex constrained optimization problem with low-order Taylor expansions and then obtains the dual variable via convex optimization (Yu et al., 2019; Yang et al., 2020; Gu et al., 2021; Kim & Oh, 2022). Since the constrained optimization-based methods are more general, we focus our discussion of safe RL on them.

Robust RL. The robustness definition in the RL context has many interpretations (Sun et al., 2021; Moos et al., 2022; Korkmaz, 2023), including robustness against action perturbations (Tessler et al., 2019), reward corruptions (Wang et al., 2020; Lin et al., 2020; Eysenbach & Levine, 2021), domain shift (Tobin et al., 2017; Muratore et al., 2018), and dynamics uncertainty (Pinto et al., 2017; Huang et al., 2022).
The works most related to ours investigate the observational robustness of an RL agent under observational adversarial attacks (Zhang et al., 2020a; 2021; Liang et al., 2022; Korkmaz, 2022). It has been shown that neural network policies can be easily attacked by adversarial observation noise, leading to much lower rewards than the optimal policy (Huang et al., 2017; Kos & Song, 2017; Lin et al., 2017; Pattanaik et al., 2017). However, most robust RL approaches model the attack and defense with respect to the reward, while robustness regarding safety, i.e., constraint satisfaction in safe RL, has not been formally investigated.

3. OBSERVATIONAL ADVERSARIAL ATTACK FOR SAFE RL

3.1 MDP, CMDP, AND THE SAFE RL PROBLEM

An infinite-horizon Markov Decision Process (MDP) is defined by the tuple (S, A, P, r, γ, µ0), where S is the state space, A is the action space, P : S × A × S → [0, 1] is the transition kernel that specifies the transition probability p(s_{t+1} | s_t, a_t) from state s_t to s_{t+1} under action a_t, r : S × A × S → R is the reward function, γ ∈ [0, 1) is the discount factor, and µ0 : S → [0, 1] is the initial state distribution. We study safe RL under the Constrained MDP (CMDP) framework
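As a minimal concrete sketch (in Python, with illustrative names not taken from the paper), the MDP tuple above, extended with a cost function c as in the CMDP framework, can be represented for a finite state/action space as:

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical finite CMDP container mirroring the tuple
# (S, A, P, r, c, gamma, mu0); a sketch, not the paper's implementation.
@dataclass
class CMDP:
    states: List[int]
    actions: List[int]
    P: Dict[Tuple[int, int], Dict[int, float]]   # p(s' | s, a)
    r: Callable[[int, int, int], float]          # reward r(s, a, s')
    c: Callable[[int, int, int], float]          # cost  c(s, a, s')
    gamma: float                                 # discount factor in [0, 1)
    mu0: Dict[int, float]                        # initial state distribution

    def step(self, s: int, a: int, rng=random):
        """Sample s' ~ p(. | s, a) and return (s', reward, cost)."""
        next_states, probs = zip(*self.P[(s, a)].items())
        s_next = rng.choices(next_states, weights=probs, k=1)[0]
        return s_next, self.r(s, a, s_next), self.c(s, a, s_next)
```

The only difference from a plain MDP is the extra cost signal c returned alongside the reward, which the safety constraint is expressed over.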

