EFFICIENT REWARD POISONING ATTACKS ON ONLINE DEEP REINFORCEMENT LEARNING

Abstract

We study reward poisoning attacks on online deep reinforcement learning (DRL), where the attacker is oblivious to the learning algorithm used by the agent and does not necessarily have full knowledge of the environment. We demonstrate the intrinsic vulnerability of state-of-the-art DRL algorithms by designing a general, black-box reward poisoning framework called adversarial MDP attacks. We instantiate our framework to construct several new attacks which only corrupt the rewards for a small fraction of the total training timesteps and make the agent learn a low-performing policy. Our key insight is that state-of-the-art DRL algorithms strategically explore the environment to find a high-performing policy. Our attacks leverage this insight to construct a corrupted environment where (a) the agent learns a policy that performs well in the corrupted environment but poorly in the original environment, and (b) the corrupted environment is similar to the original one, so that the attacker's budget is reduced. We provide a theoretical analysis of the efficiency of our attack and perform an extensive evaluation. Our results show that our attacks efficiently poison agents learning with a variety of state-of-the-art DRL algorithms (e.g., DQN, PPO, and SAC) in several popular classical control and MuJoCo environments.

1. INTRODUCTION

In several important applications, such as robot control (Christiano et al., 2017) and recommendation systems (Afsar et al., 2021; Zheng et al., 2018), state-of-the-art online deep reinforcement learning (DRL) algorithms rely on human feedback in the form of rewards to learn high-performing policies. This dependency raises the threat of reward-based data poisoning attacks during training: a user can deliberately provide malicious rewards to make the DRL agent learn low-performing policies. Data poisoning has already been identified as the most critical security concern when employing learned models in industry (Kumar et al., 2020). It is therefore essential to study whether state-of-the-art DRL algorithms are vulnerable to reward poisoning attacks, both to discover potential security vulnerabilities and to motivate the development of more robust training algorithms.

Challenges in poisoning DRL agents. To uncover practical vulnerabilities, it is critical that the attack does not rely on unrealistic assumptions about the attacker's capabilities. To ensure a practically feasible attack, we therefore require that: (i) the attacker has no knowledge of the exact DRL algorithm used by the agent or of the parameters of the neural network used for training, and the attack is applicable to different kinds of learning algorithms (e.g., policy optimization, Q-learning); (ii) the attacker does not have detailed knowledge of the agent's environment; and (iii) to ensure stealthiness, the amount of reward corruption applied by the attacker is limited (see Section 3). As we show in Appendix G, these restrictions make finding an efficient attack very challenging.

This work: efficient poisoning attacks on DRL. To the best of our knowledge, no prior work studies the vulnerability of DRL algorithms to reward poisoning attacks under the practical restrictions mentioned above.
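The threat model above can be made concrete with a minimal sketch: the attacker sits between the environment and the agent, observes only the transitions the agent generates, and may overwrite the reward on a limited fraction of timesteps. All names here (ToyEnv, RewardPoisoningWrapper, budget_frac, corrupt) are illustrative assumptions for exposition and are not the paper's actual attack construction; the corruption rule is left as a pluggable function precisely because the attacker is agnostic to the agent's learning algorithm.

```python
class ToyEnv:
    """Minimal stand-in environment: constant reward, episodes of length 10.
    A stand-in for any black-box environment the attacker cannot inspect."""
    def reset(self):
        self.steps = 0
        return 0  # dummy observation

    def step(self, action):
        self.steps += 1
        done = self.steps >= 10
        return 0, 1.0, done, {}


class RewardPoisoningWrapper:
    """Hypothetical attacker: perturbs at most budget_frac of all rewards.

    The wrapper sees only (obs, action, reward) at each step; it knows
    nothing about the agent's algorithm or network parameters.
    """
    def __init__(self, env, corrupt, budget_frac=0.05):
        self.env = env                  # original environment (black box)
        self.corrupt = corrupt          # attacker's reward-corruption rule
        self.budget_frac = budget_frac  # max fraction of poisoned timesteps
        self.t = 0                      # total timesteps observed
        self.poisoned = 0               # timesteps actually poisoned

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.t += 1
        # Poison only while within budget, to keep the attack stealthy.
        if self.poisoned < self.budget_frac * self.t:
            new_reward = self.corrupt(obs, action, reward)
            if new_reward != reward:
                self.poisoned += 1
                reward = new_reward
        return obs, reward, done, info
```

An agent training against this wrapper instead of the raw environment receives a corrupted reward stream on a bounded fraction of steps; the design question the paper addresses is *which* corruption rule makes that small budget sufficient to force a low-performing policy.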
To overcome the challenges in designing efficient attacks and to demonstrate the vulnerability of state-of-the-art DRL algorithms, we make the following contributions:

1. We propose a general, efficient, and parametric reward poisoning framework for DRL algorithms, which we call the adversarial MDP attack, and instantiate it to generate several attack methods that are applicable to any kind of learning algorithm and are computationally efficient. To the best of our knowledge, our attack is the first one that considers the following four key elements in the

