EFFICIENT REWARD POISONING ATTACKS ON ONLINE DEEP REINFORCEMENT LEARNING

Abstract

We study reward poisoning attacks on online deep reinforcement learning (DRL), where the attacker is oblivious to the learning algorithm used by the agent and does not necessarily have full knowledge of the environment. We demonstrate the intrinsic vulnerability of state-of-the-art DRL algorithms by designing a general, black-box reward poisoning framework called adversarial MDP attacks. We instantiate our framework to construct several new attacks that corrupt the rewards for only a small fraction of the total training timesteps and make the agent learn a low-performing policy. Our key insight is that state-of-the-art DRL algorithms strategically explore the environment to find a high-performing policy. Our attacks leverage this insight to construct a corrupted environment in which (a) the agent learns a policy that performs well in the corrupted environment but poorly in the original one, and (b) the corrupted environment stays close to the original one, which keeps the attacker's corruption budget small. We provide a theoretical analysis of the efficiency of our attack and perform an extensive evaluation. Our results show that our attacks efficiently poison agents trained with a variety of state-of-the-art DRL algorithms, such as DQN, PPO, and SAC, in several popular classical control and MuJoCo environments.

1. INTRODUCTION

In several important applications, such as robot control (Christiano et al., 2017) and recommendation systems (Afsar et al., 2021; Zheng et al., 2018), state-of-the-art online deep reinforcement learning (DRL) algorithms rely on human feedback, in the form of rewards, to learn high-performing policies. This dependency raises the threat of reward-based data poisoning attacks during training: a user can deliberately provide malicious rewards to make the DRL agent learn low-performing policies. Data poisoning has already been identified as the most critical security concern when employing learned models in industry (Kumar et al., 2020). It is therefore essential to study whether state-of-the-art DRL algorithms are vulnerable to reward poisoning attacks, both to discover potential security vulnerabilities and to motivate the development of more robust training algorithms.

Challenges in poisoning DRL agents. To uncover practical vulnerabilities, it is critical that the attack does not rely on unrealistic assumptions about the attacker's capabilities. Therefore, to ensure a practically feasible attack, we require that: (i) the attacker has no knowledge of the exact DRL algorithm used by the agent or of the parameters of the neural network used for training, and the attack is applicable to different kinds of learning algorithms (e.g., policy optimization, Q-learning); (ii) the attacker does not have detailed knowledge of the agent's environment; and (iii) to ensure stealthiness, the amount of reward corruption applied by the attacker is limited (see Section 3). As we show in Appendix G, these restrictions make finding an efficient attack very challenging.

This work: efficient poisoning attacks on DRL. To the best of our knowledge, no prior work studies the vulnerability of DRL algorithms to reward poisoning attacks under the practical restrictions mentioned above.
To overcome the challenges in designing efficient attacks and to demonstrate the vulnerability of state-of-the-art DRL algorithms, we make the following contributions:

1. We propose a general, efficient, and parametric reward poisoning framework for DRL algorithms, which we call the adversarial MDP attack, and instantiate it to generate several attack methods that are applicable to any kind of learning algorithm and computationally efficient. To the best of our knowledge, ours is the first attack that combines the following four key elements in its threat model: (1) training-time attack, (2) deep RL, (3) reward poisoning, and (4) complete black-box attack (no knowledge of, or assumptions about, the learning algorithm and the environment). A detailed explanation of each key point is provided in Appendix A.

2. We provide a theoretical analysis of our attack methods, based on certain assumptions about the efficiency of the DRL algorithms, which yields several insightful implications.

3. We provide an extensive evaluation of our attack methods for poisoning the training of several state-of-the-art DRL algorithms, such as DQN, PPO, and SAC, in the classical control and MuJoCo environments commonly used for developing and testing DRL algorithms. Our results show that our attack methods significantly reduce the performance of the policy learned by the agent in the majority of cases and are considerably more efficient than baseline attacks (e.g., VA2C-P (Sun et al., 2020), reward flipping (Zhang et al., 2021b)). We further validate the implications of our theoretical analysis by observing the corresponding phenomena in experiments.

Observation perturbation attack and defense. There is a line of work studying observation perturbation attacks during training time (Behzadan & Munir, 2017a;b; Inkawhich et al., 2019) and the corresponding defenses (Zhang et al., 2021a; 2020a).
The threat model there does not change the actual state or reward of the environment; instead, it changes the learner's observation of the environment by generating adversarial examples. In contrast, in the poisoning attack considered in our work, the attacker changes the actual reward or state of the environment. The observation perturbation attack assumes the ability to perturb the sensors the agent uses to observe the environment. It is therefore impractical when the attacker has no access to the agent's sensors, or when the agent does not rely on sensors to interact with the environment.
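The black-box threat model described above can be made concrete with a minimal sketch of a reward-poisoning wrapper that sits between the environment and the learner. This is an illustrative stand-in, not the paper's adversarial MDP construction: the class `PoisonedEnv`, the parameters `budget_fraction` and `delta`, and the `attack_fn` callback are all hypothetical names, and the environment interface follows the common Gym-style `reset`/`step` convention.

```python
class PoisonedEnv:
    """Hypothetical wrapper: the learner sees only the corrupted rewards.

    The attacker is oblivious to the learning algorithm; it observes only
    (state, action, reward) transitions, corrupts at most a fixed fraction
    of timesteps, and bounds each per-step perturbation by `delta`.
    """

    def __init__(self, env, attack_fn, budget_fraction=0.05, delta=1.0):
        self.env = env
        self.attack_fn = attack_fn        # maps (state, action, reward) -> target reward
        self.budget_fraction = budget_fraction
        self.delta = delta                # per-step corruption bound
        self.steps = 0
        self.corrupted = 0

    def reset(self):
        return self.env.reset()

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        self.steps += 1
        # Corrupt only while within the overall corruption budget.
        if self.corrupted < self.budget_fraction * self.steps:
            target = self.attack_fn(state, action, reward)
            # Clip the perturbation toward the target to the per-step bound.
            poisoned = reward + max(-self.delta, min(self.delta, target - reward))
            if poisoned != reward:
                self.corrupted += 1
                reward = poisoned
        return state, reward, done, info
```

Note that the wrapper never inspects the agent's network or update rule, which is precisely requirement (i) of the threat model; the budget counter enforces requirement (iii).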

2. RELATED WORK

Data poisoning attacks on DRL. The work of Sun et al. (2020) is the only other work that considers reward poisoning attacks on DRL and is therefore the closest to ours. Their attack has three main limitations compared to ours: (a) it requires knowledge of the learning algorithm (the update rule for learned policies) used by the agent, so it is not a complete black-box setting; (b) it only works for on-policy learning algorithms; and (c) the attacker in their setting decides whether to attack only after receiving a whole training batch. This makes the attack infeasible when the agent updates on each observation at every time step, since the attacker can then no longer retroactively corrupt previous observations in a training batch. We experimentally compare against their attack by adapting our general attacks to their restricted setting. Our results in Appendix I show that our attack requires far less computation and achieves better attack results.

Robust learning algorithms against data poisoning attacks. Robust learning algorithms can guarantee efficient learning under data poisoning attacks. Robustness has been studied in the bandit (Lykouris et al., 2018; Gupta et al., 2019) and tabular MDP settings (Chen et al., 2021; Wu et al., 2021; Lykouris et al., 2021), but these results do not carry over to the more complex DRL setting. For the DRL setting, Zhang et al. (2021b) propose a learning algorithm guaranteed to be robust in a simplified DRL setting under strong assumptions on the environment (e.g., a linear Q function and a finite action space). The algorithm is further tested empirically in actual DRL settings, but the attack method used to test robustness, which we call the reward-flipping attack, is neither very efficient nor very malicious, as we show in Appendix H. Testing against weak attack methods can provide a false sense of security.
Our work provides attack methods that are more suitable for empirically measuring the robustness of learning algorithms.
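For reference, the reward-flipping baseline can be sketched as negating the reward on a randomly chosen fraction of timesteps. This is only our reading of the baseline (the exact protocol in Zhang et al. (2021b) may differ); the function name and parameters below are hypothetical.

```python
import random

def flip_rewards(rewards, fraction=0.1, seed=0):
    """Negate the reward on a random `fraction` of timesteps (sketch)."""
    rng = random.Random(seed)
    out = list(rewards)
    n_flip = int(fraction * len(out))
    # Choose distinct timesteps to attack, then flip the sign of each.
    for i in rng.sample(range(len(out)), n_flip):
        out[i] = -out[i]
    return out
```

Because the flipped timesteps are chosen uniformly at random, independent of the agent's exploration, such a baseline does not steer learning toward any particular bad policy, which is one reason it makes a weak robustness test.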



Testing-time attacks on RL. Testing-time attacks (evasion attacks) in the deep RL setting are well studied in the literature (Huang et al., 2017; Kos & Song, 2017; Lin et al., 2017). Given an already trained policy, testing-time attacks find adversarial examples on which the learned policy exhibits undesired behavior. In contrast, our training-time attack corrupts rewards to make the agent learn low-performing policies.

