EXPLORING TRANSFERABILITY OF PERTURBATIONS IN DEEP REINFORCEMENT LEARNING

Abstract

The use of Deep Neural Networks (DNNs) as function approximators has led to striking progress in reinforcement learning algorithms and applications. At the same time, deep reinforcement learning agents have inherited the vulnerability of DNNs to imperceptible adversarial perturbations of their inputs. Prior work on adversarial perturbations for deep reinforcement learning has generally relied on computing an adversarial perturbation customized to each state visited by the agent. In this paper we propose a more realistic threat model in which the adversary computes the perturbation only once, based on a single state. Furthermore, we show that a single adversarial offset vector, computed in a black-box setting, is enough to cause a deep reinforcement learning agent to fail. We conduct experiments in various environments from the Atari baselines, and use our single-state adversaries to demonstrate the transferability of perturbations both between states of one MDP and between entirely different MDPs. We believe our adversary framework reveals fundamental properties of the environments used in deep reinforcement learning training, and is a tangible step towards building robust and reliable deep reinforcement learning agents.

1. INTRODUCTION

Building on the success of DNNs for image classification, deep reinforcement learning has seen remarkable advances in various complex environments (Mnih et al., 2015; Schulman et al., 2017; Lillicrap et al., 2015). Along with these successes come new challenges stemming from the lack of robustness of DNNs to small adversarial perturbations of their inputs. This lack of robustness is especially critical for deep reinforcement learning, where the actions taken by the agent can have serious real-life consequences (Levin & Carrie, 2018). Szegedy et al. (2014) showed that imperceptible perturbations added to images could cause a DNN image classifier to misclassify, and hypothesized that this was a result of the DNN's complexity and nonlinearities.

Follow-up work by Goodfellow et al. (2015) proposed the more computationally efficient fast gradient sign method (FGSM) for computing adversarial examples, while explaining the presence of adversarial examples as a result of DNN classifiers learning approximately linear functions. The authors additionally argue that a common high-dimensional linear decision boundary explains the transferability of adversarial examples between models trained on different minibatches of the same training set or with different architectures. The transferability of perturbations themselves, in contrast to the transferability of complete adversarial examples, was investigated by Moosavi-Dezfooli et al. (2017). The authors showed how to compute one universal perturbation from a minibatch of images that could be used to fool a classifier by adding it to each image in the validation set, and argue that the existence of universal perturbations is explained by correlations between the local decision boundaries near many different points in the dataset.

More recently, a new body of research on adversarial perturbations for deep reinforcement learning agents has developed, demonstrating that it is possible to make a high-performing agent fail by adding perturbations to the agent's perception of the current state (Lin et al., 2017; Sun et al., 2020; Huang et al., 2017; Pattanaik et al., 2018; Mandlekar et al., 2017). A major drawback of this prior work is the threat model, in which the attacker computes perturbations and adds them to the agent's state in real time, before the state is perceived by the agent. Such a threat model is not practically feasible.
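As a concrete illustration of FGSM, the sketch below takes one sign-of-gradient step on a hand-built linear scorer with logistic loss. The weights, input, and epsilon are invented for illustration, and the analytic gradient stands in for backpropagation through a real DNN; this is a toy sketch of the method, not the setup used in any of the cited papers.

```python
import numpy as np

# Toy linear "classifier": score = w . x, predict +1 if score > 0, else -1.
# For logistic loss L = log(1 + exp(-y * w.x)) with label y in {-1, +1},
# dL/dx = -y * sigmoid(-y * w.x) * w, whose elementwise sign is sign(-y * w),
# so the FGSM step reduces to x + eps * sign(-y * w).

def fgsm_perturb(x, w, y, eps):
    """One FGSM step: move x in the direction that increases the loss."""
    grad_sign = np.sign(-y * w)   # sign of the input gradient of the loss
    return x + eps * grad_sign    # adversarial example, ||x_adv - x||_inf <= eps

w = np.array([1.0, -2.0, 0.5])    # hypothetical model weights
x = np.array([0.3, -0.1, 0.2])    # clean input: score = 0.6 > 0, correct
y = 1.0                           # true label

x_adv = fgsm_perturb(x, w, y, eps=0.3)
print(np.dot(w, x))               # clean score (positive)
print(np.dot(w, x_adv))           # adversarial score (pushed negative)
```

Note that the attack budget is an L-infinity ball of radius eps: every pixel (coordinate) moves by exactly eps, which is what makes the perturbation "imperceptible" for small eps while still shifting the score by eps times the L1 norm of w.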
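The contrast drawn above, between per-state perturbations and a single reused one, can be sketched on a toy linear policy: the offset is computed once from a single reference state and then added unchanged to every later observation. The policy, the states, and the way the offset is chosen here are all hypothetical stand-ins for illustration, not the method proposed in the paper.

```python
import numpy as np

# Toy greedy policy over 2 actions on a 2-dim state (hypothetical weights).
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])

def act(state):
    return int(np.argmax(W @ state))  # greedy action under the linear policy

# Attacker observes ONE reference state and computes ONE fixed offset that
# pushes the features toward a different action (a stand-in for whatever
# procedure the attacker actually uses to compute the offset once).
ref_state = np.array([1.0, 0.0])
a_ref = act(ref_state)                            # preferred action: 0
delta = -1.5 * np.sign(W[a_ref] - W[1 - a_ref])   # fixed offset vector

# The same delta is reused on every subsequent state, with no per-state
# computation, unlike the real-time threat model criticized above.
states = np.array([[1.0, 0.2],
                   [0.8, -0.1],
                   [0.5, 0.0]])
for s in states:
    print(act(s), act(s + delta))  # clean action vs. perturbed action
```

In this toy example the single offset flips the greedy action on every state, which is the flavor of transferability between states the paper investigates (there, with learned policies and perturbations found in a black-box setting).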

