EXPLORING TRANSFERABILITY OF PERTURBATIONS IN DEEP REINFORCEMENT LEARNING

Abstract

The use of Deep Neural Networks (DNNs) as function approximators has led to striking progress for reinforcement learning algorithms and applications. At the same time, deep reinforcement learning agents have inherited the vulnerability of DNNs to imperceptible adversarial perturbations of their inputs. Prior work on adversarial perturbations for deep reinforcement learning has generally relied on calculating an adversarial perturbation customized to each state visited by the agent. In this paper we propose a more realistic threat model in which the adversary computes the perturbation only once, based on a single state. Furthermore, we show that a single adversarial offset vector, in a black-box setting, is enough to cause a deep reinforcement learning agent to fail. We conduct experiments in various environments from the Atari baselines, and use our single-state adversaries to demonstrate the transferability of perturbations both between states of one MDP and between entirely different MDPs. We believe our adversary framework reveals fundamental properties of the environments used in deep reinforcement learning training, and is a tangible step towards building robust and reliable deep reinforcement learning agents.

1. INTRODUCTION

Building on the success of DNNs for image classification, deep reinforcement learning has seen remarkable advances in various complex environments (Mnih et al., 2015; Schulman et al., 2017; Lillicrap et al., 2015). Along with these successes come new challenges stemming from the lack of robustness of DNNs to small adversarial perturbations of their inputs. This lack of robustness is especially critical for deep reinforcement learning, where the actions taken by the agent can have serious real-life consequences (Levin & Carrie, 2018). Szegedy et al. (2014) showed that imperceptible perturbations added to images could cause a DNN image classifier to misclassify, and hypothesized that this was a result of the DNN's complexity and nonlinearities. Follow-up work by Goodfellow et al. (2015) proposed the more computationally efficient fast gradient sign method (FGSM) for computing adversarial examples, while explaining the presence of adversarial examples as a result of DNN classifiers learning approximately linear functions. The authors additionally argue that a common high-dimensional linear decision boundary explains the transferability of adversarial examples between models trained on different minibatches of the same training set or with different architectures. The transferability of perturbations themselves, in contrast to the transferability of complete adversarial examples, was investigated by Moosavi-Dezfooli et al. (2017). The authors showed how to compute one universal perturbation from a minibatch of images that fools a classifier when added to each image in the validation set, and argue that the existence of universal perturbations is explained by correlations between the local decision boundaries near many different points in the dataset.
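The FGSM update mentioned above computes s_adv = s + ε · sign(∇_s J(s)), i.e. it moves each input dimension by ε in the direction that increases the training cost J. A minimal sketch, using a toy linear model whose input gradient is simply its weight vector (all names here are illustrative, not from the paper):

```python
import numpy as np

def fgsm_perturbation(grad_J, epsilon=0.01):
    """Fast gradient sign method (Goodfellow et al., 2015).

    Given the gradient of the cost J with respect to the input,
    move each input dimension by epsilon in the sign direction of
    the gradient, which (to first order) increases the cost.
    """
    return epsilon * np.sign(grad_J)

# Toy example: a linear "network" J(s) = w . s, whose input gradient is w.
w = np.array([0.5, -2.0, 0.0, 1.5])
s = np.array([0.1, 0.2, 0.3, 0.4])
s_adv = s + fgsm_perturbation(grad_J=w, epsilon=0.01)
# Each coordinate moves by at most epsilon (an L-infinity bound);
# coordinates with zero gradient are left unchanged since sign(0) = 0.
assert np.max(np.abs(s_adv - s)) <= 0.01 + 1e-12
```

Because the perturbation depends only on the sign of the gradient, a single forward-backward pass suffices, which is what makes FGSM cheap compared with iterative attacks.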

More recently, a new body of research on adversarial perturbations for deep reinforcement learning agents has developed, demonstrating that it is possible to make a high-performing agent fail by adding perturbations to the agent's perception of the current state (Lin et al., 2017; Sun et al., 2020; Huang et al., 2017; Pattanaik et al., 2018; Mandlekar et al., 2017). A major drawback of this prior work is its threat model, in which the attacker computes perturbations and adds them to the agent's state in real time, before the state is perceived by the agent. Such a threat model is not practically feasible given the heavyweight optimization methods used by Lin et al. (2017) and Sun et al. (2020). To address this we propose a new threat model in which the attacker computes just one adversarial offset vector based on a single state. Not only is this a more practical threat model from a security perspective, but the success of attacks in this model gives insight into the transferability of adversarial examples and the properties of the functions learned by trained deep reinforcement learning agents. We believe the adversarial perspective is an initial step towards understanding the loss landscape and the geometry of the decision boundaries of the algorithms in use, assessing the generalization capabilities of trained agents, and building robust and reliable deep reinforcement learning agents. For these reasons we focus on the adversarial perspective and make the following contributions:

• We introduce a framework of six distinct adversary types encapsulating our new threat model, in which the attacker is restricted to computing an adversarial perturbation based on a single state.

• We use our framework to investigate the transferability of adversarial perturbations, both between states of the same MDP and between different MDPs.

• We show with experiments in the Atari baselines that single-state attacks have a significant impact on agent performance.

• In particular, we show that one perturbation, computed from a random state of a random episode in a completely different environment, can dramatically degrade a deep reinforcement learning agent's performance. This transferability of adversarial perturbations between environments implies a correlation between the features learned by agents trained in different environments.

While these results demonstrate that deep reinforcement learning agents are indeed learning representations that generalize across environments, they also expose a critical vulnerability from the security point of view, one that can be exploited without any knowledge of the training environment or neural architecture.
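The threat model above can be sketched as a thin wrapper around the agent's observations: the offset vector is computed once, from a single state, and then reused unchanged, with no per-state optimization. All names and the plugged-in perturbation routine are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

class SingleStateAdversary:
    """Sketch of the single-state threat model: the offset vector delta
    is computed ONCE, from the first state observed, and then the same
    fixed delta is added to every subsequent observation."""

    def __init__(self, compute_perturbation, epsilon=0.01):
        self.compute_perturbation = compute_perturbation  # any attack routine
        self.epsilon = epsilon
        self.delta = None  # computed lazily from a single state

    def observe(self, state):
        if self.delta is None:
            # One-time computation from a single state.
            self.delta = self.compute_perturbation(state, self.epsilon)
        # Afterwards, the identical offset is reused for every state.
        return state + self.delta

# Usage: any perturbation routine can be plugged in; a fixed random
# sign direction stands in for a real attack here.
rng = np.random.default_rng(0)
adv = SingleStateAdversary(lambda s, eps: eps * np.sign(rng.normal(size=s.shape)))
s0 = np.zeros(4)
perturbed0 = adv.observe(s0)
perturbed1 = adv.observe(np.ones(4))
# The same delta is applied to both (different) states.
assert np.allclose(perturbed1 - np.ones(4), perturbed0 - s0)
```

The point of the sketch is the contrast with prior threat models: no gradient access or optimization is needed at decision time, only a single vector addition per observation.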

2.1. DEEP REINFORCEMENT LEARNING

In this paper we examine discrete action space MDPs represented by a tuple M = (S, A, P, r, γ, s_0), where S is a set of continuous states, A is a set of discrete actions, P : S × A × S → [0, 1] is the transition probability function, r : S × A → R is the reward function, γ is the discount factor, and s_0 is the initial state distribution. The agent interacts with the environment by observing s ∈ S and taking actions a ∈ A. The goal is to learn a policy π_θ : S × A → [0, 1] that maximizes the cumulative discounted reward Σ_{t=0}^{T-1} γ^t r(s_t, a_t). For an MDP M and policy π(s, a), we call a sequence of state, action, reward tuples (s_i, a_i, r_i) generated by following π(s, a) in M an episode. We use p_{M,π} to denote the probability distribution over episodes induced by the randomness in M and in the policy π.
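The episode and discounted-return definitions above can be sketched as a rollout loop. The `step_fn` and `policy` callables stand in for the dynamics P, reward r, and policy π; these names are illustrative, not a fixed API:

```python
def rollout(step_fn, policy, s0, gamma=0.99, horizon=100):
    """Generate one episode and its discounted return.

    step_fn(s, a) -> (next_state, reward, done) stands in for the MDP
    dynamics P and reward r; policy(s) -> a stands in for pi.
    """
    episode, ret, s = [], 0.0, s0
    for t in range(horizon):
        a = policy(s)
        s_next, r, done = step_fn(s, a)
        episode.append((s, a, r))   # the (s_i, a_i, r_i) tuples
        ret += (gamma ** t) * r     # sum_t gamma^t * r(s_t, a_t)
        s = s_next
        if done:
            break
    return episode, ret

# Toy chain MDP: action 1 moves right and pays reward 1; the episode
# terminates once state 3 is reached.
def step_fn(s, a):
    s_next = s + a
    return s_next, float(a), s_next >= 3

episode, ret = rollout(step_fn, policy=lambda s: 1, s0=0, gamma=0.5)
# Three rewards of 1 discounted by 0.5: 1 + 0.5 + 0.25 = 1.75
assert abs(ret - 1.75) < 1e-12
```

Sampling many such episodes corresponds to drawing from the distribution p_{M,π} induced by the environment and policy randomness.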

2.2. ADVERSARIAL PERTURBATION METHODS

In this work we create the adversarial perturbations via two different methods. The first is the Carlini & Wagner (2017) formulation, which in the deep reinforcement learning setup is

min_{s_adv ∈ S} c · J(s_adv) + ||s_adv − s||_2^2    (2)
Here s is the unperturbed input, s_adv is the adversarially perturbed input, and J(s) is the augmented cost function used to train the network. The second method we use to produce the adversarial examples is the elastic-net regularization (EAD) formulation of Chen et al. (2018),

min_{s_adv ∈ S} c · J(s_adv) + λ_1 ||s_adv − s||_1 + λ_2 ||s_adv − s||_2^2
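As a sketch of how an objective of this form can be minimized, the code below runs an ISTA-style proximal-gradient loop: a gradient step on the smooth terms (c·J plus the squared L2 penalty), followed by elementwise soft-thresholding for the L1 penalty. This is an illustrative stand-in, not the authors' or EAD's actual implementation, and the toy cost J with its gradient is a hypothetical example:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t * ||x||_1 (elementwise shrinkage).
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def elastic_net_attack(s, grad_J, c=1.0, lam1=0.01, lam2=0.1, lr=0.05, steps=200):
    """ISTA-style loop for
        min_{s_adv} c*J(s_adv) + lam1*||s_adv - s||_1 + lam2*||s_adv - s||_2^2
    where grad_J(x) returns the gradient of the cost J at x."""
    delta = np.zeros_like(s)
    for _ in range(steps):
        # Gradient step on the smooth part c*J + lam2*||delta||_2^2 ...
        g = c * grad_J(s + delta) + 2.0 * lam2 * delta
        # ... then the proximal step for the lam1*||delta||_1 term.
        delta = soft_threshold(delta - lr * g, lr * lam1)
    return s + delta

# Toy cost J(x) = 0.5 * ||x - target||^2, minimized at a (hypothetical)
# adversarial target, so its input gradient is x - target.
target = np.array([1.0, 0.0, -1.0])
s = np.zeros(3)
s_adv = elastic_net_attack(s, grad_J=lambda x: x - target)
```

On this toy problem the loop converges to the closed-form elastic-net solution, pulling s_adv toward the target while the L1 term keeps unperturbed coordinates exactly zero; setting lam1 = 0 recovers a penalty of the same form as formulation (2).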

