BOUNDED MYOPIC ADVERSARIES FOR DEEP REINFORCEMENT LEARNING AGENTS

Abstract

Adversarial attacks against deep neural networks have been widely studied. Adversarial examples for deep reinforcement learning (DeepRL) have significant security implications due to the deployment of these algorithms in many application domains. In this work we formalize an optimal myopic adversary for deep reinforcement learning agents. Our adversary attempts to find a bounded perturbation of the state which minimizes the value of the action taken by the agent. We show with experiments in various games in the Atari environment that our attack formulation achieves a significantly larger impact than the current state of the art. Furthermore, this enables us to lower by several orders of magnitude the perturbation bound needed to efficiently achieve a significant impact on DeepRL agents.

1. INTRODUCTION

Deep Neural Networks (DNNs) have become a powerful tool and are currently widely used in speech recognition (Hannun et al., 2014), computer vision (Krizhevsky et al., 2012), natural language processing (Sutskever et al., 2014), and self-learning systems such as deep reinforcement learning agents (Mnih et al. (2015), Mnih et al. (2016), Schulman et al. (2015), Lillicrap et al. (2015)). Along with the overwhelming success of DNNs in various domains there has also been a line of research investigating their weaknesses. Szegedy et al. (2014) observed that adding imperceptible perturbations to images can lead a DNN to misclassify the input image. The authors argue that the existence of these so-called adversarial examples is a form of overfitting. In particular, they hypothesize that a very complicated neural network behaves well on the training set but nonetheless performs poorly on the test set, enabling exploitation by the attacker. However, they discovered that different DNN models were misclassifying the same adversarial examples and assigning them the same class, rather than making random mistakes. This led Goodfellow et al. (2015) to propose that the DNN models were actually learning approximately linear functions, resulting in underfitting the data. Mnih et al. (2015) introduced the use of DNNs as function approximators in reinforcement learning, improving the state of the art in this area. Because these deep reinforcement learning agents utilize DNNs, they are also susceptible to this type of adversarial example. Deep reinforcement learning has now been applied to many areas such as network system control (Jay et al. (2019), Chu et al. (2020), Chinchali et al. (2018)), financial trading (Noonan, 2017), blockchain protocol security (Hou et al., 2019), grid operation and security (Duan et al. (2019), Huang et al. (2019)), cloud computing (Chen et al., 2018), robotics (Gu et al. (2017), Kalashnikov et al. (2018)), autonomous driving (Dosovitskiy et al., 2017), and medical treatment and diagnosis (Tseng et al. (2017), Popova et al. (2018), Thananjeyan et al. (2017), Daochang & Jiang (2018), Ghesu et al. (2017)). A particular scenario where adversarial perturbations might be of significant interest is a financial trading market where the DeepRL agent is trained on observations consisting of the order book. In such a setting it may be possible to compromise the whole trading system by perturbing only an extremely small subset of the observation entries. In particular, the $\ell_1$-norm bounded perturbations discussed in our paper have sparse solutions, and thus can be used as a basis for an attack in such a scenario. Moreover, the magnitude of the $\ell_1$-norm bounded perturbations produced by our attack is orders of magnitude smaller than in previous approaches, and thus our proposed perturbations result in a stealthier attack, more likely to evade automatic anomaly detection schemes.


Considering the wide spectrum of deployment of deep reinforcement learning algorithms, it is crucial to investigate the resilience of these algorithms before they are used in real-world application domains. Moreover, adversarial formulations are a first step towards understanding these algorithms and building generalizable, reliable, and robust deep reinforcement learning agents. Therefore, in this paper we study adversarial attack formulations for deep reinforcement learning agents and make the following contributions:

• We define the optimal myopic adversary, whose aim is to minimize the value of the action taken by the agent in each state, and formulate the optimization problem that this adversary seeks to solve.
• We introduce a differentiable approximation for the optimal myopic adversarial formulation.
• We compare the impact of our attack formulation to previous formulations in different games in the Atari environment.
• We show that the new formulation finds a better direction for the adversarial perturbation and increases the attack impact for bounded perturbations. (Conversely, our formulation decreases the magnitude of the perturbation required to efficiently achieve a significant impact.)

2. RELATED WORK AND BACKGROUND

2.1. ADVERSARIAL REINFORCEMENT LEARNING

Adversarial reinforcement learning is an active line of research directed towards discovering the weaknesses of deep reinforcement learning algorithms. Gleave et al. (2020) model the interaction between the agent and the adversary as a two-player Markov game and solve the reinforcement learning problem for the adversary via Proximal Policy Optimization, introduced by Schulman et al. (2017). They fix the victim agent's policy and only allow the adversary to take natural actions to disrupt the agent, instead of using $\ell_p$-norm bounded pixel perturbations. Pinto et al. (2017) model the adversary and the victim as a two-player zero-sum discounted Markov game and train the victim in the presence of the adversary to make the victim more robust. Mandlekar et al. (2017) use a gradient-based perturbation to make the agent more robust compared to random perturbations. Huang et al. (2017) and Kos & Song (2017) use the fast gradient sign method (FGSM) to show that deep reinforcement learning agents are vulnerable to adversarial perturbations. Pattanaik et al. (2018) use a gradient-based formulation to increase the robustness of deep reinforcement learning agents.

2.2. ADVERSARIAL ATTACK METHODS

Goodfellow et al. (2015) introduced the fast gradient method (FGM),

$$x^* = x + \epsilon \cdot \frac{\nabla_x J(x, y)}{\|\nabla_x J(x, y)\|_p},$$

for crafting adversarial examples for image classification by taking the gradient of the cost function $J(x, y)$ used to train the neural network with respect to the input. Here $x$ is the input and $y$ is the output label for image classification. As mentioned in the previous section, FGM was first adapted to the deep reinforcement learning setting by Huang et al. (2017). Subsequently, Pattanaik et al. (2018) introduced a variant of FGM in which a few random samples are taken in the gradient direction and the best is chosen. However, the main difference between the approach of Huang et al. (2017) and that of Pattanaik et al. (2018) is in the choice of the cost function $J$ used to determine the gradient direction. In the next section we outline the different cost functions used in these two formulations.
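As a concrete reference point, the following is a minimal PyTorch sketch of the $\ell_p$-normalized FGM step above, with the sign step as the $\ell_\infty$ special case. The model, loss function, and tensor shapes are placeholder assumptions, not the exact pipeline used in our experiments.

```python
import torch

def fgm_perturb(model, loss_fn, x, y, eps, p=2):
    """Minimal sketch of the l_p fast gradient method (FGM).

    Crafts x* = x + eps * grad / ||grad||_p, where the gradient of the
    training loss J(x, y) is taken with respect to the input x.
    `model` and `loss_fn` are placeholders for the victim network and
    its training loss.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    grad = x_adv.grad
    if p == float("inf"):
        step = grad.sign()                       # FGSM special case
    else:
        norm = grad.flatten().norm(p=p).clamp_min(1e-12)
        step = grad / norm                       # unit l_p-norm direction
    return (x_adv + eps * step).detach()
```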

2.3. ADVERSARIAL ATTACK FORMULATIONS

In a bounded attack formulation for deep reinforcement learning, the aim is to find a perturbed state $s_{adv}$ in a ball $D_{\epsilon,p}(s) = \{s_{adv} : \|s_{adv} - s\|_p \le \epsilon\}$ that minimizes the expected cumulative reward of the agent. It is important to note that the agent will always try to take the best action depending only on its perception of the state, independent of the unperturbed state. Therefore, in the perturbed state the agent will still choose the action $a^*(s_{adv}) = \arg\max_a Q(s_{adv}, a)$, which maximizes the state-action value function in state $s_{adv}$.

Finding the right direction for the adversarial perturbation in the deep reinforcement learning domain has been an active line of research. The first attack of this form was formulated by Huang et al. (2017) and concurrently by Kos & Song (2017), who minimize the probability of the best possible action $a^*(s) = \arg\max_a Q(s, a)$ in the given state,

$$\min_{s_{adv} \in D_{\epsilon,p}(s)} \pi(s_{adv}, a^*(s)). \qquad (3)$$

Note that $\pi(s, a)$ is the softmax policy of the agent given by

$$\pi_T(s, a) = \frac{e^{Q(s, a)/T}}{\sum_{a_k} e^{Q(s, a_k)/T}}, \qquad (4)$$

where $T$ is called the temperature constant. When the temperature constant is not relevant to the discussion we will drop the subscript and use the notation $\pi(s, a)$. It is important to note that $\pi(s, a)$ is not the actual policy used by the agent. Indeed, the DRL agent deterministically chooses the action $a^*$ maximizing $Q(s, a)$ (or equivalently $\pi(s, a)$). The softmax operation is only introduced in order to calculate the adversarial perturbation direction. Another natural formulation seeks the smallest perturbation that changes the agent's action,

$$\min \|s_{adv} - s\|_p \quad \text{subject to} \quad a^*(s) \neq a^*(s_{adv}).$$

Pattanaik et al. (2018) formulated yet another attack, which aims to maximize the probability of the worst possible action $a_w(s) = \arg\min_a Q(s, a)$ in the given state,

$$\max_{s_{adv} \in D_{\epsilon,p}(s)} \pi(s_{adv}, a_w(s)), \qquad (5)$$

and further showed that their attack formulation (5) is more effective than (3). Pattanaik et al. (2018) also introduced the notion of targeted attacks to the reinforcement learning domain. In their paper they take the cross-entropy loss between the optimal policy in the given state and their adversarial probability distribution and try to increase the probability of $a_w$. However, just trying to increase the probability of $a_w$ in the softmax policy, i.e. $\pi(s_{adv}, a_w)$, is not sufficient to target $a_w$ in the actual policy followed by the agent. In fact, the agent can end up in a state where

$$\pi(s_{adv}, a_w) > \pi(s, a_w) \quad \text{but} \quad a_w \neq \arg\max_a \pi(s_{adv}, a). \qquad (6)$$

Although $\pi(s_{adv}, a_w)$ has increased, the action $a_w$ will not be taken. Conversely, it might still be possible to find a perturbed state $s'_{adv}$ for which

$$\pi(s'_{adv}, a_w) < \pi(s_{adv}, a_w) \quad \text{but} \quad a_w = \arg\max_a \pi(s'_{adv}, a). \qquad (7)$$

Therefore, maximizing the probability of taking the worst possible action in the given state is not actually the right formulation to find the correct direction for the adversarial perturbation.
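To make the failure mode in (6) concrete, consider the following small numerical example with invented Q-values for a three-action agent: the perturbation increases the softmax probability of the worst action $a_w$, yet $a_w$ still does not become the $\arg\max$, so the attacked agent never actually takes it.

```python
import numpy as np

def softmax(q, T=1.0):
    z = np.exp((q - q.max()) / T)
    return z / z.sum()

# Hypothetical Q-values in the clean and perturbed states (3 actions).
q_clean = np.array([2.0, 1.0, 0.0])   # a_w is action 2 (lowest Q-value)
q_adv   = np.array([1.5, 1.2, 0.8])   # after an adversarial perturbation

pi_clean, pi_adv = softmax(q_clean), softmax(q_adv)
a_w = q_clean.argmin()

print(pi_adv[a_w] > pi_clean[a_w])    # True: prob. of a_w increased ...
print(pi_adv.argmax() == a_w)         # False: ... but a_w is still not taken
```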

3. OPTIMAL MYOPIC ADVERSARIAL FORMULATION

To address the problem described in Section 2.3 we define an optimal myopic adversary to be an adversary which aims to minimize the value of the action taken by the agent myopically in each state. The value of the action chosen by the agent in the unperturbed state is $Q(s, a^*(s))$, and the value (measured in the unperturbed state) of the action chosen by the agent under the influence of the adversarial observation is $Q(s, a^*(s_{adv}))$. The difference between these is the actual impact of the attack. Therefore, in each state $s$ the optimal myopic adversary must solve the following optimization problem:

$$\arg\max_{s_{adv} \in D_{\epsilon,p}(s)} \left[ Q(s, a^*(s)) - Q(s, a^*(s_{adv})) \right]. \qquad (8)$$

By (4), we may rewrite (8) in terms of the softmax policies $\pi(s, a)$:

$$\arg\max_{s_{adv} \in D_{\epsilon,p}(s)} \left[ \pi(s, a^*(s)) - \pi(s, a^*(s_{adv})) \right]. \qquad (9)$$

Since $\pi(s, a^*(s))$ does not depend on $s_{adv}$, (9) is equivalent to solving

$$\min_{s_{adv} \in D_{\epsilon,p}(s)} \pi(s, \arg\max_a \{\pi(s_{adv}, a)\}). \qquad (10)$$

3.1. EFFICIENT APPROXIMATION OF THE OPTIMAL MYOPIC ADVERSARIAL FORMULATION

Having an $\arg\max$ operator in the cost function is unpleasant, since it is non-differentiable. Instead, we can approximate the $\arg\max$ by decreasing the temperature $T$ of $\pi_T(s_{adv}, a)$:

$$\lim_{T_{adv} \to 0} \pi_{T_{adv}}(s_{adv}, a) = \mathbb{1}_{\arg\max_{a'} \{\pi(s_{adv}, a')\}}(a). \qquad (11)$$

The intuition for (11) is that, by (4), as we decrease the temperature, the value of $e^{Q(s_{adv}, a^*)/T_{adv}}$ for the action $a^*$ which maximizes $Q(s_{adv}, a)$ will dominate the other actions. Thus $\pi(s, \arg\max_a \{\pi(s_{adv}, a)\})$ can be approximated by

$$\pi(s, \arg\max_a \{\pi(s_{adv}, a)\}) = \sum_a \pi(s, a) \cdot \mathbb{1}_{\arg\max_{a'} \{\pi(s_{adv}, a')\}}(a) \qquad (12)$$

$$= \lim_{T_{adv} \to 0} \sum_a \pi(s, a) \cdot \pi_{T_{adv}}(s_{adv}, a). \qquad (13)$$

Therefore, our original optimization problem can be expressed as

$$\min_{s_{adv} \in D_{\epsilon,p}(s)} \pi(s, \arg\max_a \{\pi(s_{adv}, a)\}) = \min_{s_{adv} \in D_{\epsilon,p}(s)} \lim_{T_{adv} \to 0} \sum_a \pi(s, a) \cdot \pi_{T_{adv}}(s_{adv}, a). \qquad (14)$$

In practice we will not decrease $T_{adv}$ to 0, as this would be equal to applying the non-differentiable $\arg\max$ operation. Instead we replace the $\arg\max$ operation with the approximation $\pi_{T_{adv}}(s_{adv}, a)$ for a small value of $T_{adv}$. In general it is not guaranteed that the minimum of the cost function in (14) with this approximation will be close to the minimum of the original cost function with the $\arg\max$ operation. To see why, note that this approximation is equivalent to first switching the limit and minimum in (14) to obtain

$$\lim_{T_{adv} \to 0} \min_{s_{adv} \in D_{\epsilon,p}(s)} \sum_a \pi(s, a) \cdot \pi_{T_{adv}}(s_{adv}, a), \qquad (15)$$

and second replacing the limit with a small value of $T_{adv}$. There are two possible issues with this approach. First, due to the non-convexity of the minimand, exchanging the limit and minimum may not yield an equality between (14) and (15). Second, even if this exchange is legitimate, it is not clear how to choose a sufficiently small value of $T_{adv}$ in order to obtain a good approximation to (15). However, we show in our experiments that using this approximation in the cost function gives state-of-the-art results.

There is another caveat which applies to all myopic adversaries, including those in prior work. The Q-value is a good estimate of the discounted future rewards of the agent only assuming that the agent continues to take the action maximizing the Q-value in future states. Since myopic attacks are applied in each state, minimizing the Q-value may be a non-optimal attack strategy when the future is taken into account. There is also always the risk that the agent's Q-value is miscalibrated. However, our experiments show that our myopic formulation performs well despite these potential limitations.
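The following is a minimal sketch of how the relaxed objective in (14) could be minimized with projected gradient descent over the $\ell_2$ ball. The Q-network interface, optimizer, step size, and iteration count are illustrative assumptions rather than the exact settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def myopic_attack(q_net, s, eps, T=1.0, T_adv=0.01, steps=30, lr=0.1):
    """Sketch: minimize sum_a pi_T(s, a) * pi_{T_adv}(s_adv, a) over the
    l2 ball ||s_adv - s||_2 <= eps via projected gradient descent.
    `q_net` (batched observations in, Q-values out), `steps`, and `lr`
    are illustrative assumptions."""
    with torch.no_grad():
        pi_clean = F.softmax(q_net(s) / T, dim=-1)        # pi(s, .), fixed
    delta = torch.zeros_like(s, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        pi_adv = F.softmax(q_net(s + delta) / T_adv, dim=-1)  # pi_{T_adv}(s_adv, .)
        loss = (pi_clean * pi_adv).sum()                  # minimand in (14)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                             # project into the eps-ball
            norm = delta.norm(p=2)
            if norm > eps:
                delta.mul_(eps / norm)
    return (s + delta).detach()
```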

3.2. EXPERIMENTAL SETUP

In our experiments we averaged over 10 episodes for each Atari game (Bellemare et al., 2013) in Figure 1 from the OpenAI Gym environment (Brockman et al., 2016). Agents are trained with Double DQN (Wang et al., 2016). We compared the attack impact of our Myopic formulation with the previous formulations of Huang et al. (2017) and Pattanaik et al. (2018). The attack impact has been normalized by comparing an unattacked agent, which chooses the action corresponding to the maximum state-action value in each state, with an agent that chooses the action corresponding to the minimum state-action value in each state. Formally, let $R_{max}$ be the average return of the agent who always chooses the best action in a given state, let $R_{min}$ be the average return of the agent who always chooses the worst possible action in a given state, and let $R_a$ be the average return of the agent under attack. We define the impact $I$ as

$$I = \frac{R_{max} - R_a}{R_{max} - R_{min}}. \qquad (16)$$

This normalization was chosen because we observed that in Atari environments agents can still collect stochastic rewards even when choosing the worst possible action in each state until the game ends. See more details of the setup in Appendix A.2.
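For concreteness, the normalization in (16) is a one-line computation once the three average returns have been estimated:

```python
def impact(r_max: float, r_min: float, r_attacked: float) -> float:
    """Normalized attack impact I from Eq. (16).

    I = 0 means the attack had no effect; I = 1 means the attacked agent
    performs as badly as an agent forced to take the worst action in
    every state. Returns are assumed to be averaged over episodes.
    """
    return (r_max - r_attacked) / (r_max - r_min)
```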

3.3. ADVERSARIAL TEMPERATURE

Based on the discussion in Section 3, we expect that as $T_{adv}$ decreases, the function in (14) becomes a better approximation of the $\arg\max$ function, and thus the attack impact increases. However, below a certain threshold the function in (14) becomes too close to the non-differentiable $\arg\max$ function. Therefore, in practice, beyond this threshold the quality of the solutions found by gradient-based optimization decreases, and so the attack impact is lower. Indeed, this can be observed in Figure 2. In our experiments we chose $T_{adv}$ to maximize the impact by grid search.
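The grid search over $T_{adv}$ can be as simple as the following sketch; the candidate temperatures and the helper `evaluate_impact` (which would run attacked episodes and return the average impact (16)) are hypothetical.

```python
# Hypothetical grid search over the adversarial temperature T_adv.
# `evaluate_impact(env, q_net, T_adv)` is an assumed helper that runs
# attacked episodes and returns the average normalized impact (16);
# `env` and `q_net` are placeholders for the environment and Q-network.
candidate_temps = [10.0, 1.0, 0.1, 0.01, 0.001]

best_T, best_impact = None, -float("inf")
for T_adv in candidate_temps:
    avg_impact = evaluate_impact(env, q_net, T_adv)
    if avg_impact > best_impact:
        best_T, best_impact = T_adv, avg_impact
print(f"chosen T_adv = {best_T} (impact {best_impact:.3f})")
```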

3.4. IMPACT COMPARISON FOR p -NORM BOUNDED PERTURBATIONS

Table 1 shows the mean and standard deviation of the impact values under $\ell_1$- and $\ell_2$-norm bounds. The tables show that the proposed attack results in a higher mean impact for all games and under all norms, and almost always in a lower standard deviation. In particular, this is an indication that our myopic attack formulation achieves higher impact more consistently than the previous formulations. More results for the $\ell_\infty$-norm bound can be found in Section A.1 of the Appendix. Figure 3 shows the attack impact as a function of the perturbation bound $\epsilon$ for each formulation in three games. As $\epsilon$ decreases, our myopic formulation exhibits higher impact relative to the other formulations. Recall that Goodfellow et al. (2015) argue that small adversarial perturbations shift the input across approximately linear decision boundaries learned by neural networks. Therefore, achieving a higher impact with a smaller norm bound is evidence that our myopic formulation finds a better direction (i.e., one that points more directly at such a decision boundary) in which to search for the perturbation.

3.5. DISTRIBUTION ON ACTIONS TAKEN

In order to understand the superior performance of the proposed adversarial formulation it is worthwhile to analyze the distribution of the actions taken by the agents under attack. For each formulation, we recorded the empirical probability $p(s, a_k)$ that the attacked agent chooses the $k$-th ranked action $a_k$. The ranking of the actions is according to their value in the unperturbed state. Without perturbation, agents would always take their first-ranked action; any deviation from this is a consequence of the adversarial perturbations. We average these values over 10 episodes to obtain the average empirical probability $\mathbb{E}_{e\sim\rho}[p(s, a_k)]$ that the $k$-th action is taken. Here $\rho$ is the distribution over the episodes $e$ of the game induced by the stochastic nature of the Atari games. We plot the results in Figure 4. The legends in Figure 4 show the average empirical probabilities of taking the best action $a^*$. It can be seen that in general $\mathbb{E}_{e\sim\rho}[p(s, a^*)]$ is lower and $\mathbb{E}_{e\sim\rho}[p(s, a_w)]$ is higher for our Myopic formulation when compared to the other attack formulations. We observed that formulations which achieve a higher impact at the end of the game might still choose the best action more often than formulations with a lower attack impact, as shown in the legend of Figure 4 for the Pong game. In this case both Huang et al. (2017) and Pattanaik et al. (2018) cause the agent to choose the best action $a^*$ less frequently than our formulation, but still end up with lower impact. Indeed, Figure 4 shows that $\mathbb{E}_{e\sim\rho}[p(s, a_{2nd})]$ is much higher for both Huang et al. (2017) and Pattanaik et al. (2018) compared to our Myopic formulation. However, the key to achieving a greater impact is to cause the agent to choose the lowest-ranked actions more frequently. As can be seen from Figure 4, our Myopic formulation does this more successfully. More detailed results can be found in Appendix A.5.
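A small helper of the following form could be used to log these rank statistics; the ranking convention (1 = best action in the unperturbed state) follows the definition above.

```python
import numpy as np

def action_rank(q_clean, a_taken):
    """Rank (1 = best) of the taken action according to the clean-state
    Q-values; used to build the empirical distribution p(s, a_k)."""
    order = np.argsort(-np.asarray(q_clean))   # actions sorted best-first
    return int(np.where(order == a_taken)[0][0]) + 1
```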

3.6. STATE-ACTION VALUES OVER TIME

In this section we investigate the state-action values of the agents over time without an attack, under attack with the Pattanaik et al. (2018) formulation, and under attack with our myopic formulation. In Table 2 it is interesting to observe that under attack the mean of the state-action values over the episodes might be higher despite the average return being lower. One might think that an attack with greater impact would visit lower-valued states on average. However, in our experiments we have found that the gap between $\mathbb{E}_{e\sim\rho}[Q(s, a^*(s))]$ and $\mathbb{E}_{e\sim\rho}[Q(s, a^*(s_{adv}))]$ relative to the gap between $\mathbb{E}_{e\sim\rho}[Q(s, a^*(s))]$ and $\mathbb{E}_{e\sim\rho}[Q(s, a_w)]$ is a much more significant factor in determining the magnitude of the impact. Therefore, we define the quantity

$$Q_{loss} = \frac{\mathbb{E}_{e\sim\rho}[Q(s, a^*(s))] - \mathbb{E}_{e\sim\rho}[Q(s, a^*(s_{adv}))]}{\mathbb{E}_{e\sim\rho}[Q(s, a^*(s))] - \mathbb{E}_{e\sim\rho}[Q(s, a_w)]}. \qquad (17)$$

Table 2: Average Q-values for the best, worst, and adversarial actions, the loss in Q-values caused by the adversarial influence, and the impacts over the episodes.

Observe in Table 2 that the magnitudes of $\mathbb{E}_{e\sim\rho}[Q(s, a^*(s))]$, $\mathbb{E}_{e\sim\rho}[Q(s, a_w)]$, and $\mathbb{E}_{e\sim\rho}[Q(s, a^*(s_{adv}))]$ are not strictly correlated with the impacts for each formulation. However, $Q_{loss}$ is always higher for our Myopic attack compared to the previous formulations. Recall that our original optimization problem (8) was designed to maximize the gap between $Q(s, a^*(s))$ and $Q(s, a^*(s_{adv}))$ in each state. The results in Table 2 are evidence both that we achieve this original goal and that solving our initial optimization problem in each state leads to a lower average return at the end of the game. More results on this matter can be found in Appendix A.4. The mean Q-value per episode under attack increases in some games. We believe that this may be due to the fact that the Q-value in a given state is only an accurate representation of the expected rewards of an unattacked agent. When the agent is under attack, there might be states with high Q-value which are dangerous for the attacked agent (e.g. states where the attack causes the agent to immediately lose the game). More results can be found in Appendix A.3.
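Computing $Q_{loss}$ from per-step logs is a direct transcription of (17); the array names below are hypothetical.

```python
import numpy as np

def q_loss(q_best, q_adv, q_worst):
    """Q_loss from Eq. (17), computed from per-step logs.

    q_best:  Q(s, a*(s))      for every visited state s
    q_adv:   Q(s, a*(s_adv))  value (in s) of the action actually taken
    q_worst: Q(s, a_w)        value of the worst action in s
    All arrays are assumed to be concatenated over episodes, so the
    means approximate the expectations over the episode distribution rho.
    """
    q_best, q_adv, q_worst = map(np.asarray, (q_best, q_adv, q_worst))
    return (q_best.mean() - q_adv.mean()) / (q_best.mean() - q_worst.mean())
```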

4. FUTURE EXTENSIONS TO CONTINUOUS ACTION SETS

We note in this section that, at least from a mathematical point of view, it is possible to extend our formulation to continuous control tasks. Observe that the adversarial objective in (8) applies to continuous control problems as well. The difficulty described in (10) is not an issue in continuous control tasks. For instance, in Deep Deterministic Policy Gradient (DDPG) or Proximal Policy Optimization (PPO) the agent's policy is already approximated by the actor network $\mu_\theta(s)$; that is, $a^*(s_{adv}) = \mu_\theta(s_{adv})$. Unlike in the case of discrete action sets, this means that the solution to the problem in (8) can be approximated through gradient descent following

$$\nabla_{s_{adv}} Q(s, \mu_\theta(s_{adv})) = \left. \frac{\partial Q(s, a)}{\partial a} \right|_{a = \mu_\theta(s_{adv})} \cdot \nabla_{s_{adv}} \mu_\theta(s_{adv}),$$

where the gradient is taken with respect to $s_{adv}$. In a sense it is actually easier to construct adversarial examples in continuous control tasks, because the learning algorithm already produces a differentiable approximation $\mu_\theta(s)$ to the $\arg\max$ operation used in action selection. In this paper we focused on the derivation for the case of a discrete action set because the optimization problem in (10) is harder to solve.
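A minimal sketch of this continuous-control attack, assuming DDPG-style actor and critic networks, is given below; autograd applies the chain rule from the displayed equation automatically. The hyperparameters are illustrative.

```python
import torch

def myopic_attack_continuous(actor, critic, s, eps, steps=20, lr=0.05):
    """Sketch of the continuous-control extension: minimize
    Q(s, mu_theta(s_adv)) over ||s_adv - s||_2 <= eps.  `actor` and
    `critic` are assumed DDPG-style networks; step counts and learning
    rate are illustrative."""
    delta = torch.zeros_like(s, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        # Value (measured in the clean state s) of the action the agent
        # would take in the perturbed state s + delta.
        loss = critic(s, actor(s + delta)).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                  # project into the eps-ball
            norm = delta.norm(p=2)
            if norm > eps:
                delta.mul_(eps / norm)
    return (s + delta).detach()
```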

5. CONCLUSION

In this paper we studied formulations of adversarial attacks on deep reinforcement learning agents and defined the optimal myopic adversary, which incorporates the action taken in the adversarial state into the cost function. By introducing a differentiable approximation to the value of the action taken by the agent under the influence of the adversary, we find a direction for the adversarial perturbation which more effectively decreases the Q-value of the agent in the given state. In our experiments we demonstrated the efficacy of our formulation compared to previous formulations in the Atari environment for various games. Adversarial formulations are initial steps towards building resilient and reliable DeepRL agents, and we believe our adversarial formulation can help set a new baseline towards the robustification of DeepRL algorithms.

A APPENDIX

A.1 $\ell_\infty$-NORM BOUND

One observation from these results is that the myopic formulation performs worse under the $\ell_\infty$-norm bound than it does under the $\ell_1$- and $\ell_2$-norm bounds when $\epsilon = 10^{-8}$. A priori one might expect the performance of the myopic formulation to be best under the $\ell_\infty$-norm bound, because the unit $\ell_\infty$ ball contains the unit balls of the $\ell_1$ and $\ell_2$ norms. To investigate this further we examined the behaviour of the $\ell_2$-norm bound and the $\ell_\infty$-norm bound while varying $\epsilon$ in Figure 5. Observe that at very small scales the $\ell_2$-norm bounded perturbation has larger impact, but at larger scales the $\ell_\infty$-norm bounded perturbation generally performs better than the $\ell_2$-norm bounded perturbation. We link this phenomenon to the behaviour of the decision boundary at different scales. Recall that for the $\ell_2$-norm bound we use a perturbation of length $\epsilon$ in the gradient direction, whereas for the $\ell_\infty$-norm bound we use a perturbation given by $\epsilon$ times the sign of the gradient. The latter is the $\ell_\infty$-norm bounded perturbation which causes the maximum possible change for a linear function. Compared to previous formulations, decreasing the temperature in the softmax makes our objective function more nonlinear. Thus, at smaller scales it is crucial for the perturbation to be in the exact direction of the gradient, since the decision boundary behaves nonlinearly. However, at larger scales the decision boundary is approximately linear, which gives a better result under the $\ell_\infty$-norm bound.

A.2 RANDOMIZED ITERATIVE SEARCH AND IMPACTS

It can be seen from Figure 6 that the impact of the Pattanaik et al. (2018) formulation increases as $n$ increases, while our myopic formulation has a greater impact even when $n$ is equal to 1. In particular, this is again an indication that our cost function provides a better direction in which to search for adversarial perturbations compared to previous formulations. In Table 4 we set $n = 1$ and compare the impact of the three different formulations. Even in this more restrictive setting the performance of our formulation remains high, while the performance of Pattanaik et al. (2018) degrades significantly. In particular, this demonstrates that the gradient of our cost function is a better direction in which to find an adversarial perturbation.
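Our reading of this randomized search is sketched below: sample $n$ candidate step sizes along the gradient direction and keep the best. The uniform step-size sampling is an illustrative assumption rather than the exact scheme of Pattanaik et al. (2018).

```python
import torch

def randomized_search(objective, s, grad_dir, eps, n):
    """Sketch of the randomized iterative search: try n random step
    sizes along the (unit) gradient direction `grad_dir` and keep the
    candidate state that minimizes `objective` (an assumed callable
    returning the attack cost for a state)."""
    best_s, best_val = s, objective(s)
    for _ in range(n):
        step = eps * torch.rand(1).item()      # random step size in [0, eps]
        cand = s + step * grad_dir
        val = objective(cand)
        if val < best_val:
            best_s, best_val = cand, val
    return best_s
```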

A.3 GAMES AND AGENT BEHAVIOUR

In this section we share our observations on the behaviour of the trained agents under attack. In Figure 9 the agent performs well until it suddenly decides to stand still and wait for the enemy to arrive. Similarly, in Figure 7 the trained agent performs quite well until it decides to jump in front of the truck. Finally, in Figure 8 the trained agent forgets to recharge its oxygen even though it is earning many points from shooting the fish and saving the divers.




Figure 1: Games used in the experiments from the Atari Arcade Environment. Games from left to right: Roadrunner, Riverraid, Bankheist, Seaquest, Amidar, Beamrider, Pong and UpNDown.

Figure 2: Attack impact vs temperature constant for different games under the $\ell_2$-norm bound with $\epsilon = 10^{-8}$. Left: Riverraid. Middle: Roadrunner. Right: Seaquest.

Figure 3: Attack impact vs the logarithm (base 10) of the $\ell_2$-norm bound $\epsilon$. Left: Amidar. Middle: Pong. Right: UpNDown.

Figure 4: Top: Average empirical probabilities of ranked actions for three different formulations. Left: Bankheist. Middle: Pong. Right: UpNDown. Bottom: Expected probability of ranked actions for three different formulations. Left: Riverraid. Middle: Amidar. Right: BeamRider.

Figure 5: Left: Impact change with varying $\epsilon$ for $\ell_\infty$-norm bounded and $\ell_2$-norm bounded perturbations in the Myopic formulation for the Roadrunner game. Right: Impact change with varying $\epsilon$ for $\ell_\infty$-norm bounded and $\ell_2$-norm bounded perturbations in the Myopic formulation for the Bankheist game.

Figure 6: Attack impact vs the number of iterations of random search ($n$) for Pong.

Figure 7: Example game from the Atari Arcade Environment. The trained agent jumps in front of the car even though the agent is not in the same lane as the car in the RoadRunner game.

Figure 8: Example game from the Atari Arcade Environment. The trained agent forgets to recharge its oxygen even though its oxygen level is not yet critical in the Seaquest game.

Figure 9: Example game from the Atari Arcade Environment. The trained agent just waits without moving until the enemy reaches it in the Amidar game.

Table 1: Left: Attack impacts for three attack formulations with $\ell_2$-norm bound and $\epsilon = 10^{-8}$. Right: Attack impacts for three attack formulations with $\ell_1$-norm bound and $\epsilon = 10^{-8}$.

Table 3: Attack impacts for three attack formulations with $\ell_\infty$-norm bound and $\epsilon = 10^{-8}$.

Table 4: Attack impacts for three different attack formulations with $\ell_2$-norm bound, $\epsilon = 10^{-10}$, and $n = 1$.

A.4 Q-VALUES OVER TIME

In this section we plot the state-action values of the agents over time without an attack, under attack with the Pattanaik et al. (2018) formulation, and under attack with our myopic formulation. In Figure 12 it is interesting to observe that under attack the mean of the state-action values over the episodes might be higher despite the average return being lower.

A.5 RESULTS ON AVERAGE EMPIRICAL PROBABILITIES OF ACTIONS

In this section we provide a detailed table of the average empirical probabilities of the best action $a^*$ and the worst action $a_w$. It can be seen that in general $\mathbb{E}_{e\sim\rho}[p(s, a^*)]$ is lower and $\mathbb{E}_{e\sim\rho}[p(s, a_w)]$ is higher for our Myopic formulation when compared to the other attack formulations.

A.6 PIXEL SENSITIVITY

In this section we attempt to gain insight into which pixels in the visited states are most sensitive to perturbation. For each pixel $(i, j)$ we measure the drop in the state-action value when perturbing that pixel by a small amount. Formally, let $a_{sen} = \arg\max_a Q(s_{sensitivity}, a)$, where $s_{sensitivity}$ is equal to $s$ except that a small perturbation $\gamma$ has been added to the $(i, j)$-th pixel of $s$. Then we measure

$$Q(s, a^*(s)) - Q(s, a_{sen}). \qquad (19)$$

We plot the values from (19) in Figure 15. Note that any non-zero value (i.e. any lighter colored pixel in the plot) indicates that a $\gamma$ perturbation to the corresponding pixel will cause the agent to take an action different from the optimal one. Lighter colored pixels correspond to larger drops in Q-value.
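A brute-force sketch of the sensitivity map in (19) is given below; the network interface, frame shape, and the value of $\gamma$ are illustrative assumptions.

```python
import torch

def pixel_sensitivity(q_net, s, gamma=1e-3):
    """Sketch of the per-pixel sensitivity map in Eq. (19): for each
    pixel (i, j), add gamma to that pixel alone and record the drop
    Q(s, a*(s)) - Q(s, a_sen), both measured in the clean state s.
    Shapes and gamma are illustrative; a brute-force loop like this is
    slow for full Atari frames."""
    with torch.no_grad():
        q = q_net(s.unsqueeze(0)).squeeze(0)        # Q(s, .)
        a_star = q.argmax()
        H, W = s.shape[-2], s.shape[-1]
        sens = torch.zeros(H, W)
        for i in range(H):
            for j in range(W):
                s_pert = s.clone()
                s_pert[..., i, j] += gamma           # perturb only pixel (i, j)
                a_sen = q_net(s_pert.unsqueeze(0)).squeeze(0).argmax()
                sens[i, j] = q[a_star] - q[a_sen]    # drop in Q-value, Eq. (19)
    return sens
```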

