UNCOVERING DIRECTIONS OF INSTABILITY VIA QUADRATIC APPROXIMATION OF DEEP NEURAL LOSS IN REINFORCEMENT LEARNING

Abstract

Learning in MDPs with highly complex state representations is currently possible due to multiple advancements in reinforcement learning algorithm design. However, this increase in complexity, and in particular the increase in the dimensionality of observations, has come at the cost of non-robustness that can be exploited (i.e. by moving along worst-case directions in the observation space). To address this policy instability problem we propose a novel method to ascertain the presence of these non-robust directions via quadratic approximation of the deep neural policy loss. Our method provides a theoretical basis for the fundamental cutoff between stable observations and non-robust observations. Furthermore, our technique is computationally efficient and does not depend on the method used to produce the worst-case directions. We conduct extensive experiments in the Arcade Learning Environment with several different non-robust alteration techniques. Most significantly, we demonstrate the effectiveness of our approach even in the setting where alterations are explicitly optimized to circumvent our proposed method.

1. INTRODUCTION

Since Mnih et al. (2015) showed that deep neural networks can be used to parameterize reinforcement learning policies, there has been substantial growth in new algorithms and applications for deep reinforcement learning. While this progress has resulted in a variety of new capabilities for reinforcement learning agents, it has at the same time introduced new challenges due to the non-robustness of deep neural networks to imperceptible adversarial perturbations, originally discovered by Szegedy et al. (2014). In particular, Huang et al. (2017); Kos & Song (2017) showed that the non-robustness of neural networks to adversarial perturbations extends to the deep reinforcement learning domain, where applications such as autonomous driving, automatic financial trading or healthcare decision making cannot tolerate such a vulnerability. There has been a significant amount of effort in trying to make deep neural networks robust to adversarial perturbations (Goodfellow et al., 2015; Madry et al., 2018; Pinto et al., 2017). However, in this arms race it has been shown that deep reinforcement learning policies learn adversarial features independent of their worst-case (i.e. adversarial) training techniques (Korkmaz, 2022). More intriguingly, a line of work has focused on showing the inevitability of adversarial examples and the intrinsic difficulty of learning robust models (Dohmatob, 2019; Mahloujifar et al., 2019; Gourdeau et al., 2019). Given that it may not be possible to make DNNs completely robust to adversarial examples, a natural objective is to instead attempt to detect the presence of adversarial manipulations. In this paper we propose a novel identification method for adversarial directions in the deep neural policy manifold. Our study is the first one that focuses on detection of non-robust directions in the deep reinforcement learning neural loss landscape.
Our approach relies on differences in the curvature of the neural policy in the neighborhood of an adversarial direction when compared to a baseline state observation. At a high level our method is based on the intuition that while baseline states have neighborhoods determined by an optimization procedure intended to learn a policy that works well across all states, each non-robust direction is the output of some local optimization in the neighborhood of one particular state. Our proposed method is computationally efficient, requiring only one gradient computation and two policy evaluations, requires no training that depends on the method used to compute the adversarial direction, and is theoretically well-founded. Hence, our study focuses on identification of non-robust directions and makes the following contributions:

• Our paper is the first to focus on identification of adversarial directions in the deep reinforcement learning policy manifold.

• We propose a novel method, Identification of Non-Robust Directions (INRD), to detect adversarial state manipulations based on the local curvature of the neural network policy. INRD is independent of the method used to generate the adversarial direction, computationally efficient, and theoretically justified.

Q-learning: Deep Q-learning methods learn a state-action value function via the temporal-difference update Q(s_t, a_t) ← Q(s_t, a_t) + α[R_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)].

Adversarial Examples: Goodfellow et al. (2015) introduced the fast gradient method (FGM) for producing adversarial examples for image classification. The method is based on taking the gradient of the training cost function J(x, y) with respect to the input image and bounding the perturbation by ε, i.e. x_adv = x + ε · sign(∇_x J(x, y)), where x is the input image and y is the output label. Later, an iterative version of FGM called I-FGM was proposed by Kurakin et al. (2016). This is also often referred to as Projected Gradient Descent (PGD), as in Madry et al. (2018), where the I-FGM update is

x_adv^{N+1} = clip_ε(x_adv^N + α sign(∇_x J(x_adv^N, y))),

with x_adv^0 = x, and clip_ε denoting projection onto the ε-ball around x. Dong et al.
(2018) further modified I-FGM by introducing a momentum term in the update, yielding a method called MI-FGSM. Korkmaz (2020) later proposed a Nesterov-momentum based approach for the deep reinforcement learning domain. The DeepFool method of Moosavi-Dezfooli et al. (2016) is an alternative approach to those based on FGSM. DeepFool performs iterative projection onto the closest separating hyperplane between classes. Another alternative, proposed by Carlini & Wagner (2017a), is based on finding a minimal perturbation that achieves a different target class label. The approach is based on minimizing the loss

min_{s_adv ∈ S} c · J(s_adv) + ‖s_adv − s‖₂²   (2)

where s is the clean input, s_adv is the adversarial example, and J(s) is a modified version of the cost function used to train the network. Chen et al. (2018) proposed a variant of the Carlini & Wagner (2017a) formulation that adds an ℓ₁-regularization term to produce sparser adversarial examples,

min_{s_adv ∈ S} c · J(s_adv) + λ₁‖s_adv − s‖₁ + λ₂‖s_adv − s‖₂²   (3)

Our method, which identifies non-robust directions in the deep neural policy manifold, is the first to investigate detection of adversarial manipulations in deep reinforcement learning. Our identification method does not require modifying the training of the neural network, does not require any training tailored to the adversarial method used, and uses only two neural network function evaluations and one gradient computation.

Adversarial Deep Reinforcement Learning: The adversarial problem was initially investigated concurrently by Huang et al. (2017) and Kos & Song (2017). These works show that perturbations computed via FGSM result in extreme performance loss for the learnt policy. Lin et al. (2017) and Sun et al. (2020) focused on timing strategies in the adversarial formulation and utilized the Carlini & Wagner (2017a) method to produce the perturbations.
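The I-FGM/PGD update described above can be sketched in a few lines. The following is a minimal illustration: the quadratic loss, its gradient, and all hyperparameter values here are illustrative assumptions standing in for the training cost J(x, y) of a real network, not the settings used in the cited works.

```python
import numpy as np

def i_fgm(x, grad_fn, eps=0.3, alpha=0.05, n_iter=10):
    """Iterative FGM (PGD on an l_inf ball): repeatedly step along the sign
    of the input gradient, then clip back to the eps-ball around x."""
    x_adv = x.copy()
    for _ in range(n_iter):
        g = grad_fn(x_adv)
        x_adv = x_adv + alpha * np.sign(g)
        # projection step: stay within the eps-ball of the clean input
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

# Toy stand-in for the cost J(x, y): gradient of 0.5*||x - target||^2,
# so ascending the loss moves x away from `target`.
target = np.array([1.0, -1.0, 0.5])
grad_fn = lambda x: x - target

x0 = np.zeros(3)
x_adv = i_fgm(x0, grad_fn, eps=0.3, alpha=0.05, n_iter=10)
```

With these settings the iterate saturates the ℓ∞ constraint in each coordinate, which is the typical behavior of sign-based updates.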
While there is a reasonable body of work focused on finding efficient and effective adversarial perturbations, a substantial body of work has focused on building agents robust to these perturbations. Mandlekar et al. (2017) proposed to utilize FGSM perturbations during training time to obtain more robust agents. Pinto et al. (2017) modeled the adversarial interaction as a zero-sum game and proposed a joint training strategy to increase robustness in the continuous action space setting. Recently, Gleave et al. (2020) considered an adversary who is allowed to take natural actions in a given environment instead of ℓ_p-norm bounded perturbations, and modeled the adversarial relationship as a zero-sum Markov game. However, concerns have recently been raised about the robustness of adversarial training methods by Korkmaz (2022). In that work the authors show that state-of-the-art adversarial training techniques end up learning similar non-robust features. Thus, with rising concerns about the robustness of recently proposed adversarial training techniques, our work aims to address the adversarial problem from a different perspective by detecting adversarial directions.

3. IDENTIFICATION OF NON-ROBUST DIRECTIONS (INRD)

In this section we give the high-level motivation for and formal description of our identification method. We begin by introducing necessary notation and definitions. We denote an original clean state by s and an adversarially perturbed state by s_adv.

Definition 3.1. The cost of a state, J(s, τ), is defined as the cross-entropy loss between the policy π(a|s) of the agent and a target distribution on actions τ(a):

J(s, τ) = −Σ_a τ(a) log(π(a|s))

Definition 3.2. The argmax policy, π*(a|s), is defined as the distribution which puts all probability mass on the highest weight action of π(a|s):

π*(a|s) = 1(a = argmax_{a′} π(a′|s))   (5)

We use the following notation for the gradient and Hessian with respect to the state s:

∇_s J(s₀, τ₀) = ∇_s J(s, τ)|_{s=s₀, τ=τ₀},   ∇²_s J(s₀, τ₀) = ∇²_s J(s, τ)|_{s=s₀, τ=τ₀}

3.1. FIRST-ORDER IDENTIFICATION OF NON-ROBUST DIRECTIONS (FO-INRD)

As a naive baseline we first describe an identification method based on estimating how much the cost function J(s, τ) varies under small perturbations. Prior work of Roth et al. (2019); Hu et al. (2019) has shown that the behavior of deep neural network classifiers under small random perturbations differs at clean versus adversarial examples. Therefore, a natural baseline detection method is: given an input state s₀, sample a small random perturbation η ∼ N(0, σ²I) and compute

K(s₀, η) = J(s₀ + η, π*(•|s₀)) − J(s₀, π*(•|s₀)).   (6)

The first-order identification method proceeds by first estimating the mean and the variance of K over a clean run of the agent in the environment. Next a threshold t is chosen so that a desired false positive rate (FPR) is achieved (i.e. some desired fraction of the states in the clean run lie more than t standard deviations from the mean). Finally, at test time a state encountered by the agent is classified as adversarial if its statistic is at least t standard deviations away from the mean; otherwise the state is classified as clean.
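The FO-INRD calibration and test can be sketched as follows. This is not the experimental setup of the paper: the cost function J here is a toy differentiable stand-in for J(s, π*(•|s)) of a trained policy, and the noise scale and threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fo_statistic(J, s0, sigma=1e-3):
    """K(s0, eta) = J(s0 + eta) - J(s0) for a small Gaussian perturbation eta."""
    eta = sigma * rng.standard_normal(s0.shape)
    return J(s0 + eta) - J(s0)

def calibrate(J, clean_states, n_std=3.0, sigma=1e-3):
    """Estimate the mean/std of K over a clean run, then return a detector
    flagging states whose statistic lies more than n_std deviations away."""
    ks = np.array([fo_statistic(J, s, sigma) for s in clean_states])
    mu, sd = ks.mean(), ks.std()
    return lambda s: abs(fo_statistic(J, s, sigma) - mu) > n_std * sd

# Toy smooth cost standing in for the policy's cross-entropy loss.
J = lambda s: float(np.sum(np.sin(s)))
clean = [rng.standard_normal(4) for _ in range(200)]
is_adversarial = calibrate(J, clean)
flag = is_adversarial(rng.standard_normal(4))
```

By construction, a state drawn from the same distribution as the calibration run is only rarely flagged, matching the intended false positive rate.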
As a first attempt, the first-order method can be naturally interpreted as a finite-difference approximation to the magnitude of the gradient at s₀. If we assume that the first-order Taylor approximation of J is accurate in a ball of radius r > 0 centered at s₀, then J(s₀ + η, π*(•|s₀)) ≈ J(s₀, π*(•|s₀)) + ∇_s J(s₀, π*(•|s₀)) • η. Therefore,

K(s₀, η) ≈ ∇_s J(s₀, π*(•|s₀)) • η.   (7)

Thus, for η ∼ N(0, σ²I) the test statistic K(s₀, η) is approximately distributed as a Gaussian with mean 0 and variance σ²‖∇_s J(s₀, π*(•|s₀))‖₂². Under this interpretation one would expect the test statistics for clean and adversarial states to have the same mean with potentially different standard deviations, possibly making it hard to distinguish clean from adversarial. However, this is not what we observe empirically; in fact the first-order method does a decent job of detecting adversarial examples. The method works because the mean of K(s, η) for clean examples s is reasonably well separated from the mean of K(s_adv, η) for adversarial examples s_adv. The empirical performance of the first-order method thus indicates that the assumption of accuracy for the first-order Taylor approximation of J does not hold in practice. This leads naturally to the consideration of information from the second derivatives (i.e. the local quadratic approximation) of J in order to identify non-robust directions.
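The finite-difference reading of K can be checked numerically. In this sketch J is a smooth toy function (an assumption for illustration, not a trained policy), and for a small perturbation the statistic matches the gradient inner product up to the quadratic remainder:

```python
import numpy as np

# For a smooth J and a small eta, the first-order Taylor expansion predicts
# K(s0, eta) = J(s0 + eta) - J(s0) ≈ grad_J(s0) . eta, with a remainder of
# order ||eta||^2 coming from the curvature term.
J = lambda s: float(np.sum(np.sin(s)))   # toy cost, illustrative only
grad_J = lambda s: np.cos(s)             # its exact gradient

rng = np.random.default_rng(1)
s0 = rng.standard_normal(6)
eta = 1e-4 * rng.standard_normal(6)      # small perturbation

K = J(s0 + eta) - J(s0)
```

Here the remainder is on the order of ‖η‖² ≈ 1e-8, far below the first-order term, which is on the order of ‖η‖ ≈ 1e-4.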

3.2. SECOND-ORDER IDENTIFICATION OF NON-ROBUST DIRECTIONS (SO-INRD)

The second-order identification method is based on measuring the local curvature of the cost function J(s, τ). The method exploits the fact that J(s, τ) will have larger negative curvature at a clean example than at an adversarial example. In particular, the high-level theoretical motivation for this approach is that adversarial examples are the output of a local optimization procedure which attempts to find a nearby perturbed state s_adv with a low value of the cost J(s_adv, τ) for some τ ≠ π*(•|s). A direction of large negative curvature for J(s_adv, τ) indicates that a very small perturbation along this direction could dramatically decrease the cost function. Therefore, such points are likely to be unstable for local optimization procedures attempting to minimize the cost function in a small neighborhood. On the other hand, the curvature of J(s, τ) at a clean state s is determined by the overall algorithm used to train the deep reinforcement learning agent. This algorithm optimizes the parameters of the neural network policy while considering all states visited during training, and thus is not likely to be heavily overfit to the state s. In particular, we expect larger negative curvature at s than at an adversarial example s_adv. We make the connection between negative curvature and instability for local optimization formal in Section 3.3. Based on the above discussion, a natural choice of metric for distinguishing adversarial from clean examples is the most negative eigenvalue of the Hessian, λ_min(∇²_s J(s₀, π*(•|s₀))). While this is the most natural measurement of curvature, it requires computing the eigenvalues of a matrix whose number of entries is quadratic in the input dimension. Since the input is very high-dimensional, and we would like to perform this computation in real-time for every state visited by the agent, computing the value λ_min directly is computationally prohibitive.
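For context, λ_min can in principle be estimated matrix-free with power iteration on Hessian-vector products; this is a standard technique, not part of INRD, and even it requires many gradient evaluations per state, which motivates the cheaper single-direction approximation that follows. A hedged sketch, using a toy quadratic cost whose gradient we assume is available:

```python
import numpy as np

def hvp(grad_J, s, v, h=1e-5):
    """Finite-difference Hessian-vector product: Hv ≈ (∇J(s+hv) − ∇J(s−hv))/(2h)."""
    return (grad_J(s + h * v) - grad_J(s - h * v)) / (2 * h)

def min_eigenvalue(grad_J, s, mu=10.0, n_iter=200, seed=0):
    """Power iteration on the shifted matrix (mu*I − H); mu is assumed to
    upper-bound the largest eigenvalue of H, so the top eigenvector of the
    shifted matrix is the bottom eigenvector of H."""
    v = np.random.default_rng(seed).standard_normal(s.shape)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = mu * v - hvp(grad_J, s, v)
        v = w / np.linalg.norm(w)
    return float(v @ hvp(grad_J, s, v))   # Rayleigh quotient ≈ lambda_min

# Toy quadratic cost: the Hessian is diag(lam), so lambda_min is known (-4).
lam = np.array([-4.0, 1.0, 2.0])
grad_J = lambda s: lam * s
lam_min = min_eigenvalue(grad_J, np.zeros(3))
```

Each power-iteration step costs two gradient evaluations, so hundreds of gradient calls may be needed per state, versus the single gradient used by SO-INRD.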
Instead we approximate this value by measuring the curvature along a direction which is correlated with the negative eigenvectors of the Hessian. Given this direction, the value that we measure is the accuracy of the first-order Taylor approximation of the cost J(s, τ) at the given state. We denote the first-order Taylor approximation at the state s₀ in direction η by

Ĵ(s₀, η) = J(s₀, π*(•|s₀)) + ∇_s J(s₀, π*(•|s₀)) • η.

The metric we use to detect adversarial examples is the finite-difference approximation

L(s₀, η) = J(s₀ + η, π*(•|s₀)) − Ĵ(s₀, η).   (8)

To see formally that Equation (8) gives an approximation of the most negative eigenvalue of the Hessian, we assume that the cost function J(s, τ) is well approximated by its quadratic Taylor approximation at the point s₀, i.e.

J(s₀ + η, π*(•|s₀)) ≈ J(s₀, π*(•|s₀)) + ∇_s J(s₀, π*(•|s₀)) • η + ½ ηᵀ∇²_s J(s₀, π*(•|s₀))η   (9)

for a small enough perturbation η. Substituting the above formula into Equation (8) yields

L(s₀, η) ≈ ½ ηᵀ∇²_s J(s₀, π*(•|s₀))η.   (10)

The above quadratic form is minimized when η lies in the same direction as the most negative eigenvector of the Hessian, in which case

L(s₀, η) ≈ ½ λ_min(∇²_s J(s₀, π*(•|s₀)))‖η‖₂².   (11)

We choose the sign of the gradient direction for measuring the accuracy of the first-order Taylor approximation. To motivate this choice note that −∇_s J(s, τ) is locally the direction of steepest decrease for the cost function.

Algorithm 1: Second-Order Identification of Non-Robust Directions (SO-INRD)
Input: The clean run mean L̄ and variance σ²(L), identification threshold t > 0, parameter ε > 0.
for states sᵢ visited by the deep reinforcement learning policy do
    ηᵢ = ε sign(∇_s J(sᵢ, π*(•|sᵢ))) / ‖∇_s J(sᵢ, π*(•|sᵢ))‖₂
    Ĵ(sᵢ, ηᵢ) = J(sᵢ, π*(•|sᵢ)) + ∇_s J(sᵢ, π*(•|sᵢ)) • ηᵢ
    L(sᵢ, ηᵢ) = J(sᵢ + ηᵢ, π*(•|sᵢ)) − Ĵ(sᵢ, ηᵢ)
    if |L(sᵢ, ηᵢ) − L̄| > t • σ(L) then
        Label state sᵢ as a non-robust direction
    end if
end for
If the gradient direction additionally has negative curvature of large magnitude, then small perturbations along this direction will result in an even more rapid decrease in the cost function value than predicted by the first-order gradient approximation. Note that this can be true even if the gradient itself has small magnitude, as long as the negative curvature is large enough. Thus, by the discussion at the beginning of Section 3.2, adversarial examples are likely to have negative curvature of relatively smaller magnitude in the gradient direction than clean examples. Formally, for ε > 0 we set

η(s₀) = ε sign(∇_s J(s₀, π*(•|s₀))) / ‖∇_s J(s₀, π*(•|s₀))‖₂.   (12)

To calibrate the detection method we record the mean L̄ = E_s[L(s, η(s))] and variance σ²(L) = Var_s[L(s, η(s))] of our proposed test statistic over states from a clean run of the policy in the MDP. Then at test time we set a threshold t > 0, and for each state sᵢ visited by the agent test whether

|L(sᵢ, η(sᵢ)) − L̄| > t σ(L).   (13)

If the threshold of t standard deviations is exceeded we classify the state sᵢ as adversarial, and otherwise classify it as clean. Pseudo-code for the second-order method is given in Algorithm 1.
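The SO-INRD statistic can be sketched directly. Here J and its gradient are a toy quadratic stand-in (an assumption for illustration, not a trained policy); for a quadratic, the gap between J at the perturbed state and its first-order Taylor prediction equals ½ηᵀ∇²J η exactly.

```python
import numpy as np

def so_statistic(J, grad_J, s0, eps=1e-2):
    """L(s0, eta): gap between J at the perturbed state and its first-order
    Taylor prediction, measured along the sign-gradient direction eta."""
    g = grad_J(s0)
    eta = eps * np.sign(g) / np.linalg.norm(g)   # direction from Eq. (12)
    taylor = J(s0) + g @ eta                     # first-order prediction
    return float(J(s0 + eta) - taylor)           # ≈ 0.5 * eta^T Hessian eta

# Toy quadratic stand-in: the Hessian is diag(lam), so L(s0, eta) is exact.
lam = np.array([-4.0, 1.0, 2.0])
J = lambda s: float(0.5 * s @ (lam * s) + s.sum())
grad_J = lambda s: lam * s + 1.0

s0 = np.array([0.3, -0.2, 0.1])
L = so_statistic(J, grad_J, s0)
```

Because the dominant eigenvalue here is negative, L comes out negative, mirroring the concentrated negative values observed for clean states in Figure 1.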

3.3. NEGATIVE CURVATURE AND INSTABILITY OF LOCAL OPTIMIZATION

In this section we formalize the connection between negative curvature and instability for the local optimization procedures that motivated our definition of L(s, η). Given a state s₀ and a target distribution τ ≠ π*(•|s₀), we assume the adversary is trying to find a state s_adv minimizing J(s_adv, τ) among all states close to s₀ by some metric. Formally, let D_{s₀}(s) ≥ 0 be a convex function of s that should be thought of as measuring distance to s₀. One standard choice for the distance function is D_{s₀}(s) = ‖s − s₀‖_p^p. We model the adversary as minimizing the loss

f(s) = J(s, τ) + D_{s₀}(s).   (14)

In particular, we make the following assumption:

Assumption 3.1. The adversarial state s_adv is a local minimum of f(s).

Of course this assumption is violated in practice, since different methods used to compute adversarial directions optimize objective functions other than f, and do not necessarily always converge to a local minimum. Nevertheless the assumption allows us to make formal qualitative predictions about the behavior of the second-order identification method that correspond well with empirical results across a broad variety of methods for generating adversarial directions. We now state our main result lower bounding the curvature of J(s_adv, τ).

Proposition 3.2. For c > 0 assume that the maximum eigenvalue of the Hessian ∇²_s D_{s₀}(s) is bounded by c. If s* is a local minimum of f(s), then λ_min(∇²_s J(s*, τ)) ≥ −c.

Proof. Let v be the unit eigenvector of ∇²_s J(s*, τ) corresponding to the minimum eigenvalue. At a local minimum s* of f(s) the Hessian ∇²_s f(s*) must be positive semi-definite. Therefore,

0 ≤ vᵀ∇²_s f(s*)v = vᵀ∇²_s J(s*, τ)v + vᵀ∇²_s D_{s₀}(s*)v ≤ λ_min(∇²_s J(s*, τ)) + c.

Rearranging the above inequality completes the proof.

The second-order conditions for a local minimum of f thus imply a lower bound on the smallest eigenvalue of ∇²_s J(s*, τ).
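Proposition 3.2 can be checked numerically on toy quadratics. In this sketch we take D_{s₀}(s) = (c/2)‖s − s₀‖₂², whose Hessian is cI (so the bound c is tight), and control the Hessian eigenvalues of a quadratic stand-in for J directly; all values are illustrative assumptions.

```python
import numpy as np

def check(lam_J, c):
    """For f = J + D with Hessians diag(lam_J) and c*I: report whether the
    second-order condition for a local minimum of f holds (Hessian of f is
    PSD), and whether the proposition's bound lambda_min(J) >= -c holds.
    For this diagonal construction the two conditions coincide exactly."""
    H_f = np.diag(lam_J) + c * np.eye(len(lam_J))
    has_local_min = bool(np.all(np.linalg.eigvalsh(H_f) >= 0))
    bound_holds = bool(lam_J.min() >= -c)
    return has_local_min, bound_holds

# lambda_min(J) = -1 >= -c with c = 2: f is convex, a local minimum exists.
ok = check(np.array([-1.0, 3.0]), 2.0)
# lambda_min(J) = -5 < -c = -2: the second-order condition must fail.
bad = check(np.array([-5.0, 3.0]), 2.0)
```

The second case illustrates the contrapositive used by SO-INRD: curvature more negative than −c rules out the point being a local minimum of the adversary's objective.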
Thus, by Assumption 3.1, we obtain a lower bound on λ_min(∇²_s J(s_adv, τ)). The assumption that the maximum eigenvalue of the Hessian ∇²_s D_{s₀}(s) is bounded by c is satisfied, for example, when D_{s₀}(s) = (c/2)‖s − s₀‖₂². In contrast, the local curvature of the cost function J(s, τ) at a clean example is determined by an optimization procedure that updates the weights θ of the neural network policy rather than the states s. If we write J_θ(s, τ) to make the dependence on the weights explicit, then the second-order conditions for optimizing the original neural network apply to the Hessian with respect to the weights, ∇²_θ J_θ(s, τ), rather than the Hessian with respect to the states, ∇²_s J_θ(s, τ). Additionally, first-order optimality conditions help to justify the choice of ∇_s J(s, τ) as a good direction in which to check for negative curvature. Indeed, by the first-order conditions, at a local optimum s* of f(s) we have

0 = ∇_s f(s*) = ∇_s J(s*, τ) + ∇_s D_{s₀}(s*).   (15)

Therefore ∇_s J(s*, τ) = −∇_s D_{s₀}(s*). So, assuming the adversary finds a local optimum, ∇_s J(s*, τ) points in a direction that decreases the distance function D_{s₀}(s). Thus sufficiently negative curvature in the direction of ∇_s J(s, τ) implies not only that s is not a local minimum of f, but also that the distance function D_{s₀}(s) can be decreased by moving along this direction of negative curvature. To summarize, we have shown that second-order optimality conditions arising from computing an adversarial example give rise to lower bounds on the smallest eigenvalue of the Hessian λ_min(∇²_s J(s, τ)). The function L(s, η) used to identify adversarial directions in SO-INRD is a finite-difference approximation to ½ηᵀ∇²_s J(s, τ)η ≥ ½λ_min(∇²_s J(s, τ))‖η‖₂². Therefore the results of this section imply that L(s, η) should be larger at adversarial examples than at clean examples.

4. EXPERIMENTS

In our experiments agents are trained with DDQN (Wang et al., 2016) in the Arcade Learning Environment (ALE) (Bellemare et al., 2013) from OpenAI (Brockman et al., 2016). For a baseline we compare FO-INRD and SO-INRD with the OAO detection method proposed by Roth et al. (2019), which is based on estimating the average change in the odds ratio between classes under noise. In Figure 1 we plot the value of L(s) over states for various games without an adversarial attack and under adversarial attack with the following methods: Carlini & Wagner, Elastic Net, Nesterov Momentum, DeepFool, MI-FGSM and FGSM. We show in the legends of Figure 1 the true positive rate (TPR) values for the different attacks when the false positive rate (FPR) is equal to 0.01. The value of L(s) for clean states is generally well-concentrated and negative. On the other hand, for states computed by the different adversarial attack methods L(s) is clearly larger, matching the predictions of Proposition 3.2. The fact that L(s) is consistently larger at adversarial examples across a wide variety of adversarial perturbation methods indicates that Assumption 3.1 qualitatively captures the behavior of these methods. In particular, the FGSM-based methods and DeepFool do not explicitly optimize an objective function of the form f(s) = J(s, τ) + D_{s₀}(s) as in Assumption 3.1. However, by enforcing a constraint on the distance of the adversarial example from the original clean example, these methods implicitly solve an optimization problem of the form given in (14), and thus exhibit the qualitative behavior predicted by Proposition 3.2. In particular, the high-level idea of the detection-aware attack methodology of Tramer et al. (2020) is to keep the attack as simple as possible while still accurately targeting the detection method. More specifically, the methodology is based on designing an attack via gradient descent on some loss function.
Further, minimizing the loss function should correspond closely to subverting the full detection method while still being possible to optimize. Critically, the authors highlight that while the choice of loss function to optimize can be difficult, the use of "feature matching" (Gowal et al., 2019) can circumvent most current detection methods. We now describe how we applied the methodology discussed above to design detection-aware adversaries for SO-INRD. As a first attempt, we tested the "feature matching" approach that was used to break the OAO detection method in Tramer et al. (2020). This approach attempts to match the logits of the adversarial example to those of a clean example from a different class in order to evade detection. To optimize the loss for this method we used up to 1000 PGD iterations, and we searched over step sizes ranging from 0.01 down to 10^-6. We find that this method succeeds in reducing the TPR of the OAO method to nearly zero. It is also able to slightly reduce the TPR of our SO-INRD method. Since the statistic L(s) involves taking the sign of the gradient, we used the Backwards Pass Differentiable Approximation (BPDA) introduced in Carlini & Wagner (2017b) to compute the gradients. However, we also tried designing an adversary with a fully differentiable cost function by using a perturbation in the gradient direction (without the sign). We found that this fully differentiable adversary performed significantly worse than the one based on BPDA. We conducted an exhaustive grid search over all the parameters of this optimization method: the learning rate, the number of iterations, the confidence parameter κ, and the objective function parameter λ. For Carlini & Wagner we used up to 30000 iterations to find adversarial examples to bypass the detection methods. We searched the confidence parameter from 0 to 50, the learning rate from 0.001 to 0.1, and λ from 0.001 to 10. In our grid search over these hyperparameters we found that there is a trade-off between the attack success rate and the detectability of the perturbations. In other words, if we optimize the perturbation to be undetectable, the success rate of the perturbation (i.e.
the rate at which the perturbation actually makes the agent choose a non-optimal action) decreases. Therefore, when finalizing the hyperparameters for the SO-INRD detection-aware adversary we restricted our search to settings where the decrease in the success rate of the attack was at most 10%. Since FO-INRD is based on sampling a random perturbation, we use another approach introduced by Carlini & Wagner (2017b): minimizing the expectation of the original loss function averaged over the randomness used in the detection method. In particular, we estimate the expectation by computing the empirical mean of the loss over 50 samples from the same noise source. As in the case of SO-INRD, we grid search over hyperparameters to achieve as low a TPR as possible while losing at most 10% in the success rate of the attack. Table 3 shows the TPR in the adversary-aware setting with the best hyperparameters found for each method. The fact that SO-INRD still performs quite well in the adversary-aware setting indicates that there is a fundamental trade-off between computing an adversarial example and minimizing L(s). This trade-off makes sense in light of Proposition 3.2, which shows that searching for an adversarial example in a small neighborhood will tend to increase L(s).
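The expectation-over-noise attack on FO-INRD can be sketched as follows. Only the structure mirrors the description above: the loss, its gradient, the step size, and the optimization target are toy assumptions, with the sample count of 50 taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_loss_grad(loss_grad, s, sigma=1e-3, n_samples=50):
    """Monte-Carlo estimate of the gradient of E_eta[loss(s + eta)] over the
    detector's Gaussian noise, using the empirical mean over n_samples draws."""
    grads = [loss_grad(s + sigma * rng.standard_normal(s.shape))
             for _ in range(n_samples)]
    return np.mean(grads, axis=0)

# Toy attack loss; descending the averaged gradient optimizes the loss in
# expectation over the randomness used by the detection method.
loss_grad = lambda s: 2.0 * s            # gradient of ||s||^2
s = np.ones(4)
for _ in range(100):
    s = s - 0.05 * expected_loss_grad(loss_grad, s)
```

Averaging the gradient over the detector's noise distribution is what lets the attack target the randomized statistic rather than a single noise realization.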

6. CONCLUSION

In this paper we introduce INRD, a novel algorithm and the first method for identification of adversarial directions in the deep neural policy manifold. Our method is theoretically motivated by the fact that local optimization objectives corresponding to the construction of adversarial directions lead naturally to lower bounds on the curvature of the cost function J(s, τ). We have further shown empirically that the curvature of J(s, τ) is significantly larger at adversarial states than at clean observations, leading to a highly effective method, SO-INRD, for detecting adversarial directions in deep reinforcement learning. We additionally demonstrate that SO-INRD remains effective in the adversary-aware setting, and connect this fact to our original theoretical motivation. We believe that due to its strong empirical performance and solid theoretical motivation, SO-INRD can be an important step towards producing robust deep reinforcement learning policies.




Figure 1: L(s) for our proposed method SO-INRD over visited states, with corresponding TPR values for the following attack methods: FGSM, MI-FGSM, Nesterov Momentum, DeepFool, Carlini & Wagner, Elastic Net Method. TPR values are shown in the upper right box of the figure when FPR is equal to 0.01.

Figure 2: ROC curves of FO-INRD, SO-INRD and the OAO method for the following attack methods: FGSM, MI-FGSM, Nesterov Momentum, DeepFool, Carlini & Wagner, Elastic Net Method in RoadRunner. TPR values are shown in the lower right box of the figure when FPR is equal to 0.01.

Figure 3: ROC curves of FO-INRD, SO-INRD and the OAO method for the following attack methods: FGSM, MI-FGSM, Nesterov Momentum, DeepFool, Carlini & Wagner, Elastic Net Method in Robotank. TPR values are reported in the lower right box of the figure when FPR is equal to 0.01.

• We conduct experiments in various MDPs from the Arcade Learning Environment that demonstrate the effectiveness of INRD in identifying adversarial directions computed via several state-of-the-art adversarial attack methods.

• Most importantly, we demonstrate that INRD remains effective even against multiple methods for generating non-robust directions specifically designed to evade INRD.



In Table 1 we show TPR values for FO-INRD, SO-INRD, and the OAO method under the FGSM, MI-FGSM, Nesterov Momentum, DeepFool, Carlini & Wagner, and Elastic-Net attacks when FPR is equal to 0.01. For all of the attack methods in all of the environments SO-INRD is able to detect the perturbations with large TPR.

TPR for feature matching for SO-INRD and the OAO method when FPR = 0.01.

SO-INRD outperforms the other detection methods in all cases except for Nesterov Momentum in Alien and Seaquest, where FO-INRD has TPR 0.997 and 0.980 while SO-INRD has 0.996 and 0.952. We also observe that while the perturbations computed by FGSM, MI-FGSM, and Nesterov Momentum can generally be detected with large TPR values by all of the detection methods, the perturbations computed by Carlini & Wagner and the Elastic-Net method are more difficult to detect. Despite this difficulty, SO-INRD achieves TPR values ranging from 0.713 to 0.988 for Carlini & Wagner, and from 0.687 to 0.943 for Elastic-Net, when FPR is equal to 0.01. In Figure 2 and Figure 3 we show ROC curves for each detection method under the FGSM, MI-FGSM, Nesterov Momentum, DeepFool, Carlini & Wagner and Elastic-Net methods for RoadRunner and Robotank respectively. In Robotank the OAO method outperforms FO-INRD and even approaches the TPR of SO-INRD at high FPR under FGSM, MI-FGSM, Nesterov Momentum and DeepFool. However, for the Carlini & Wagner and Elastic-Net attacks, SO-INRD has a much higher TPR across a wide range of FPR levels.

5. COMPUTING ADVERSARIAL DIRECTIONS SPECIFICALLY TO EVADE INRD

Recently, Tramer et al. (2020) introduced a comprehensive methodology for tailoring the optimization procedure used to produce adversarial examples in order to overcome detection and defense methods.

TPR values of INRD in the presence of an identification-aware adversary when FPR = 0.01.

Recall that L(s) is consistently larger at adversarial examples than at clean examples. Thus the above optimization problem attempts to find adversarial examples with values of L(s) as small as possible. Since the function L(s) involves taking the sign of the gradient, we use the Backwards Pass Differentiable Approximation (BPDA) as introduced in Carlini & Wagner (2017b).

