THE ADVERSARIAL REGULATION OF THE TEMPORAL DIFFERENCE LOSS COSTS MORE THAN EXPECTED

Anonymous

Abstract

Deep reinforcement learning research has reached significant performance levels for sequential decision making in MDPs with highly complex observations and state dynamics with the aid of deep neural networks. However, this aid came with a cost inherent to deep neural networks, which are highly sensitive to imperceptible, specifically crafted non-robust directions. To alleviate these sensitivities, several studies proposed techniques that explicitly regulate the temporal difference loss for the worst-case sensitivity. In our study, we show that these worst-case regularization techniques come with a cost that intriguingly causes inconsistencies and overestimations in the state-action value functions. Furthermore, our results demonstrate that vanilla trained deep reinforcement learning policies have more accurate and consistent estimates for the state-action values. We believe our results reveal foundational intrinsic properties of the adversarial training techniques and demonstrate the need to rethink the approach to robustness in deep reinforcement learning.

1. INTRODUCTION

Advancements in deep neural networks have recently proliferated, expanding the domains in which deep neural networks are used, including image classification (Krizhevsky et al., 2012), natural language processing (Sutskever et al., 2014), speech recognition (Hannun et al., 2014) and self-learning systems via exploration. In particular, deep reinforcement learning has become an emerging field with the introduction of deep neural networks as function approximators (Mnih et al., 2015). Hence, deep neural policies have been deployed in many different domains from pharmaceuticals to self-driving cars (Daochang & Jiang, 2018; Huan-Hsin et al., 2017; Noonan, 2017). As the advancements in deep neural networks continued, a line of research focused on their vulnerability to specifically crafted perturbations computed via the cost function used to train the neural network (Szegedy et al., 2014; Goodfellow et al., 2015; Madry et al., 2018; Kurakin et al., 2016; Dong et al., 2018). While some research focused on producing optimal $\ell_p$-norm bounded perturbations that cause the most possible damage to deep neural network models, an extensive amount of work focused on making the networks robust to such perturbations (Madry et al., 2018; Carmon et al., 2019; Raghunathan et al., 2020). The vulnerability to such particularly optimized adversarial directions was inherited by deep neural policies as well (Huang et al., 2017; Kos & Song, 2017; Korkmaz, 2022). Thus, robustness to such perturbations in deep reinforcement learning became a concern for the machine learning community, and several studies proposed various methods to increase robustness (Pinto et al., 2017; Gleave et al., 2020). In this paper we therefore focus on adversarially trained deep neural policies and the state-action value function learned by these training methods in the presence of an adversary.
In more detail, in this paper we seek answers to the following questions:

(i) How accurate is the state-action value function in estimating the values for state-action pairs in MDPs with high dimensional state representations?
(ii) Does adversarial training affect the estimates of the state-action value function?
(iii) What are the effects of training with worst-case distributional shift on the state-action value function representation for the optimal actions?
(iv) Are there any fundamental trade-offs intrinsic to explicit worst-case regularization in deep neural policy training?

To answer these questions we focus on adversarial training and robustness in deep neural policies and make the following contributions:

• We conduct an investigation on the state-action values learnt by the state-of-the-art adversarially trained deep neural policies and vanilla trained deep neural policies.
• We provide theoretically motivated justification for how adversarial training might change the state-action value function.
• We perform several experiments in Atari games with large state spaces from the Arcade Learning Environment (ALE). With our systematic analysis we show that vanilla trained deep neural policies have a more accurate representation of the sub-optimal actions compared to the state-of-the-art adversarially trained deep neural policies.
• Furthermore, we show the inconsistencies in the action ranking in the state-of-the-art adversarially trained deep neural policies. These results demonstrate the loss of information in the state-action value function as a novel fundamental trade-off intrinsic to adversarial training.
• More importantly, we demonstrate that state-of-the-art adversarially trained deep neural policies learn overestimated state-action value functions.
• Finally, we explain how our results call into question the hypothesis initially proposed by Bellemare et al. (2016) relating the action gap and overestimation.

2. BACKGROUND AND PRELIMINARIES

Preliminaries: In deep reinforcement learning the goal is to learn a policy for taking actions in a Markov Decision Process (MDP) that maximizes the discounted expected cumulative reward. An MDP is represented by a tuple $M = (S, A, P, r, \rho_0, \gamma)$ where $S$ is a set of continuous states, $A$ is a discrete set of actions, $P$ is a transition probability distribution on $S \times A \times S$, $r : S \times A \to \mathbb{R}$ is a reward function, $\rho_0$ is the initial state distribution, and $\gamma$ is the discount factor. The goal in reinforcement learning is to learn a policy $\pi : S \to \mathcal{P}(A)$ which maps states to probability distributions on actions in order to maximize the expected cumulative reward $R = \mathbb{E} \sum_{t=0}^{T-1} \gamma^t r(s_t, a_t)$ where $a_t \sim \pi(s_t)$. In Q-learning (Watkins, 1989) the goal is to learn the optimal state-action value function

$$Q^*(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{a' \in A} Q^*(s', a').$$

Thus, the optimal policy is determined by choosing the action $a^*(s) = \arg\max_a Q(s, a)$ in state $s$.

Adversarial Crafting and Training: Szegedy et al. (2014) observed that imperceptible perturbations could change the decision of a deep neural network and proposed a box constrained optimization method to produce such perturbations. Goodfellow et al. (2015) suggested a faster method to produce such perturbations based on the linearization of the cost function used in training the network. Kurakin et al. (2016) proposed the iterative version of the fast gradient sign method of Goodfellow et al. (2015) inside an $\epsilon$-ball,

$$x_{adv}^{N+1} = \mathrm{clip}_\epsilon\big(x_{adv}^{N} + \alpha \, \mathrm{sign}(\nabla_x J(x_{adv}^{N}, y))\big) \tag{1}$$

in which $J(x, y)$ represents the cost function used to train the deep neural network, $x$ represents the input, and $y$ represents the output labels. While several other methods have been proposed (e.g. Korkmaz (2020)) using a momentum-based extension of the iterative fast gradient sign method,

$$v_{t+1} = \mu \cdot v_t + \frac{\nabla_{s_{adv}} J(s_{adv}^{t} + \mu \cdot v_t, a)}{\|\nabla_{s_{adv}} J(s_{adv}^{t} + \mu \cdot v_t, a)\|_1} \tag{2}$$

$$s_{adv}^{t+1} = s_{adv}^{t} + \alpha \cdot \frac{v_{t+1}}{\|v_{t+1}\|_2} \tag{3}$$

adversarial training has mostly been conducted with perturbations computed by projected gradient descent (PGD) proposed by Madry et al. (2018) (i.e. Equation 1).
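As an illustration of Equation 1, the following is a minimal sketch of the iterative fast gradient sign update. It assumes a toy logistic cost whose input gradient is available in closed form; the weights `w`, the helper names `loss`, `loss_grad_x` and `iterative_fgsm`, and the values of `eps`, `alpha` and `steps` are hypothetical choices for illustration and are not taken from the cited works.

```python
import numpy as np

def loss(x, y, w):
    """Toy logistic cost J(x, y) = log(1 + exp(-y * <w, x>)) (assumed for illustration)."""
    return np.log1p(np.exp(-y * np.dot(w, x)))

def loss_grad_x(x, y, w):
    """Gradient of the toy cost with respect to the input x."""
    margin = y * np.dot(w, x)
    return -y * w / (1.0 + np.exp(margin))

def iterative_fgsm(x, y, w, eps=0.1, alpha=0.02, steps=10):
    """Iterate x_adv <- clip_eps(x_adv + alpha * sign(grad_x J(x_adv, y)))."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(loss_grad_x(x_adv, y, w))
        # Project back into the l_inf ball of radius eps around the clean input.
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

w = np.array([1.0, -2.0, 0.5])   # hypothetical model weights
x = np.array([0.3, -0.1, 0.8])   # clean input
y = 1.0                          # label in {-1, +1}
x_adv = iterative_fgsm(x, y, w)
```

In this sketch the perturbed input stays within the $\epsilon$-ball of the clean input while the cost increases, which is the behaviour the clipping step in Equation 1 is meant to enforce.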

Adversaries and Training in Deep Neural Policies:

The initial investigation on the resilience of deep neural policies was conducted concurrently by Kos & Song (2017) and Huang et al. (2017), based on the fast gradient sign method proposed by Goodfellow et al. (2015). Korkmaz (2022) showed that deep reinforcement learning policies learn shared adversarial features across MDPs. While several studies focused on improving optimization techniques to compute optimal perturbations, a line of research focused on making deep neural policies resilient to these perturbations. Mandlekar et al. (2017) proposed including these perturbations at training time to increase resilience for robotic setups. Pinto et al. (2017) proposed to model the dynamics between the adversary and the deep neural policy as a zero-sum game where the goal of the adversary is to minimize expected cumulative rewards of the deep neural policy. Gleave et al. (2020)

3. ADVERSARIAL TRAINING AND THE STATE-ACTION VALUE FUNCTION

In this paper we aim to answer the following questions:

• How does training with explicit worst-case regularization affect the estimates of the optimal state-action values in MDPs with high dimensional state representations?
• What is the accuracy of the state-action value function representation for the non-optimal actions in deep neural policies?
• Does state-of-the-art adversarial training affect the state-action value estimates?
• Are there any intrinsic trade-offs tied to adversarial deep neural policy training?

While the goal in Q-learning is to learn the state-action value function $Q(s, a)$ that maximizes expected discounted cumulative rewards, in deep Q-learning an additional concern arises from susceptibility towards adversarial perturbations due to the nonlinear function approximator used in learning the Q-function. Ideally, one might hope that adversarial training would reduce the vulnerability of the Q-function to adversarial perturbations while preserving the Q-values of the non-perturbed states as much as possible. The theoretically motivated adversarial training techniques achieve certified defense against adversarial perturbations inside the $\epsilon$-ball $D_\epsilon(s) = \{\hat{s} : \|\hat{s} - s\|_\infty \le \epsilon\}$. However, we show that this approach induces significant changes in the Q-function so that the Q-function loses its accuracy for the non-perturbed states. In particular, adversarial training causes deep neural policies to learn overestimated state-action values, and the Q-values for non-optimal actions are reduced in accuracy to the point where their relative ranking changes. In the remainder of this section we give theoretical motivation for these empirical results. In particular we demonstrate that in the setting of linear function approximation, adversarial training can potentially lead to overestimation of the Q-values of the optimal actions, and reordering of the ranking of non-optimal actions.
The basic approach of adversarial training techniques is to add a regularizer to the standard Q-learning update. The regularizer is designed to penalize Q-functions for which a perturbed state $\hat{s} \in D_\epsilon(s)$ can change the identity of the highest Q-value action. For the baseline adversarial training technique we will theoretically analyze the effects of this regularizer.

Definition 3.1 (Huan et al. (2020)). For a state $s$ let $a^*(s) = \arg\max_a Q(s, a)$. The regularizer is given by

$$R(\theta) = \sum_s \max_{\hat{s} \in D_\epsilon(s)} \max_{a \neq a^*(s)} Q_\theta(\hat{s}, a) - Q_\theta(\hat{s}, a^*(s)).$$

The adversarial training algorithm proceeds by adding $R(\theta)$ to the standard temporal difference loss used in DQN,

$$L(\theta) = L_H\big(r + \gamma \max_{a'} Q^{target}(s', a') - Q_\theta(s, a)\big) + R(\theta),$$

and minimizing via stochastic gradient descent.

We now describe the construction of an MDP $M$ with linear function approximation where the use of the regularizer causes overestimation and reordering of suboptimal actions. There are two states parametrized by feature vectors $s_1, s_2 \in \mathbb{R}^n$, and there are three possible actions $\{a_i\}_{i=1}^{3}$ in each state. Taking any of the three actions in state $s_1$ leads to a transition to state $s_2$ and vice versa. Let $1 > \gamma > 0$ be the discount factor, and let $\delta > \eta > 0$ be small constants with $\gamma > \delta$. The rewards for each action are as follows: $r(s_1, a_1) = 1 - \gamma$, $r(s_1, a_2) = \eta - \gamma$, $r(s_1, a_3) = \delta - \gamma$, $r(s_2, a_1) = \eta - \gamma$, $r(s_2, a_2) = 1 - \gamma$, and $r(s_2, a_3) = \delta - \gamma$. Clearly, the optimal policy is to always take action $a_1$ in state $s_1$, and action $a_2$ in state $s_2$, as these are the only actions giving positive reward. Thus the optimal state-action values are given by:

$$Q^*(s_1, a_1) = Q^*(s_2, a_2) = \sum_{t=0}^{\infty} (1 - \gamma)\gamma^t = 1,$$
$$Q^*(s_1, a_2) = Q^*(s_2, a_1) = \eta - \gamma + \gamma \sum_{t=0}^{\infty} (1 - \gamma)\gamma^t = \eta,$$
$$Q^*(s_1, a_3) = Q^*(s_2, a_3) = \delta - \gamma + \gamma \sum_{t=0}^{\infty} (1 - \gamma)\gamma^t = \delta.$$

Let the Q-function be linearly parametrized by $\theta = (\theta_1, \theta_2, \theta_3)$ so that $Q_\theta(s, a_i) = \langle \theta_i, s \rangle$.
Finally, let $z_i$ for $i \in \{1, 2, 3\}$ be three orthonormal vectors, and let the state feature vectors satisfy:

1. $s_1 = z_1 + \delta z_3 + \eta z_2$ and
2. $s_2 = z_2 + \delta z_3 + \eta z_1$.

Then it follows that the optimal Q-function is parametrized by $\theta^* = (\theta^*_1, \theta^*_2, \theta^*_3)$ where $\theta^*_i = z_i$, i.e. $Q_{\theta^*}(s, a) = Q^*(s, a)$ for all $s$ and $a$. Thus, according to the function $Q_{\theta^*}(s, a)$, for $s_1$ the best action is $a_1$, for $s_2$ the best action is $a_2$, and in all states the second-best action is $a_3$. Next we identify the optimal perturbations used in the computation of the regularizer $R(\theta^*)$ for this setting.

Proposition 3.1. In the MDP $M$ suppose that $\epsilon < \frac{\delta - \eta}{2}$.

1. For $s = s_1$: $s + \frac{\epsilon}{\sqrt{2}}(\theta^*_3 - \theta^*_1) = \arg\max_{\hat{s} \in D_\epsilon(s)} \max_{a \neq a^*(s)} Q_{\theta^*}(\hat{s}, a) - Q_{\theta^*}(\hat{s}, a^*(s))$.
2. For $s = s_2$: $s + \frac{\epsilon}{\sqrt{2}}(\theta^*_3 - \theta^*_2) = \arg\max_{\hat{s} \in D_\epsilon(s)} \max_{a \neq a^*(s)} Q_{\theta^*}(\hat{s}, a) - Q_{\theta^*}(\hat{s}, a^*(s))$.

Proof. We will prove item 1, and item 2 will follow from an identical argument with the roles of $\theta^*_1$ and $\theta^*_2$ swapped. Let $s = s_1$. Any $\hat{s} \in D_\epsilon(s)$ can be written as $s + \epsilon v$ where $v$ is a unit vector. Thus,

$$\langle \theta^*_3, \hat{s} \rangle = \langle \theta^*_3, s \rangle + \epsilon \langle \theta^*_3, v \rangle > \langle \theta^*_3, s \rangle - \epsilon = \delta - \epsilon.$$

Similarly we have $\langle \theta^*_2, \hat{s} \rangle < \langle \theta^*_2, s \rangle + \epsilon = \eta + \epsilon$. Since $\epsilon < \frac{\delta - \eta}{2}$, we conclude that $\langle \theta^*_3, \hat{s} \rangle > \langle \theta^*_2, \hat{s} \rangle$ for all $\hat{s} \in D_\epsilon(s)$. Therefore, in state $s$ the action maximizing $\max_{a \neq a^*(s)} Q_{\theta^*}(\hat{s}, a) - Q_{\theta^*}(\hat{s}, a^*(s))$ will always be $a_3$. This implies that

$$\arg\max_{\hat{s} \in D_\epsilon(s)} \max_{a \neq a^*(s)} Q_{\theta^*}(\hat{s}, a) - Q_{\theta^*}(\hat{s}, a^*(s)) = \arg\max_{\hat{s} \in D_\epsilon(s)} \langle \theta^*_3, \hat{s} \rangle - \langle \theta^*_1, \hat{s} \rangle.$$

This is the maximum in a ball of radius $\epsilon$ around $s$ of the linear function $\langle \theta^*_3 - \theta^*_1, \hat{s} \rangle$. Therefore the maximum is achieved by $\hat{s} = s + \frac{\epsilon}{\sqrt{2}}(\theta^*_3 - \theta^*_1)$ as desired.

In words, the optimal direction to perturb the state $s_1$ in order to have $a^*(\hat{s}) \neq a^*(s)$ is toward $\theta^*_3 - \theta^*_1$. Similarly for the state $s_2$, the optimal perturbation is toward $\theta^*_3 - \theta^*_2$.
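The construction and Proposition 3.1 can be checked numerically. The sketch below is illustrative only: it assumes the concrete values $\gamma = 0.9$, $\delta = 0.2$, $\eta = 0.1$, $\epsilon = 0.04$, takes the standard basis of $\mathbb{R}^3$ as the orthonormal vectors $z_i$, and, following the proof, treats the perturbation set as a Euclidean ball of radius $\epsilon$. All variable and helper names are hypothetical.

```python
import numpy as np

gamma, delta, eta = 0.9, 0.2, 0.1          # assumed: gamma > delta > eta > 0
eps = 0.04                                 # assumed: eps < (delta - eta) / 2
z = np.eye(3)                              # orthonormal z_1, z_2, z_3
s1 = z[0] + delta * z[2] + eta * z[1]      # feature vector of state s_1
s2 = z[1] + delta * z[2] + eta * z[0]      # feature vector of state s_2
states = np.stack([s1, s2])

# rewards[state, action]; every action moves the agent to the other state.
rewards = np.array([
    [1 - gamma, eta - gamma, delta - gamma],
    [eta - gamma, 1 - gamma, delta - gamma],
])

# Value iteration: Q(s, a) = r(s, a) + gamma * max_a' Q(s_next, a').
q_star = np.zeros((2, 3))
for _ in range(2000):
    v = q_star.max(axis=1)
    q_star = rewards + gamma * v[::-1][:, None]

theta_star = z                             # theta*_i = z_i
q_linear = states @ theta_star.T           # Q_theta*(s, a_i) = <theta*_i, s>

def objective(s_hat, best):
    """max over a != a*(s) of Q(s_hat, a), minus Q(s_hat, a*(s))."""
    q = theta_star @ s_hat
    return np.max(np.delete(q, best)) - q[best]

# Closed-form maximizer from Proposition 3.1 for s = s_1.
s_hat_opt = s1 + eps / np.sqrt(2) * (theta_star[2] - theta_star[0])

# Compare against random points on the sphere of radius eps around s_1.
rng = np.random.default_rng(0)
directions = rng.normal(size=(10000, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
best_sampled = max(objective(s1 + eps * d, 0) for d in directions)
```

Under these assumptions, value iteration recovers the stated optimal values $(1, \eta, \delta)$ in state $s_1$, the linear parametrization reproduces them, and no sampled perturbation exceeds the objective attained at the closed-form perturbation.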
Next we use this fact to show that in order to decrease the regularizer it is sufficient to simply increase the magnitude of $\theta_1$ and $\theta_2$, and decrease the magnitude of $\theta_3$.

Proposition 3.2. In the MDP $M$ let $\lambda > 0$ and suppose that $\epsilon < \frac{(1-\lambda)\delta - (1+\lambda)\eta}{2}$. Let $\theta = (\theta_1, \theta_2, \theta_3)$ be given by $\theta_1 = (1 + \lambda)\theta^*_1$, $\theta_2 = (1 + \lambda)\theta^*_2$ and $\theta_3 = (1 - \lambda)\theta^*_3$. Then $R(\theta) < R(\theta^*)$.

Proof. By an identical argument to that in Proposition 3.1 we have that $a_3$ is always the action maximizing $\max_{a \neq a^*(s)} Q_\theta(\hat{s}, a) - Q_\theta(\hat{s}, a^*(s))$ whenever $\epsilon < \frac{(1-\lambda)\delta - (1+\lambda)\eta}{2}$. This condition is satisfied by assumption. Therefore, we conclude that for $s = s_1$, the optimal $\hat{s} \in D_\epsilon(s)$ for the scaled parameters $\theta$ is given by $\hat{s} = s + \frac{\epsilon}{\sqrt{2(1+\lambda^2)}}(\theta_3 - \theta_1)$. Therefore, the contribution to the sum defining $R(\theta)$ from state $s_1$ is given by

$$\langle \theta_3 - \theta_1, \hat{s} \rangle = \langle \theta_3 - \theta_1, s \rangle + \epsilon\sqrt{2(1 + \lambda^2)} = -(1 + \lambda) + (1 - \lambda)\delta + \epsilon\sqrt{2(1 + \lambda^2)} \tag{6}$$

where the last step uses the fact that $s = \theta^*_1 + \delta\theta^*_3 + \eta\theta^*_2$ and that the vectors $\theta^*_i$ are orthonormal. Next, using the fact that $\sqrt{1 + \lambda^2} < 1 + \lambda$ for all $\lambda > 0$, we conclude

$$\langle \theta_3 - \theta_1, \hat{s} \rangle < -(1 + \lambda) + (1 - \lambda)\delta + \epsilon\sqrt{2} + \lambda\epsilon\sqrt{2} < -(1 + \lambda) + \delta + \epsilon\sqrt{2}. \tag{7}$$

The final inequality follows from the fact that $\epsilon < \frac{\delta}{2}$, so $\lambda\epsilon\sqrt{2} - \lambda\delta < 0$. For state $s_2$ an identical proof (with $\theta_1$ replaced by $\theta_2$) yields the same value for its contribution to the sum. By Proposition 3.1, the contribution of each state to the sum defining $R(\theta^*)$ is

$$\Big\langle \theta^*_3 - \theta^*_1, \; s + \frac{\epsilon}{\sqrt{2}}(\theta^*_3 - \theta^*_1) \Big\rangle = -1 + \delta + \epsilon\sqrt{2}. \tag{8}$$

Clearly the contribution of each state in (7) is strictly less than that in (8). Therefore $R(\theta) < R(\theta^*)$.

Theorem 3.3. If a linear state-action value function approximator $Q_\theta(s, a)$ is used, then the regularizer $R(\theta)$ can lead to overestimation of the value of the optimal action, and re-ordering of the values of the suboptimal actions.

Proof. Consider the MDP $M$ with linear function approximation constructed above.
Increasing the magnitude of $\theta^*_1$ and $\theta^*_2$ by a factor of $1 + \lambda$ leads to overestimation of the Q-value of the best action in both state $s_1$ and $s_2$ by the same factor. Additionally, decreasing the magnitude of $\theta^*_3$ can lead to a change in the ranking of the suboptimal actions. Indeed if $\frac{1+\lambda}{1-\lambda} > \frac{\delta}{\eta}$ then $a_3$ will become the third ranked action in both states. Therefore, Proposition 3.2 proves that changing $\theta$ to decrease the regularizer $R(\theta)$ can lead to both overestimation of the first ranked action, and re-ordering of the ranking of the suboptimal actions.

While we showed how this can potentially happen in the case of linear function approximation, we will see that this is a general phenomenon which occurs with neural-network approximation of the Q-function in adversarially trained agents. It is important to note that the issues we identify are a result of the fundamental differences between deep neural policies and classification tasks where adversarial training has previously been applied. In particular, the fact that the state-action value function $Q(s, a)$ has a meaning (i.e. measuring expected cumulative rewards) with regard to the MDP beyond simply labelling the optimal action correctly is the root cause of the effects that we observe. In other words, simply penalizing the state-action value function for assigning the wrong "label" to an adversarial example can have unintended, potentially detrimental consequences for learning an accurate state-action value function.
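Proposition 3.2 and the ranking argument in the proof of Theorem 3.3 can be illustrated with a small numerical sketch. The constants $\delta = 0.2$, $\eta = 0.1$, $\epsilon = 0.01$, $\lambda = 0.2$ are assumed for illustration and satisfy the condition of Proposition 3.2; as in the proofs, the perturbation set is treated as a Euclidean ball, so the inner maximum of a linear function has the closed form $\langle u, s \rangle + \epsilon \|u\|_2$. The second value `lam_big = 0.5` is used only to illustrate the ranking condition $\frac{1+\lambda}{1-\lambda} > \frac{\delta}{\eta}$ and lies outside the range covered by Proposition 3.2. All names are hypothetical.

```python
import numpy as np

delta, eta, eps, lam = 0.2, 0.1, 0.01, 0.2     # assumed illustrative constants
z = np.eye(3)                                  # orthonormal z_1, z_2, z_3
states = [z[0] + delta * z[2] + eta * z[1],    # s_1, optimal action index 0
          z[1] + delta * z[2] + eta * z[0]]    # s_2, optimal action index 1
optimal = [0, 1]

def regularizer(theta):
    """R(theta) over an l2 ball of radius eps, using the closed form
    max_{|v| <= 1} <u, s + eps * v> = <u, s> + eps * |u| for each competing action."""
    total = 0.0
    for s, a_star in zip(states, optimal):
        total += max(
            (theta[a] - theta[a_star]) @ s
            + eps * np.linalg.norm(theta[a] - theta[a_star])
            for a in range(3) if a != a_star
        )
    return total

def scaled(lmbda):
    """Parameters from Proposition 3.2: scale theta_1, theta_2 up and theta_3 down."""
    return np.stack([(1 + lmbda) * z[0], (1 + lmbda) * z[1], (1 - lmbda) * z[2]])

theta_star = z.copy()
r_star = regularizer(theta_star)
r_scaled = regularizer(scaled(lam))

# Q-values in state s_1 under the optimal and the scaled parameters.
q_opt = theta_star @ states[0]
q_scaled = scaled(lam) @ states[0]

# A larger scaling with (1 + lam_big) / (1 - lam_big) = 3 > delta / eta = 2.
lam_big = 0.5
q_big = scaled(lam_big) @ states[0]
```

With these assumed values the scaled parameters give a strictly smaller regularizer while inflating the Q-value of the optimal action from 1 to $1 + \lambda$, and for the larger scaling the action $a_3$ drops from second to third rank in state $s_1$.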

4. MEASURING THE ACCURACY OF STATE-ACTION VALUES

In this section we provide a methodology to measure the accuracy of the state-action value function in representing values for the non-optimal actions. At a high level, our approach is based on action modification and the relative performance drop $P$ as defined below:

Definition 4.1. The performance drop of an agent when modifying the agent's actions is given by

$$P = \frac{Score_{base} - Score_{actmod}}{Score_{base} - Score_{min}},$$

where $Score_{base}$ represents the baseline run of the game with no action modification, $Score_{min}$ represents the minimum score available for a given game, and $Score_{actmod}$ represents the run of the game where the actions of the agent are modified for a fraction of the state observations.

We now explain precisely how we propose to measure "accuracy" for non-optimal actions. Formally, let $a_i$ be the $i$-th best action decided by the deep neural policy in a given state $s$ (i.e. $Q(s, a)$ is sorted in decreasing order, and $a_i$ is the action corresponding to the $i$-th largest Q-value). For a trained agent, the value of $Q(s, a_i)$ should represent the expected cumulative rewards obtained by taking action $a_i$ in state $s$, and then taking the highest Q-value action (i.e. $a_1$) in every subsequent state. Thus, a natural test to perform would be: pick a random state $s$, make the agent choose action $a_i$ in state $s$, and in all other states have the agent choose the highest Q-value action. By comparing the relative performance drop $P$ in this test to a clean run where the agent always takes the highest Q-value action, one can measure the decline in rewards caused by taking action $a_i$. Further, we can provide a measure of accuracy for the state-action value function by comparing the results of the test for each $i \in \{1, 2, \dots, |A|\}$, and checking that the relative performance drops $P_i$ are in the correct order, i.e. $0 = P_1 \le P_2 \le \dots \le P_{|A|}$. However, there is an issue with the above proposal.
It is often the case that there are many states $s$ in which the action taken has very little impact on the final rewards. Instead, there is a relatively small number of critical states in which the action taken has a large impact. Thus, picking a single random state $s$ in which to take action $a_i$ will have a statistically insignificant impact on the final rewards in the game. Therefore we modify the test described above by instead sampling a $p$-fraction of the states in the episode uniformly at random, and making the agent take action $a_i$ in each of the sampled states. We then record the relative performance drop as a function of $p$, yielding a reward curve $P_i(p)$. More formally, we define these curves in Definition 4.2. Using these reward curves one can check whether $P_i(p)$ lies above $P_j(p)$ whenever $i > j$. Of course one curve may not always lie strictly above or below another, so we introduce the following definition to quantitatively capture the relative ordering of performance drop curves.

Definition 4.3. Let $F : [0, 1] \to [0, 1]$ and $G : [0, 1] \to [0, 1]$. For any $\tau > 0$, we say that $F$ $\tau$-dominates $G$ if

$$\int_0^1 (F(p) - G(p)) \, dp > \tau.$$

To compare the accuracy of state-action values for vanilla versus adversarially trained agents, we can thus perform the above test, and check the relative ordering of the curves $P_i(p)$ using Definition 4.3 for each agent type. In addition, we can also directly compare for each $i$ the curve $P^{adv}_i(p)$ for the adversarially trained agent with the curve $P^{vanilla}_i(p)$ for the vanilla trained agent. This is possible because $P_i(p)$ measures the performance drop of the agent relative to a clean run, and so always takes values on a normalized scale from 0 to 1. Thus, if we observe for example that $P^{adv}_2(p)$ $\tau$-dominates $P^{vanilla}_2(p)$ for some $\tau > 0$, we can conclude that the state-action value function of the vanilla trained agent more accurately represents the second-best action than that of the adversarially trained agent.
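The measurement procedure of this section can be sketched in a few lines. The code below is a hedged illustration, not the implementation used in the experiments: the function names are hypothetical, the action modification uses an independent coin flip with probability $p$ in each state as a simple stand-in for sampling a $p$-fraction of states uniformly at random, the integral in Definition 4.3 is approximated with the trapezoidal rule on a grid of $p$ values, and the two example curves are made-up numbers.

```python
import numpy as np

def performance_drop(score_base, score_actmod, score_min):
    """Relative performance drop P from Definition 4.1."""
    return (score_base - score_actmod) / (score_base - score_min)

def modified_action(q_values, i, p, rng):
    """With probability p return the i-th best action (1-indexed), else the best one."""
    ranked = np.argsort(-np.asarray(q_values))   # indices sorted by decreasing Q-value
    return int(ranked[i - 1]) if rng.random() < p else int(ranked[0])

def tau_dominates(p_grid, f_vals, g_vals, tau):
    """Check Definition 4.3 by trapezoidal integration of F - G over the grid."""
    diff = np.asarray(f_vals, dtype=float) - np.asarray(g_vals, dtype=float)
    widths = np.diff(p_grid)
    area = np.sum(widths * (diff[:-1] + diff[1:]) / 2.0)
    return bool(area > tau)

# Made-up performance drop curves on a coarse grid of p values.
p_grid = np.linspace(0.0, 1.0, 5)
curve_adv = np.array([0.0, 0.4, 0.7, 0.9, 1.0])
curve_vanilla = np.array([0.0, 0.2, 0.4, 0.5, 0.6])

rng = np.random.default_rng(0)
second_best = modified_action([0.1, 0.9, 0.5], i=2, p=1.0, rng=rng)
unmodified = modified_action([0.1, 0.9, 0.5], i=2, p=0.0, rng=rng)
```

For the made-up curves the area between them is 0.275, so `tau_dominates(p_grid, curve_adv, curve_vanilla, 0.1)` holds while the reverse comparison does not.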

5. EXPERIMENTAL DETAILS

The experiments are conducted in high dimensional state representation MDPs. In particular, our experiments are conducted in the Arcade Learning Environment (ALE) (Bellemare et al., 2013) in the OpenAI (Brockman et al., 2016) baseline version. The vanilla trained deep neural policy is trained via Double Deep Q-Network (DDQN) (Wang et al., 2016) initially proposed by Hasselt et al. (2016) with prioritized experience replay proposed by Schaul et al. (2016), and the state-of-the-art adversarially trained deep neural policy is trained via State-Adversarial Double Deep Q-Network (SA-DDQN) (Section 2) with prioritized experience replay (Schaul et al., 2016). The results are averaged over 10 episodes. We explain in detail all the necessary hyperparameters for the implementation in the supplementary material. The standard error of the mean is included for all of the figures and tables. Note that in the main body of the paper we focus on the baseline adversarial training. In the supplementary material we also provide analysis of more recent follow-up studies on adversarial training techniques. The conclusions are the same for all of the adversarial training techniques: the adversarially trained policies learn inaccurate, inconsistent and overestimated state-action values.

6. AN ANALYSIS ON THE STATE-ACTION VALUE FUNCTION REPRESENTATION

In this section we demonstrate that the state-action value function of adversarially trained deep neural policies provides inaccurate estimates for the non-optimal actions, and learns overestimated state-action values. This confirms that the theoretically motivated problems discussed in Section 3 do indeed occur in practice for deep neural policies. In particular, to evaluate the accuracy on non-optimal actions we use the methodology discussed in Section 4 of measuring the performance drop $P_i(p)$ that occurs when causing the deep neural policy to take the $i$-th best action in a $p$-fraction of states. Our aim is to provide an analysis of how accurate the state-action value function is in representing values for both optimal and non-optimal actions for vanilla trained deep neural policies and state-of-the-art adversarially trained deep neural policies.

6.1. INACCURACY OF STATE-ACTION VALUES FOR NON-OPTIMAL ACTIONS

In Figure 1 we show the performance drop $P_2(p)$ as a function of the fraction of states $p$ in which the action modification is applied for state-of-the-art adversarially trained deep neural policies and vanilla trained deep neural policies. In particular, the action modification is set for the second best action $a_2$ decided by the state-action value function $Q(s, a)$. As we increase the fraction of states in which the action modification set to $a_2$ is applied, we observe a performance drop for both of the deep neural policies. However, we observe that the vanilla trained deep neural policies experience a lower performance drop with this modification. Especially in BankHeist we observe that the performance drop does not exceed 0.55 even when the action modification is applied for a large fraction of the visited states for the vanilla trained deep neural policies. This gap in the performance drop between the adversarially trained and vanilla trained deep neural policies indicates that the state-action value function learnt by vanilla trained deep neural policies has a better estimate for the non-optimal actions.

Having measured the impact of the $a_2$ modification on the policy performance, we further test an $a_w = \arg\min_a Q(s, a)$ modification (i.e. the worst possible action in a given state) on the deep neural policy. Figure 2 shows that the performance drop $P_w(p)$ is higher in the vanilla trained deep neural policies compared to adversarially trained deep neural policies when the action modification is set to $a_w$. This again demonstrates that the state-action value function learnt by the vanilla trained deep neural policy has a more accurate representation over the non-optimal actions.

We hypothesize that adversarial training places higher emphasis on ensuring that the highest ranked action (i.e. the action that maximizes the state-action value function in a given state) does not change under small $\ell_p$-norm bounded perturbations, rather than accurately computing the state-action value function as discussed in Section 3. Since historically Q-learning suffered from overestimation of Q-values, a method which places higher emphasis on the highest ranked action risks converging to a state-action value function with overestimated Q-values. We further demonstrate this in Section 6.2.

6.2. OVERESTIMATION OF Q-VALUES IN ADVERSARIALLY TRAINED DEEP NEURAL POLICIES

Overestimation of Q-values was initially discussed by Thrun & Schwartz (1993) as a byproduct of the use of function approximators, and was subsequently explained as being caused by the use of the max operator in Q-learning (van Hasselt, 2010). Furthermore, overestimation bias resulting in learning of sub-optimal policies was demonstrated in practice by Hasselt et al. (2016). In this subsection we empirically demonstrate that state-of-the-art adversarial training indeed leads to overestimation in Q-values, as hypothesized in Section 3. In particular, Figure 4 and Table 2 show the overestimation bias on the state-action values learned by the state-of-the-art adversarially trained deep neural policies. Considering that overestimation bias is still an issue and active area of research for vanilla deep neural policy training (Lan et al., 2020; Anschel et al., 2017; Kuznetsov et al., 2020), the additional bias intrinsic to adversarial training must be addressed to be able to learn optimal policies.

In this subsection we demonstrate the inconsistencies in the non-optimal action ranking in adversarially trained policies. In particular, in Figure 3 in BankHeist choosing the worst action leads to a smaller performance drop than choosing the second best action, i.e. $P_w(p) < P_2(p)$ for all $p$. This demonstrates that the state-action value function is not ranking the sub-optimal actions accurately. While learning an accurate representation of the state-action values is important for obtaining a policy that aims to maximize expected cumulative rewards, learning the correct order of the actions can also solve this problem. Furthermore, in some cases the deep neural policy indeed must know the correct order of the actions due to the presence of an obstruction that blocks the optimal action either due to the existence of other agents or environmental effects (Rashid et al., 2020; Gleave et al., 2020).
In particular, in safe reinforcement learning several algorithms have been proposed to learn the ranking of the actions so that the agent can choose the next-best ranked action in safety critical situations (Alshiekh et al., 2018). Some work has also pointed out that in some cases learning the relative rank of the actions (Lin & Zhou, 2020) can be more sample efficient than learning correct estimates of the state-action values. While the inconsistency in action ranking for adversarially trained deep neural policies can be seen as a vulnerability problem from a security point of view, most intriguingly these results demonstrate the loss of information in the state-action value function as a novel fundamental trade-off intrinsic to adversarial training.

6.4. ACTION GAP PHENOMENON

The action gap is defined as the difference between the state-action value of the optimal action and the state-action value of the second ranked action:

$$\kappa(Q, s) = \max_{a' \in A} Q(s, a') - \max_{a \notin \arg\max_{a' \in A} Q(s, a')} Q(s, a).$$

Initially, Farahmand (2011) described the existence of a large action gap as a desirable property of an MDP, which makes learning an optimal policy easier. Subsequently, Bellemare et al. (2016) proposed a connection between the action gap and the overestimation of Q-values, and in particular hypothesized that increasing the action gap of the learned value function causes a decrease in overestimation of Q-values. Following this study, several papers built on the hypothesis that increasing the action gap causes reduction in bias (Smirnova & Dohmatob, 2020; Fox et al., 2016; Jain et al., 2020; Lu et al., 2019). In Figure 5 we show that adversarial training increases the action gap. Thus, the fact that adversarially trained deep neural policies overestimate the optimal state-action values (see Section 6.2) refutes the hypothesis that increasing the action gap is the sole cause of a decrease in overestimation bias of state-action values. We hypothesize that the consistent Bellman operator (Bellemare et al., 2016) may cause a decrease in overestimation for a different reason. In particular, the consistent Bellman operator corresponds to a special case of a certain reparameterization of Kullback-Leibler regularization for value iteration (Vieillard et al., 2020). Thus, it may be the case that the decrease in overestimation of Q-values and improvement in performance is due to a type of implicit regularization rather than to an increase of the action gap. Hence, our results show that increasing the action gap alone may coincide with an increase in overestimation of Q-values.
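As a small illustration of the action gap definition, the sketch below computes $\kappa(Q, s)$ from a vector of Q-values for a single state. The function name `action_gap` and the example Q-values are hypothetical; actions tied for the maximum are all excluded from the second term, following the $\arg\max$ set in the definition.

```python
import numpy as np

def action_gap(q_values):
    """kappa(Q, s): best Q-value minus the best Q-value among non-maximizing actions."""
    q = np.asarray(q_values, dtype=float)
    best = q.max()
    others = q[q < best]              # excludes every action in the arg max set
    if others.size == 0:
        raise ValueError("all actions are tied, so the action gap is undefined")
    return best - others.max()

gap_clear = action_gap([1.0, 0.2, 0.1])   # single best action
gap_tied = action_gap([0.5, 0.5, 0.1])    # two actions tied for the maximum
```

Here `gap_clear` is 0.8 and `gap_tied` is 0.4, since in the tied case both maximizing actions are removed before taking the second maximum.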

7. CONCLUSION

In this paper we focus on the state-action value function learnt by the state-of-the-art adversarially trained deep neural policies and vanilla trained deep neural policies. We provide theoretical analysis of the fundamental effects caused by adversarial training on the state-action value function. Furthermore, we conduct manifold experiments in the Arcade Learning Environment, and with our systematic analysis we demonstrate that vanilla trained deep neural policies have more accurate and consistent estimates for the state-action values than the state-of-the-art adversarially trained deep neural policies. More intriguingly, we show that adversarially trained deep neural policies in certain MDPs completely lose the information in the state-action value function that encodes the relative ranking of the actions. More importantly, we show that state-of-the-art adversarially trained deep neural policies learn overestimated state-action values. We believe our investigation lays out intrinsic properties of adversarial training while systematically revealing the underlying vulnerabilities, and can be conducive to building robust and optimal deep neural policies.



Note that because the adversarially trained deep neural policy overestimates Q-values, we introduce a normalization in order to compare the action gaps of adversarially and vanilla trained policies. In particular, in Figure 5 we report normalized Q-values in each state $s$ by dividing $Q(s, a)$ by $\sum_a |Q(s, a)|$.



Figure 1: Performance drop $P_2(p)$ with respect to action modification $a_2$ for the state-of-the-art adversarially trained deep neural policies and vanilla trained deep neural policies. Left: BankHeist. Center: RoadRunner. Right: Freeway.

Definition 4.2. Let $M$ be an MDP and $Q(s, a)$ be a state-action value function for $M$. In each state label the actions $a_1, \dots, a_{|A|}$ in order so that $Q(s, a_1) \ge Q(s, a_2) \ge \dots \ge Q(s, a_{|A|})$. We define the performance curve $P_i(p)$ to be the expected performance drop of an agent in $M$ which takes action $a_i$ in a randomly sampled $p$-fraction of states, and takes action $a_1$ in all other states.

Figure 2: Performance drop P w (p) with respect to action modification a w for the state-of-the-art adversarially trained deep neural policies and vanilla trained deep neural policies.

Figure 4: Q-value of the best action $a^*$ over the states for the state-of-the-art adversarially trained deep neural policy and vanilla trained deep neural policy.

Table 2: Average Q-values of the optimal action in state-of-the-art adversarially trained deep neural policies and vanilla trained deep neural policies.

Figure 3: P 2 and P w for adversarially trained deep neural policies.

Figure 5: Normalized state-action values for the best action $a^*$, second best action $a_2$ and worst action $a_w$ over states. Row 1: Vanilla trained policies. Row 2: State-of-the-art adversarially trained policies.

Table 3: Normalized state-action value estimates$^1$ and state-action value estimate shift for the second best action in state-of-the-art adversarially trained deep neural policies.

ALE | Q(s, a*) Adversarial | Q(s, a*) Vanilla | Q(s, a_2) Adversarial | Q(s, a_2) Vanilla | Q(s, a_w) Adversarial | Q(s, a_w) Vanilla
BankHeist | 0.1894±0.002 | 0.170±0.003 | 0.130±0.0006 | 0.169±0.002 | 0.127±0.0010 | 0.161±0.004
RoadRunner | 0.1696±0.008 | 0.236±0.094 | 0.132±0.0026 | 0.159±0.079 | 0.126±0.0049 | -0.265±0.071
Freeway | 0.1894±0.002 | 0.341±0.008 | 0.130±0.0006 | 0.333±0.002 | 0.127±0.0010 | 0.325±0.009

Area under the curve of performance drop under action modification (AM) a 2 and a w for the state-of-the-art adversarially trained deep neural policies and vanilla trained deep neural policies.


