THE ADVERSARIAL REGULATION OF THE TEMPORAL DIFFERENCE LOSS COSTS MORE THAN EXPECTED

Anonymous

Abstract

Deep reinforcement learning research has achieved significant performance levels for sequential decision making in MDPs with highly complex observations and state dynamics with the aid of deep neural networks. However, this aid came with a cost inherent to deep neural networks, which have increased sensitivities towards indistinguishable, peculiarly crafted non-robust directions. To alleviate these sensitivities, several studies have suggested techniques that cope with this problem by explicitly regulating the temporal difference loss for the worst-case sensitivity. In our study, we show that these worst-case regularization techniques come with a cost that, intriguingly, causes inconsistencies and overestimations in the state-action value functions. Furthermore, our results demonstrate that vanilla trained deep reinforcement learning policies have more accurate and consistent estimates for the state-action values. We believe our results reveal foundational intrinsic properties of adversarial training techniques and demonstrate the need to rethink the approach to robustness in deep reinforcement learning.

1. INTRODUCTION

Advancements in deep neural networks have recently proliferated, leading to an expansion in the domains where deep neural networks are utilized, including image classification (Krizhevsky et al., 2012), natural language processing (Sutskever et al., 2014), speech recognition (Hannun et al., 2014) and self-learning systems via exploration. In particular, deep reinforcement learning has become an emerging field with the introduction of deep neural networks as function approximators (Mnih et al., 2015). Hence, deep neural policies have been deployed in many different domains, from pharmaceuticals to self-driving cars (Daochang & Jiang, 2018; Huan-Hsin et al., 2017; Noonan, 2017). As the advancements in deep neural networks continued, a line of research focused on their vulnerabilities towards a certain type of specifically crafted perturbations computed via the cost function used to train the neural network (Szegedy et al., 2014; Goodfellow et al., 2015; Madry et al., 2018; Kurakin et al., 2016; Dong et al., 2018). While some research focused on producing optimal ℓ_p-norm bounded perturbations to cause the most possible damage to deep neural network models, an extensive amount of work focused on making the networks robust to such perturbations (Madry et al., 2018; Carmon et al., 2019; Raghunathan et al., 2020). The vulnerability to such particularly optimized adversarial directions was inherited by deep neural policies as well (Huang et al., 2017; Kos & Song, 2017; Korkmaz, 2022). Thus, robustness to such perturbations in deep reinforcement learning became a concern for the machine learning community, and several studies proposed various methods to increase robustness (Pinto et al., 2017; Gleave et al., 2020). In this paper, therefore, we focus on adversarially trained deep neural policies and the state-action value function learned by these training methods in the presence of an adversary.
In more detail, in this paper we seek answers to the following questions: (i) How accurate is the state-action value function at estimating the values of state-action pairs in MDPs with high-dimensional state representations? (ii) Does adversarial training affect the estimates of the state-action value function? (iii) What are the effects of training with worst-case distributional shift on the state-action value function's representation of the optimal actions? (iv) Are there any fundamental trade-offs intrinsic to explicit worst-case regularization in deep neural policy training? To answer these questions we focus on adversarial training and robustness in deep neural policies and make the following contributions:

• We conduct an investigation of the state-action values learnt by state-of-the-art adversarially trained deep neural policies and vanilla trained deep neural policies.

• We provide theoretically motivated justification for how adversarial training might change the state-action value function.

• We perform several experiments in Atari games with large state spaces from the Arcade Learning Environment (ALE). With our systematic analysis we show that vanilla trained deep neural policies have a more accurate representation of the sub-optimal actions compared to state-of-the-art adversarially trained deep neural policies.

• Furthermore, we show inconsistencies in the action ranking of state-of-the-art adversarially trained deep neural policies. These results demonstrate the loss of information in the state-action value function as a novel fundamental trade-off intrinsic to adversarial training.

• More importantly, we demonstrate that state-of-the-art adversarially trained deep neural policies learn overestimated state-action value functions.

• Finally, we explain how our results call into question the hypothesis initially proposed by Bellemare et al. (2016) relating the action gap and overestimation.
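The action-gap and action-ranking notions referred to in the contributions can be made concrete with a small sketch. This is a generic illustration with hypothetical Q-value arrays; the action gap and the pairwise rank-agreement metric below are stand-ins for the paper's actual measurements, not a reproduction of them:

```python
import numpy as np

def action_gap(q_values):
    """Gap between the best and second-best state-action values."""
    top_two = np.sort(q_values)[-2:]
    return top_two[1] - top_two[0]

def ranking_agreement(q_a, q_b):
    """Fraction of action pairs ranked in the same order by two
    state-action value estimates (1.0 = identical rankings)."""
    n = len(q_a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    agree = sum((q_a[i] - q_a[j]) * (q_b[i] - q_b[j]) > 0 for i, j in pairs)
    return agree / len(pairs)

# Hypothetical Q-values for one state under two training regimes:
vanilla = np.array([1.0, 0.8, 0.3, 0.1])
adversarial = np.array([2.5, 0.4, 0.9, 0.2])  # larger gap, reordered sub-optimal actions

print(action_gap(vanilla), action_gap(adversarial))
print(ranking_agreement(vanilla, adversarial))
```

In this toy example the two estimates agree on the optimal action but disagree on the ordering of the sub-optimal ones, which is exactly the kind of inconsistency the ranking analysis is designed to expose.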

2. BACKGROUND AND PRELIMINARIES

Preliminaries: In deep reinforcement learning the goal is to learn a policy for taking actions in a Markov Decision Process (MDP) that maximizes the discounted expected cumulative reward. An MDP is represented by a tuple M = (S, A, P, r, ρ_0, γ) where S is a set of continuous states, A is a discrete set of actions, P is a transition probability distribution on S × A × S, r : S × A → ℝ is a reward function, ρ_0 is the initial state distribution, and γ is the discount factor. The goal in reinforcement learning is to learn a policy π : S → P(A), mapping states to probability distributions over actions, that maximizes the expected cumulative reward R = E[Σ_{t=0}^{T-1} γ^t r(s_t, a_t)] where a_t ∼ π(s_t). In Q-learning (Watkins, 1989) the goal is to learn the optimal state-action value function Q*(s, a) = r(s, a) + γ Σ_{s'∈S} P(s'|s, a) max_{a'∈A} Q*(s', a'). Thus, the optimal policy is determined by choosing the action a*(s) = argmax_a Q*(s, a) in state s.
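The Bellman optimality equation above induces the standard temporal difference update used in Q-learning. A minimal tabular sketch follows; the toy state/action sets and the learning rate are hypothetical stand-ins for the deep function approximation setting considered in the paper:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One temporal difference update toward the Bellman optimality
    target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def greedy_action(Q, s):
    """a*(s) = argmax_a Q(s, a)."""
    return int(np.argmax(Q[s]))

# Toy example: 2 states, 2 actions, a single rewarded transition.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(greedy_action(Q, 0))  # action 1 now has the higher estimated value
```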

Adversarial Crafting and Training: Szegedy et al. (2014) observed that imperceptible perturbations could change the decision of a deep neural network and proposed a box-constrained optimization method to produce such perturbations. Goodfellow et al. (2015) suggested a faster method to produce such perturbations based on the linearization of the cost function used in training the network. Kurakin et al. (2016) proposed an iterative version of the fast gradient sign method of Goodfellow et al. (2015) constrained to an ε-ball,

x_adv^{N+1} = clip_ε(x_adv^N + α · sign(∇_x J(x_adv^N, y)))    (1)

in which J(x, y) represents the cost function used to train the deep neural network, x represents the input, and y represents the output labels. Several other methods have also been proposed; e.g. Korkmaz (2020) uses a momentum-based extension of the iterative fast gradient sign method,

v_{t+1} = μ · v_t + ∇_{s_adv} J(s_adv^t + μ · v_t, a) / ‖∇_{s_adv} J(s_adv^t + μ · v_t, a)‖_1    (2)

s_adv^{t+1} = s_adv^t + α · v_{t+1} / ‖v_{t+1}‖_2    (3)
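The iterative fast gradient sign method can be sketched in a few lines of NumPy. This is a schematic rather than the exact implementation used in the cited works; `grad_J` is a hypothetical callable standing in for the gradient of the training loss with respect to the input:

```python
import numpy as np

def iterative_fgsm(x, grad_J, alpha=0.01, eps=0.03, n_steps=10):
    """Iterative fast gradient sign method: repeatedly step in the
    direction sign(grad) and project back into the eps-ball around x."""
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = x_adv + alpha * np.sign(grad_J(x_adv))
        # Projection onto the l-infinity eps-ball around the clean input.
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv
```

With a constant positive gradient the perturbation saturates at the ε boundary, which illustrates why the clip step is what keeps the perturbation imperceptible.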

Adversarial training has mostly been conducted with perturbations computed by projected gradient descent (PGD) proposed by Madry et al. (2018) (i.e. Equation 1).

Adversaries and Training in Deep Neural Policies: The initial investigations on the resilience of deep neural policies were conducted concurrently by Kos & Song (2017) and Huang et al. (2017), based on the fast gradient sign method proposed by Goodfellow et al. (2015). Korkmaz (2022) showed that deep reinforcement learning policies learn shared adversarial features across MDPs. While several studies focused on improving optimization techniques to compute optimal perturbations, a line of research focused on making deep neural policies resilient to these perturbations. Mandlekar et al. (2017) proposed including these perturbations at training time to increase resilience for robotic

