THE ADVERSARIAL REGULATION OF THE TEMPORAL DIFFERENCE LOSS COSTS MORE THAN EXPECTED

Anonymous

Abstract

Deep reinforcement learning research has reached significant performance levels for sequential decision making in MDPs with highly complex observations and state dynamics, with the aid of deep neural networks. However, this aid comes with a cost inherent to deep neural networks: an increased sensitivity to imperceptible, peculiarly crafted non-robust directions. To alleviate these sensitivities, several studies proposed techniques that explicitly regularize the temporal difference loss against the worst-case sensitivity. In our study, we show that these worst-case regularization techniques come with a cost of their own: intriguingly, they cause inconsistencies and overestimations in the state-action value function. Furthermore, our results demonstrate that vanilla-trained deep reinforcement learning policies produce more accurate and consistent estimates of the state-action values. We believe our results reveal foundational intrinsic properties of adversarial training techniques and demonstrate the need to rethink the approach to robustness in deep reinforcement learning.

1. INTRODUCTION

Advancements in deep neural networks have recently proliferated, expanding the domains in which deep neural networks are utilized, including image classification (Krizhevsky et al., 2012), natural language processing (Sutskever et al., 2014), speech recognition (Hannun et al., 2014), and self-learning systems via exploration. In particular, deep reinforcement learning has become an emerging field with the introduction of deep neural networks as function approximators (Mnih et al., 2015). Hence, deep neural policies have been deployed in many different domains, from pharmaceuticals to self-driving cars (Daochang & Jiang, 2018; Huan-Hsin et al., 2017; Noonan, 2017). As the advancements in deep neural networks continued, a line of research focused on their vulnerability to a certain type of specifically crafted perturbation computed via the cost function used to train the neural network (Szegedy et al., 2014; Goodfellow et al., 2015; Madry et al., 2018; Kurakin et al., 2016; Dong et al., 2018). While some research focused on producing optimal ℓp-norm bounded perturbations to cause the largest possible damage to deep neural network models, an extensive amount of work focused on making the networks robust to such perturbations (Madry et al., 2018; Carmon et al., 2019; Raghunathan et al., 2020). The vulnerability to such particularly optimized adversarial directions was inherited by deep neural policies as well (Huang et al., 2017; Kos & Song, 2017; Korkmaz, 2022). Thus, robustness to such perturbations in deep reinforcement learning became a concern for the machine learning community, and several studies proposed various methods to increase robustness (Pinto et al., 2017; Gleave et al., 2020). In this paper we therefore focus on adversarially trained deep neural policies and the state-action value function learned by these training methods in the presence of an adversary.
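The cost-function-based perturbations discussed above are computed from the gradient of the objective with respect to the input. As a minimal, hedged sketch (assuming a toy linear Q-function where the gradient has a closed form; the names `w`, `q`, and `fgsm_state` are our own illustration, not the paper's notation), an FGSM-style ℓ∞-bounded step that lowers the value of the greedy action looks like:

```python
# Illustrative FGSM-style l-inf perturbation against a hypothetical
# linear Q-function Q(s, a) = w[a] . s (toy example, not the paper's setup).

def q(w, s, a):
    # Linear state-action value: dot product of the action's weights and the state.
    return sum(wi * si for wi, si in zip(w[a], s))

def greedy(w, s):
    # Greedy action under the current Q-function.
    return max(range(len(w)), key=lambda a: q(w, s, a))

def sign(x):
    return (x > 0) - (x < 0)

def fgsm_state(w, s, eps):
    # One gradient-sign step that lowers the greedy action's value.
    # For a linear Q-function, grad_s Q(s, a*) is exactly w[a*], so the
    # worst-case l-inf step of radius eps is s - eps * sign(w[a*]).
    a_star = greedy(w, s)
    return [si - eps * sign(g) for si, g in zip(s, w[a_star])]

w = [[1.0, -2.0], [0.5, 0.5]]  # two actions, 2-d states
s = [1.0, 0.0]
a_star = greedy(w, s)          # action 0: Q = 1.0 vs 0.5
s_adv = fgsm_state(w, s, eps=0.3)
print(q(w, s, a_star), q(w, s_adv, a_star))  # value drops from 1.0 to ~0.1
```

For deep neural policies the gradient would of course come from backpropagation through the network rather than the closed form used here; the structure of the attack is the same.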
In more detail, in this paper we seek answers to the following questions: (i) How accurate is the state-action value function at estimating the values of state-action pairs in MDPs with high-dimensional state representations? (ii) Does adversarial training affect the estimates of the state-action value function? (iii) What are the effects of training with a worst-case distributional shift on the state-action value function's representation of the optimal actions? (iv) Are there any fundamental trade-offs intrinsic to explicit worst-case regularization in deep neural policy training? To answer these questions, we focus on adversarial training and robustness in deep neural policies and make the following contributions:
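Question (iv) concerns training objectives that augment the temporal difference loss with an explicit worst-case regularizer. A minimal sketch of such an objective, assuming a toy linear Q-function for which the worst-case action gap has a closed form (the function names and the hinge-style regularizer here are illustrative, in the spirit of state-adversarial robustness regularizers, not the exact formulation of any particular method studied in the paper):

```python
# Sketch of a worst-case-regularized TD objective on a hypothetical
# linear Q-function Q(s, a) = w[a] . s (toy example for illustration).

def q(w, s, a):
    return sum(wi * si for wi, si in zip(w[a], s))

def greedy(w, s):
    return max(range(len(w)), key=lambda a: q(w, s, a))

def td_loss(w, s, a, r, s_next, gamma):
    # Squared one-step TD error: (Q(s,a) - (r + gamma * max_b Q(s',b)))^2
    target = r + gamma * max(q(w, s_next, b) for b in range(len(w)))
    delta = q(w, s, a) - target
    return delta * delta

def worst_case_margin(w, s, eps):
    # Closed-form worst-case action gap over the l-inf ball of radius eps.
    # For a linear Q-function:
    #   max_{||d||_inf <= eps} Q(s+d, a) - Q(s+d, a*)
    #     = (w[a] - w[a*]) . s + eps * ||w[a] - w[a*]||_1
    a_star = greedy(w, s)
    worst = 0.0
    for a in range(len(w)):
        if a == a_star:
            continue
        diff = [wa - wb for wa, wb in zip(w[a], w[a_star])]
        gap = sum(d * si for d, si in zip(diff, s)) + eps * sum(abs(d) for d in diff)
        worst = max(worst, gap)  # hinge: only positive gaps are penalized
    return worst

def regularized_loss(w, s, a, r, s_next, gamma, eps, kappa):
    # TD loss plus a kappa-weighted worst-case regularizer -- the class of
    # objective whose side effects on Q-value estimates this paper studies.
    return td_loss(w, s, a, r, s_next, gamma) + kappa * worst_case_margin(w, s, eps)

w = [[1.0, -2.0], [0.5, 0.5]]  # two actions, 2-d states
s, s_next = [1.0, 0.0], [0.0, 1.0]
print(regularized_loss(w, s, 0, 0.0, s_next, gamma=0.9, eps=0.3, kappa=1.0))
```

For deep Q-networks the inner maximization has no closed form and is typically bounded or approximated; the trade-off the paper investigates is how the added regularization term distorts the learned state-action values relative to the plain TD objective.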

