UNCOVERING DIRECTIONS OF INSTABILITY VIA QUADRATIC APPROXIMATION OF DEEP NEURAL LOSS IN REINFORCEMENT LEARNING

Abstract

Learning in MDPs with highly complex state representations is currently possible due to multiple advancements in reinforcement learning algorithm design. However, this increase in complexity, and in particular the increase in the dimensionality of the observation space, came at the cost of a non-robustness that can be exploited (i.e. by moving along worst-case directions in the observation space). To address this policy instability problem we propose a novel method to ascertain the presence of these non-robust directions via a quadratic approximation of the deep neural policy loss. Our method provides a theoretical basis for the fundamental cutoff between stable observations and non-robust observations. Furthermore, our technique is computationally efficient and does not depend on the method used to produce the worst-case directions. We conduct extensive experiments in the Arcade Learning Environment with several different non-robust alteration techniques. Most significantly, we demonstrate the effectiveness of our approach even in the setting where the alterations are explicitly optimized to circumvent our proposed method.

1. INTRODUCTION

Since Mnih et al. (2015) showed that deep neural networks can be used to parameterize reinforcement learning policies, there has been substantial growth in new algorithms and applications for deep reinforcement learning. While this progress has resulted in a variety of new capabilities for reinforcement learning agents, it has at the same time introduced new challenges due to the non-robustness of deep neural networks to imperceptible adversarial perturbations originally discovered by Szegedy et al. (2014). In particular, Huang et al. (2017); Kos & Song (2017) showed that the non-robustness of neural networks to adversarial perturbations extends to the deep reinforcement learning domain, where applications such as autonomous driving, automatic financial trading or healthcare decision making cannot tolerate such a vulnerability. There has been a significant amount of effort in trying to make deep neural networks robust to adversarial perturbations (Goodfellow et al., 2015; Madry et al., 2018; Pinto et al., 2017). However, in this arms race it has been shown that deep reinforcement learning policies learn adversarial features independent from their worst-case (i.e. adversarial) training techniques (Korkmaz, 2022). More intriguingly, a line of work has focused on showing the inevitability of adversarial examples and the intrinsic difficulty of learning robust models (Dohmatob, 2019; Mahloujifar et al., 2019; Gourdeau et al., 2019). Given that it may not be possible to make DNNs completely robust to adversarial examples, a natural objective is to instead attempt to detect the presence of adversarial manipulations. In this paper we propose a novel identification method for adversarial directions in the deep neural policy manifold. Our study is the first to focus on detection of non-robust directions in the deep reinforcement learning neural loss landscape.
Our approach relies on differences in the curvature of the neural policy in the neighborhood of an adversarial direction when compared to a baseline state observation. At a high level, our method is based on the intuition that while baseline states have neighborhoods determined by an optimization procedure intended to learn a policy that works well across all states, each non-robust direction is the output of some local optimization in the neighborhood of one particular state. Our proposed method is computationally efficient, requiring only one gradient computation and two policy evaluations; requires no training that depends on the method used to compute the adversarial direction; and is theoretically well-founded. Hence, our study focuses on identification of non-robust directions and makes the following contributions:

• Our paper is the first to focus on identification of adversarial directions in the deep reinforcement learning policy manifold.

• We propose a novel method, Identification of Non-Robust Directions (INRD), to detect adversarial state manipulations based on the local curvature of the neural network policy. INRD is independent of the method used to generate the adversarial direction, computationally efficient, and theoretically justified.

Q-Learning: The state-action value function is learned via the temporal difference update

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha[R_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)].$

Adversarial Examples: Goodfellow et al. (2015) introduced the fast gradient method (FGM) for producing adversarial examples for image classification. The method is based on taking the gradient of the training cost function $J(x, y)$ with respect to the input image and bounding the perturbation by $\epsilon$,

$x_{\mathrm{adv}} = x + \epsilon \cdot \mathrm{sign}(\nabla_x J(x, y)),$

where $x$ is the input image and $y$ is the output label. Later, an iterative version of FGM called I-FGM was proposed by Kurakin et al. (2016). This is also often referred to as Projected Gradient Descent (PGD), as in Madry et al. (2018), where the I-FGM update is

$x^{N+1}_{\mathrm{adv}} = \mathrm{clip}_\epsilon\big(x^{N}_{\mathrm{adv}} + \alpha \cdot \mathrm{sign}(\nabla_x J(x^{N}_{\mathrm{adv}}, y))\big)$

with $x^{0}_{\mathrm{adv}} = x$. Dong et al.
(2018) further modified I-FGM by introducing a momentum term in the update, yielding a method called MI-FGSM. Korkmaz (2020) later proposed a Nesterov-momentum based approach for the deep reinforcement learning domain. The DeepFool method of Moosavi-Dezfooli et al. (2016) is an alternative approach to those based on FGSM: DeepFool iteratively projects onto the closest separating hyperplane between classes. Another alternative approach, proposed by Carlini & Wagner (2017a), is based on finding a minimal perturbation that achieves a different target class label. The approach is based on minimizing the loss

$\min_{s_{\mathrm{adv}} \in S} \; c \cdot J(s_{\mathrm{adv}}) + \| s_{\mathrm{adv}} - s \|_2^2$   (2)

where $s$ is the clean input, $s_{\mathrm{adv}}$ is the adversarial example, and $J(s)$ is a modified version of the cost function used to train the network. Our method, which focuses on identifying non-robust directions in the deep neural policy manifold, is the first to investigate detection of adversarial manipulations in deep reinforcement learning. Our identification method does not require modifying the training of the neural network, does not require any training tailored to the adversarial method used, and uses only two neural network function evaluations and one gradient computation.
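As an illustration, the I-FGM/PGD update can be sketched in a few lines. A minimal sketch, assuming a toy logistic loss in place of the network cost $J$; the weight vector `w`, the step size, and the budget are illustrative choices, not values from the paper:

```python
import numpy as np

def grad_loss(x, y, w):
    """Gradient of the logistic loss J(x, y) = log(1 + exp(-y * w.x)) w.r.t. the input x."""
    margin = y * np.dot(w, x)
    return -y * w / (1.0 + np.exp(margin))

def ifgm(x, y, w, eps=0.1, alpha=0.02, steps=10):
    """I-FGM / PGD: repeated signed-gradient ascent steps on J, each followed by
    clipping the iterate back into the l_inf ball of radius eps around the clean input."""
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_loss(x_adv, y, w)
        x_adv = x_adv + alpha * np.sign(g)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the eps-ball
    return x_adv

rng = np.random.default_rng(0)
w = rng.normal(size=8)          # hypothetical linear model weights
x = rng.normal(size=8)          # a clean input
x_adv = ifgm(x, y=1.0, w=w)
print(np.max(np.abs(x_adv - x)) <= 0.1 + 1e-12)  # True: perturbation stays in the eps-ball
```

Each step moves every coordinate by the signed gradient, so the loss increases monotonically until the clip to the $\epsilon$-ball binds; this is the projection that gives PGD its name.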



Chen et al. (2018) proposed a variant of the Carlini & Wagner (2017a) formulation that adds an $\ell_1$-regularization term to produce sparser adversarial examples,

$\min_{s_{\mathrm{adv}} \in S} \; c \cdot J(s_{\mathrm{adv}}) + \lambda_1 \| s_{\mathrm{adv}} - s \|_1 + \lambda_2 \| s_{\mathrm{adv}} - s \|_2^2.$
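This elastic-net formulation can be minimized with an ISTA-style proximal gradient loop: a gradient step on the smooth part followed by soft-thresholding for the $\ell_1$ term. The sketch below uses a toy "wrong label" logistic loss as a stand-in for the modified cost function; the model, coefficients, and step size are illustrative assumptions, not the setup of Chen et al.:

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ead_sketch(s, y, w, c=1.0, lam1=0.05, lam2=0.1, step=0.1, iters=200):
    """ISTA on the elastic-net objective
        c * J(s + delta) + lam1 * ||delta||_1 + lam2 * ||delta||_2^2,
    where J(x) = log(1 + exp(y * w.x)) is a toy loss that rewards pushing the
    input toward the wrong class.  Each iteration: gradient step on the smooth
    part, then soft-threshold the perturbation to handle the l1 term."""
    delta = np.zeros_like(s)
    for _ in range(iters):
        m = y * np.dot(w, s + delta)
        grad_J = y * w / (1.0 + np.exp(-m))        # gradient of log(1 + exp(m))
        smooth_grad = c * grad_J + 2.0 * lam2 * delta
        delta = soft_threshold(delta - step * smooth_grad, step * lam1)
    return s + delta

rng = np.random.default_rng(0)
w = rng.normal(size=8)   # hypothetical linear model weights
s = w.copy()             # a confidently classified input for label y = 1
s_adv = ead_sketch(s, y=1.0, w=w)
```

Because the $\ell_1$ proximal step zeroes out coordinates whose gradient contribution is below the threshold, the resulting perturbations tend to be sparser than those of the purely $\ell_2$-regularized objective.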


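The efficiency claim above (one gradient computation plus two extra function evaluations) can be illustrated with a central second difference that estimates the curvature of a loss along its normalized gradient direction. This is only a sketch of the curvature signal the approach relies on; the quadratic toy loss below is a sanity check for the estimator, not the deep neural policy loss:

```python
import numpy as np

def directional_curvature(J, grad_J, s, h=1e-3):
    """Estimate the second derivative of J at s along the normalized gradient
    direction v = g / ||g|| with a central second difference: one gradient
    computation plus two extra evaluations of J."""
    g = grad_J(s)
    v = g / np.linalg.norm(g)
    return (J(s + h * v) - 2.0 * J(s) + J(s - h * v)) / h ** 2

# Toy quadratic loss J(s) = 0.5 s^T A s, whose exact curvature along a unit
# direction v is v^T A v -- used only to check the finite-difference estimate.
rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
A = B @ B.T                      # symmetric positive semi-definite
J = lambda s_: 0.5 * s_ @ A @ s_
grad_J = lambda s_: A @ s_

s = rng.normal(size=5)
v = grad_J(s) / np.linalg.norm(grad_J(s))
est = directional_curvature(J, grad_J, s)
print(abs(est - v @ A @ v))      # small: the estimate matches the exact curvature
```

For a quadratic loss the central difference is exact up to floating-point error; for a deep network it gives a local curvature estimate at the same small cost, which is what makes a curvature-based identification criterion computationally cheap.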