UNCOVERING DIRECTIONS OF INSTABILITY VIA QUADRATIC APPROXIMATION OF DEEP NEURAL LOSS IN REINFORCEMENT LEARNING

Abstract

Learning in MDPs with highly complex state representations is currently possible due to multiple advancements in reinforcement learning algorithm design. However, this increase in complexity, and furthermore the increase in the dimensionality of the observations, came at the cost of non-robustness that can be exploited (i.e. by moving along worst-case directions in the observation space). To solve this policy instability problem we propose a novel method to ascertain the presence of these non-robust directions via quadratic approximation of the deep neural policy loss. Our method provides a theoretical basis for the fundamental cut-off between stable observations and non-robust observations. Furthermore, our technique is computationally efficient, and does not depend on the methods used to produce the worst-case directions. We conduct extensive experiments in the Arcade Learning Environment with several different non-robust alteration techniques. Most significantly, we demonstrate the effectiveness of our approach even in the setting where alterations are explicitly optimized to circumvent our proposed method.

1. INTRODUCTION

Since Mnih et al. (2015) showed that deep neural networks can be used to parameterize reinforcement learning policies, there has been substantial growth in new algorithms and applications for deep reinforcement learning. While this progress has resulted in a variety of new capabilities for reinforcement learning agents, it has at the same time introduced new challenges due to the non-robustness of deep neural networks to imperceptible adversarial perturbations, originally discovered by Szegedy et al. (2014). In particular, Huang et al. (2017); Kos & Song (2017) showed that the non-robustness of neural networks to adversarial perturbations extends to the deep reinforcement learning domain, where applications such as autonomous driving, automatic financial trading, or healthcare decision making cannot tolerate such a vulnerability. There has been a significant amount of effort in trying to make deep neural networks robust to adversarial perturbations (Goodfellow et al., 2015; Madry et al., 2018; Pinto et al., 2017). However, in this arms race it has been shown that deep reinforcement learning policies learn adversarial features independent of their worst-case (i.e. adversarial) training techniques (Korkmaz, 2022). More intriguingly, a line of work has focused on showing the inevitability of adversarial examples and the intrinsic difficulty of learning robust models (Dohmatob, 2019; Mahloujifar et al., 2019; Gourdeau et al., 2019). Given that it may not be possible to make DNNs completely robust to adversarial examples, a natural objective is instead to detect the presence of adversarial manipulations. In this paper we propose a novel identification method for adversarial directions in the deep neural policy manifold. Our study is the first to focus on detection of non-robust directions in the deep reinforcement learning neural loss landscape.
Our approach relies on differences in the curvature of the neural policy loss in the neighborhood of an adversarial direction when compared to a baseline state observation. At a high level, our method is based on the intuition that while baseline states have neighborhoods determined by an optimization procedure intended to learn a policy that works well across all states, each non-robust direction is the output of some local optimization in the neighborhood of one particular state. Our proposed method is computationally efficient, requiring only one gradient computation and two policy evaluations, and requires no training that depends on the
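The curvature probe described above can be illustrated with a small sketch. The idea is the standard second-order Taylor expansion: for a loss L, state s, and unit direction v, L(s + εv) ≈ L(s) + ε g·v + (ε²/2) vᵀHv, so the directional curvature vᵀHv can be recovered from one gradient computation and two loss (policy) evaluations. The function names, the toy quadratic loss, and the probe direction below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def curvature_along_direction(loss, grad, s, v, eps=1e-3):
    """Estimate the directional curvature v^T H v of `loss` at state s,
    using the second-order Taylor expansion
        L(s + eps*v) ~= L(s) + eps * (g . v) + (eps^2 / 2) * v^T H v.
    Cost: one gradient computation and two loss evaluations, matching
    the budget described in the text."""
    g = grad(s)                # one gradient computation
    l0 = loss(s)               # first policy-loss evaluation
    l1 = loss(s + eps * v)     # second policy-loss evaluation
    return 2.0 * (l1 - l0 - eps * g.dot(v)) / eps ** 2

# Toy stand-in for a policy loss: a quadratic with known Hessian A,
# so the estimate can be checked against the exact curvature.
A = np.diag([1.0, 10.0])
loss = lambda s: 0.5 * s @ A @ s
grad = lambda s: A @ s

s = np.array([0.3, -0.7])      # "state observation"
v = np.array([0.0, 1.0])       # unit-norm probe direction
c = curvature_along_direction(loss, grad, s, v)
# For this quadratic, the exact curvature is v^T A v = 10.
```

In the detection setting, a markedly larger curvature estimate along a candidate direction than along baseline directions would flag that direction as non-robust; the cut-off threshold itself is the subject of the theoretical analysis referenced in the abstract.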

