REVEALING DOMINANT EIGENDIRECTIONS VIA SPECTRAL NON-ROBUSTNESS ANALYSIS IN THE DEEP REINFORCEMENT LEARNING POLICY MANIFOLD

Abstract

Deep neural policies have recently been deployed in a diverse set of settings, from biotechnology to automated financial systems. However, the utilization of deep neural networks to approximate the state-action value function raises concerns about decision boundary stability, in particular with regard to the sensitivity of policy decision making to indiscernible, non-robust features due to highly non-convex and complex deep neural manifolds. These concerns constitute an obstruction to understanding the reasoning made by deep neural policies, and their foundational limitations. Thus, it is crucial to develop techniques that aim to understand the sensitivities in the learnt representations of neural network policies. To achieve this we introduce a method that identifies the dominant eigendirections via spectral analysis of non-robust directions in the deep neural policy decision boundary across both time and space. Through experiments in the Arcade Learning Environment (ALE), we demonstrate the effectiveness of our spectral analysis algorithm for identifying correlated non-robust directions, and for measuring how sample shifts remold the set of sensitive directions in the neural policy landscape. Most importantly, we show that state-of-the-art adversarial training techniques yield learning of sparser high-sensitivity directions, with dramatically larger oscillations over time, when compared to standard training. We believe our results reveal the fundamental properties of the decision process made by deep reinforcement learning policies, and can help in constructing safe, reliable and value-aligned deep neural policies.

1. INTRODUCTION

Reinforcement learning algorithms leveraging the power of deep neural networks have obtained state-of-the-art results initially in game-playing tasks Mnih et al. (2015) and subsequently in continuous control Lillicrap et al. (2015). Since this initial success, there has been a continuous stream of developments of new algorithms Mnih et al. (2016); Hasselt et al. (2016); Wang et al. (2016).

In our paper we focus on understanding the learned representations and policy vulnerabilities and ask the following questions: (i) How do non-robust directions on the deep neural policy manifold interact with each other temporally and spatially? (ii) Do the non-robust features learnt by deep reinforcement learning algorithms transform under adversarial attacks? (iii) How do these learnt correlated non-robust features change with distributional shift? (iv) Does adversarial training solve the problem of learning correlated non-robust features? To answer these questions we focus on understanding the representations learned by deep reinforcement learning policies and make the following contributions:

• We propose a novel tracing algorithm to analyze the spatially and temporally correlated vulnerabilities of deep reinforcement learning policies.

• We conduct several experiments in the Arcade Learning Environment with policies trained on high-dimensional state representations.

• We go over several benchmarked adversarial attack techniques and show how these attacks affect the vulnerable learned representations.

• We inspect the effects of distributional shift on the correlated non-robust feature patterns learned by deep reinforcement learning policies.

• Finally, we investigate the presence of non-robust features in adversarially trained deep neural policies, and show that the state-of-the-art adversarial training method leads to learning sparser and spikier vulnerable representations.
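As a rough illustration of the kind of spectral analysis involved, the sketch below applies an eigendecomposition to a batch of synthetic perturbation directions. The data here is a hypothetical stand-in: random noise plus a shared component along a direction `u`, playing the role of perturbation directions that an adversarial attack would produce across visited states. The dimensions and the planted direction are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                           # flattened state dimension (illustrative)
u = rng.normal(size=d)
u /= np.linalg.norm(u)                           # a planted, shared non-robust direction

# Stand-in for adversarial perturbation directions collected across states:
# each row is small isotropic noise plus a common component along u.
perturbations = 0.2 * rng.normal(size=(200, d)) + rng.normal(size=(200, 1)) * u

# Spectral analysis: eigendecomposition of the second-moment matrix of the
# perturbation directions reveals directions shared across states.
C = perturbations.T @ perturbations / len(perturbations)
eigvals, eigvecs = np.linalg.eigh(C)             # ascending eigenvalue order
top = eigvecs[:, -1]                             # dominant eigendirection

alignment = abs(top @ u)                         # near 1 when u dominates the spectrum
```

A clear spectral gap between the top eigenvalue and the rest is what distinguishes a correlated non-robust direction from unstructured per-state noise.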

2. BACKGROUND AND PRELIMINARIES

2.1. PRELIMINARIES

A Markov Decision Process (MDP) is defined by a tuple (S, A, P, R, γ) where S is a set of states, A is a set of actions, P : S × A × S → [0, 1] is a Markov transition kernel, R : S × A × S → R is a reward function, and γ ∈ [0, 1) is a discount factor. A reinforcement learning agent interacts with an MDP by observing the current state s ∈ S and taking an action a ∈ A. The agent then transitions to state s′ with probability P(s, a, s′) and receives reward R(s, a, s′). A policy π : S × A → [0, 1] selects action a in state s with probability π(s, a). The main objective in reinforcement learning is to learn a policy π which maximizes the expected cumulative discounted rewards R = E_{a_t ∼ π(s_t, ·)} [Σ_t γ^t R(s_t, a_t, s_{t+1})]. This maximization is achieved by the iterative Bellman update

Q(s_t, a_t) = R(s_t, a_t) + γ Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) V(s_{t+1}), (1)

where V(s) = max_a Q(s, a). Hence, the optimal policy π*(s, a) is obtained by executing the action a*(s) = arg max_a Q(s, a), i.e. the action maximizing the state-action value function in state s.
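The Bellman update above can be sketched on a toy tabular MDP; the two-state transition kernel `P`, reward table `R`, and discount are illustrative placeholders, not taken from the paper.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# P[s, a, s'] = transition probability; each P[s, a, :] sums to 1.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
# R[s, a] = expected immediate reward.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

Q = np.zeros((n_states, n_actions))
for _ in range(500):                 # value iteration via the Bellman update
    V = Q.max(axis=1)                # V(s') = max_a' Q(s', a')
    Q = R + gamma * (P @ V)          # Q(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')

policy = Q.argmax(axis=1)            # greedy action a*(s) = argmax_a Q(s, a)
```

Since the Bellman operator is a γ-contraction, repeated application converges to the unique fixed point Q*, from which the greedy policy is read off.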

2.2. ADVERSARIAL PERTURBATION TECHNIQUES AND FORMULATIONS

Following the initial study conducted by Szegedy et al. (2014), Goodfellow et al. (2015) proposed a fast and efficient way to produce ε-bounded adversarial perturbations in image classification based on linearization of J(x, y), the cost function used to train the network, at data point x with label y. Consequently, Kurakin et al. (2016) proposed the iterative form of this algorithm, named the iterative fast gradient sign method (I-FGSM),

x_adv^{N+1} = clip_ε(x_adv^N + α sign(∇_x J(x_adv^N, y))). (2)

This algorithm has been further improved by the introduction of a momentum term by Dong et al. (2018). Following this, Ezgi (2020) proposed a Nesterov momentum technique to compute ε-bounded adversarial perturbations for deep reinforcement learning policies by computing the gradient at a lookahead point,

v_{N+1} = μ v_N + ∇_x J(x_adv^N + α μ v_N, y) / ||∇_x J(x_adv^N + α μ v_N, y)||_1, (3)

x_adv^{N+1} = clip_ε(x_adv^N + α sign(v_{N+1})). (4)
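A minimal sketch of the I-FGSM iteration in Eq. (2) is given below. To keep it self-contained, the network loss J is replaced by a toy quadratic J(x, y) = ||x − y||²/2 whose gradient ∇_x J = x − y is available in closed form; in practice J is the training loss and the gradient comes from backpropagation.

```python
import numpy as np

def i_fgsm(x, y, eps=0.1, alpha=0.01, steps=20):
    """Iterative FGSM with a toy analytic gradient standing in for backprop."""
    x_adv = x.copy()
    for _ in range(steps):
        grad = x_adv - y                          # stand-in for ∇_x J(x_adv, y)
        x_adv = x_adv + alpha * np.sign(grad)     # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the ε-ball around x
    return x_adv

x = np.zeros(4)
y = np.ones(4)          # gradient x - y is negative, so the iterate moves away from y
x_adv = i_fgsm(x, y)
```

The projection step is what makes the perturbation ε-bounded: every iterate stays inside the ℓ∞ ball of radius ε around the clean input x.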
Deep reinforcement learning has also struck new performance records in highly complex tasks Silver et al. (2017). While the field has developed rapidly, the understanding of the representations learned by deep neural network policies has lagged behind. This lack of understanding is of critical importance in the context of the sensitivity of policy decisions to imperceptible, non-robust features. Beginning with the work of Szegedy et al. (2014) and Goodfellow et al. (2015), deep neural networks have been shown to be vulnerable to adversarial perturbations below the level of human perception. In response, a line of work has focused on proposing training techniques to increase robustness by applying these perturbations to the input of deep neural networks during training time (i.e. adversarial training) Goodfellow et al. (2015); Madry et al. (2017). However, there are also concerns about adversarial training, including decreased accuracy on clean data Bhagoji et al. (2019), impaired generalization (e.g. a performance gap between adversarial and standard training that increases as the size of the dataset grows Chen et al. (2020)), and incorrect invariance to semantically meaningful changes Tramèr et al. (2020).
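Adversarial training in the sense of Goodfellow et al. (2015) and Madry et al. (2017) alternates an inner maximization (perturbing the inputs to increase the loss) with the usual outer minimization over the parameters. The sketch below illustrates this loop on a toy logistic regression with analytic gradients and synthetic data; the model, data, ε, and learning rate are assumptions for illustration, not the training setup of the cited works.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 8))                    # synthetic inputs
w_true = rng.normal(size=8)
y = (X @ w_true > 0).astype(float)               # labels from a linear threshold

w, eps, lr = np.zeros(8), 0.05, 0.5
for _ in range(200):
    # Inner maximization: one FGSM step on the loss w.r.t. the inputs.
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad_x = np.outer(p - y, w)                  # ∇_x of the logistic loss
    X_adv = X + eps * np.sign(grad_x)
    # Outer minimization: gradient step on the perturbed batch.
    p = 1.0 / (1.0 + np.exp(-(X_adv @ w)))
    w -= lr * X_adv.T @ (p - y) / len(y)

accuracy = np.mean(((X @ w) > 0) == (y > 0.5))   # accuracy on clean inputs
```

Even in this toy setting the trade-off discussed above is visible in principle: the parameters are fit to worst-case inputs inside the ε-ball rather than to the clean data itself.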

