REVEALING DOMINANT EIGENDIRECTIONS VIA SPECTRAL NON-ROBUSTNESS ANALYSIS IN THE DEEP REINFORCEMENT LEARNING POLICY MANIFOLD

Abstract

Deep neural policies have recently been deployed in a diverse set of settings, from biotechnology to automated financial systems. However, the use of deep neural networks to approximate the state-action value function raises concerns about decision boundary stability, in particular regarding the sensitivity of policy decision making to indiscernible, non-robust features arising from highly non-convex and complex deep neural manifolds. These concerns constitute an obstruction to understanding the reasoning of deep neural policies and their foundational limitations. Thus, it is crucial to develop techniques that aim to understand the sensitivities in the learnt representations of neural network policies. To achieve this we introduce a method that identifies the dominant eigendirections via spectral analysis of non-robust directions in the deep neural policy decision boundary across both time and space. Through experiments in the Arcade Learning Environment (ALE), we demonstrate the effectiveness of our spectral analysis algorithm for identifying correlated non-robust directions, and for measuring how sample shifts remold the set of sensitive directions in the neural policy landscape. Most importantly, we show that state-of-the-art adversarial training techniques yield sparser high-sensitivity directions, with dramatically larger oscillations over time, when compared to standard training. We believe our results reveal fundamental properties of the decision process of deep reinforcement learning policies, and can help in constructing safe, reliable and value-aligned deep neural policies.

1. INTRODUCTION

Reinforcement learning algorithms leveraging the power of deep neural networks have obtained state-of-the-art results, initially in game-playing tasks Mnih et al. (2015) and subsequently in continuous control Lillicrap et al. (2015). Since this initial success, there has been a continuous stream of developments, both of new algorithms Mnih et al. (2016), Hasselt et al. (2016), Wang et al. (2016), and of striking new performance records in highly complex tasks Silver et al. (2017). While the field of deep reinforcement learning has developed rapidly, the understanding of the representations learned by deep neural network policies has lagged behind. This lack of understanding is of critical importance in the context of the sensitivity of policy decisions to imperceptible, non-robust features. Beginning with the work of Szegedy et al. (2014) and Goodfellow et al. (2015), deep neural networks have been shown to be vulnerable to adversarial perturbations below the level of human perception. In response, a line of work has focused on training techniques that increase robustness by applying these perturbations to the input of deep neural networks during training (i.e. adversarial training) Goodfellow et al. (2015), Madry et al. (2017). However, there are also concerns about adversarial training, including decreased accuracy on clean data Bhagoji et al. (2019), hindered generalization (e.g. a performance gap between adversarial and standard training that widens as the dataset grows Chen et al. (2020)), and incorrect invariance to semantically meaningful changes Tramèr et al. (2020).
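To make the adversarial training idea referenced above concrete, the following is a minimal, self-contained sketch (not the method evaluated in this paper): fast-gradient-sign (FGSM-style, Goodfellow et al. (2015)) perturbations are applied to each input during training of a toy logistic-regression model. The model, data, and all function names here are illustrative assumptions chosen so the gradient with respect to the input has a closed form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, eps):
    # Gradient of the binary cross-entropy loss w.r.t. the INPUT x for a
    # logistic-regression model: dL/dx = (sigmoid(w.x + b) - y) * w.
    grad_x = (sigmoid(w @ x + b) - y) * w
    # FGSM: take a step of size eps in the direction of the gradient's sign,
    # i.e. the perturbation that (locally) most increases the loss.
    return x + eps * np.sign(grad_x)

def adversarial_train(X, y, eps=0.1, lr=0.5, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Replace each clean sample with its FGSM neighbour before the
            # gradient update -- the core of adversarial training.
            x_adv = fgsm_perturb(xi, yi, w, b, eps)
            err = sigmoid(w @ x_adv + b) - yi
            w -= lr * err * x_adv
            b -= lr * err
    return w, b
```

Training on perturbed inputs pushes the decision boundary away from every sample by roughly `eps` per coordinate, so the learned classifier retains a margin on clean data; the tension noted above (clean-accuracy loss, hindered generalization) arises when this margin requirement conflicts with the data distribution.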

