ROBUST REINFORCEMENT LEARNING ON STATE OBSERVATIONS WITH LEARNED OPTIMAL ADVERSARY

Abstract

We study the robustness of reinforcement learning (RL) with adversarially perturbed state observations, which aligns with the setting of many adversarial attacks on deep reinforcement learning (DRL) and is also important for rolling out real-world RL agents under unpredictable sensing noise. With a fixed agent policy, we demonstrate that an optimal adversary for perturbing state observations can be found, which is guaranteed to obtain the worst-case agent reward. For DRL settings, this leads to a novel empirical adversarial attack on RL agents via a learned adversary that is much stronger than previous ones. To enhance the robustness of an agent, we propose a framework of alternating training with learned adversaries (ATLA), which trains an adversary online together with the agent using policy gradient following the optimal adversarial attack framework. Additionally, inspired by the analysis of the state-adversarial Markov decision process (SA-MDP), we show that past states and actions (history) can be useful for learning a robust agent, and we empirically find that an LSTM-based policy can be more robust under adversaries. Empirical evaluations on a few continuous control environments show that ATLA achieves state-of-the-art performance under strong adversaries. Our code is available at https://github.com/huanzhang12/ATLA_robust_RL.

1. INTRODUCTION

Modern deep reinforcement learning agents (Mnih et al., 2015; Levine et al., 2015; Lillicrap et al., 2015; Silver et al., 2016; Fujimoto et al., 2018) typically use neural networks as function approximators. Following the discovery of adversarial examples in image classification tasks (Szegedy et al., 2013), the vulnerabilities of DRL agents were first demonstrated by Huang et al. (2017), Lin et al. (2017), and Kos & Song (2017), and further developed under more environments and different attack scenarios (Behzadan & Munir, 2017a; Pattanaik et al., 2018; Xiao et al., 2019). These attacks commonly add imperceptible noise to the observations of states, i.e., the observed environment slightly differs from the true environment. This raises concerns for using RL in safety-critical applications such as autonomous driving (Sallab et al., 2017; Voyage, 2019); additionally, the discrepancy between ground-truth states and agent observations also contributes to the "reality gap": an agent working well in simulated environments may fail in real environments due to noise in observations (Jakobi et al., 1995; Muratore et al., 2019), as real-world sensing contains unavoidable noise (Brooks, 1992).

We classify the weakness of a DRL agent under perturbations of state observations into two classes: the vulnerability of function approximators, which typically originates from the highly non-linear and black-box nature of neural networks; and the intrinsic weakness of the policy: even if perfect features for states are extracted, an agent can still make mistakes due to an intrinsic weakness in its policy.

For example, in deep Q networks (DQNs) for Atari games, a large convolutional neural network (CNN) is used for extracting features from input frames. To act correctly, the network must extract crucial features: e.g., for the game of Pong, the position and velocity of the ball, which can be observed by visualizing convolutional layers (Hausknecht & Stone, 2015; Guo et al., 2014).
Many attacks on the DQN setting add imperceptible noise (Huang et al., 2017; Lin et al., 2017; Kos & Song, 2017; Behzadan & Munir, 2017a) that exploits the vulnerability of deep neural networks so that they extract wrong features, as seen in adversarial examples for image classification tasks. On the other hand, fragile function approximation is not the only source of weakness of an RL agent: in a finite-state Markov decision process (MDP), we can use tabular policy and value functions so there is no function approximation error. The agent can still be vulnerable to small perturbations of observations, e.g., perturbing the observation of a state to one of its four neighbors in a gridworld-like environment can prevent an agent from reaching its goal (Figure 1). To improve the robustness of RL, we need to take measures from both aspects: a more robust function approximator, and a policy aware of perturbations in observations.

Figure 1: We show an agent in a gridworld environment trained with no function approximators, whose optimal policy is intrinsically not robust to perturbations of state observations. The red square and blue circle are the starting point and target (reward +1) of the agent, respectively. The green triangles are traps, with reward -1 once encountered. The adversary is allowed to perturb the observation to adjacent states along four directions: up, down, left, and right. The adversary earns +1 at traps and -1 at the target. We set γ = 0.9 for both agent and adversary. This example shows that the vulnerability of an RL agent does not only come from errors in function approximators such as DNNs.

Techniques developed for enhancing the robustness of neural network (NN) classifiers can be applied to address the vulnerability of function approximators. In particular, for environments like Atari games with images as input and discrete actions as outputs, the policy network $\pi_\theta$ behaves similarly to a classifier at test time. Thus, Fischer et al. (2019) and Mirman et al. (2018a) utilized existing certified adversarial defense approaches (Mirman et al., 2018b; Wong & Kolter, 2018; Gowal et al., 2018; Zhang et al., 2020a) from supervised learning to enhance the robustness of DQN agents. Another successful approach (Zhang et al., 2020b) for both Atari and high-dimensional continuous control environments regularizes the smoothness of the learned policy such that $\max_{\hat{s} \in B(s)} D(\pi_\theta(s), \pi_\theta(\hat{s}))$ is small for some divergence $D$, where $B(s)$ is a neighborhood around $s$. This maximization can be solved using a gradient-based method or convex relaxations of NNs (Salman et al., 2019; Zhang et al., 2018; Xu et al., 2020), and the resulting maximum is then minimized by optimizing $\theta$. Such an adversarial minimax regularization is in the same spirit as the ones used in some adversarial training approaches for (semi-)supervised learning, e.g., TRADES (Zhang et al., 2019) and VAT (Miyato et al., 2015). However, regularizing the function approximators does not explicitly improve the intrinsic policy robustness.

In this paper, we propose an orthogonal approach, alternating training with learned adversaries (ATLA), to enhance the robustness of DRL agents. We focus on dealing with the intrinsic weakness of the policy by learning an adversary online with the agent during training time, rather than directly regularizing function approximators. Our main contributions can be summarized as:
• We follow the framework of the state-adversarial Markov decision process (SA-MDP) and show how to learn an optimal adversary for perturbing observations. We demonstrate practical attacks under this formulation and obtain learned adversaries that are significantly stronger than previous ones.
• We propose the alternating training with learned adversaries (ATLA) framework to improve the robustness of DRL agents. The difference between our approach and previous adversarial training approaches is that we use a stronger adversary, which is learned online together with the agent.
• Our analysis of SA-MDP also shows that history can be important for learning a robust agent. We thus propose to use an LSTM-based policy in the ATLA framework and find that it is more robust than policies parameterized as regular feedforward NNs.
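
To make the smoothness regularizer discussed earlier in this section more concrete, the following is a minimal sketch of its inner maximization over B(s), solved with a gradient-based method. Everything specific here is an assumption made for illustration and is not taken from Zhang et al. (2020b): the toy policy `policy_mean` (a single tanh unit with weights `w`), the choice of D as the squared gap between action means, the ℓ∞ ball of radius `eps`, the step size, and the use of a few random restarts.

```python
import math
import random


def policy_mean(w, s):
    """Toy policy head (assumption): mu(s) = tanh(w . s), a scalar action mean."""
    return math.tanh(sum(wi * si for wi, si in zip(w, s)))


def divergence(w, s, s_hat):
    """D(pi(s), pi(s_hat)), taken here as the squared gap between action means."""
    return (policy_mean(w, s_hat) - policy_mean(w, s)) ** 2


def grad_wrt_s_hat(w, s, s_hat):
    """Analytic gradient of the divergence with respect to s_hat."""
    mu_hat = policy_mean(w, s_hat)
    scale = 2.0 * (mu_hat - policy_mean(w, s)) * (1.0 - mu_hat ** 2)
    return [scale * wi for wi in w]


def sign(x):
    return (x > 0) - (x < 0)


def worst_case_observation(w, s, eps, steps=20, restarts=4):
    """Projected sign-gradient ascent over B(s) = {s_hat : |s_hat - s|_inf <= eps}."""
    best = list(s)
    for seed in range(restarts):
        rng = random.Random(seed)
        # Random start: the gradient vanishes at s_hat = s, where D is exactly 0.
        s_hat = [si + rng.uniform(-eps, eps) for si in s]
        for _ in range(steps):
            g = grad_wrt_s_hat(w, s, s_hat)
            # Ascent step, then projection (clipping) back into the ball.
            s_hat = [min(si + eps, max(si - eps, xi + 0.25 * eps * sign(gi)))
                     for si, xi, gi in zip(s, s_hat, g)]
        if divergence(w, s, s_hat) > divergence(w, s, best):
            best = s_hat
    return best


w = [0.8, -0.5]   # hypothetical policy weights
s = [0.2, 0.1]    # hypothetical true state
eps = 0.1         # hypothetical perturbation budget
s_hat = worst_case_observation(w, s, eps)
reg = divergence(w, s, s_hat)   # value a regularizer would penalize
print("perturbed observation:", s_hat, "regularizer value:", reg)
```

In a training loop, a value like `reg` would be added to the policy loss and minimized with respect to the policy parameters. Gradient ascent only finds a local maximizer, which is why the sketch keeps the best of several restarts; convex relaxations, as mentioned above, instead bound the maximum.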



(a) Path in the unperturbed environment (found by policy iteration). Agent's reward = +1. Black arrows and numbers show the actions and value function of the agent.
(b) Path under the optimal adversary. Agent's reward = -∞. Red arrows and numbers show the actions and value function of the optimal adversary (Section 3.1).
(c) A robust POMDP policy solved by SARSOP (Kurniawati et al., 2008) under the same adversary. This policy is history dependent (Section 3.2).
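
The gridworld setting of Figure 1 can be reproduced in spirit with a few lines of tabular code. The sketch below first solves for the agent's policy without an adversary, then fixes that policy and solves the adversary's own MDP, in which the adversary's "action" is the observation it shows to the agent. The 3x3 layout, the function names, the rule that bumping into a wall leaves the agent in place, and the restriction of perturbed observations to non-terminal cells are all assumptions for illustration; this is not the grid drawn in Figure 1.

```python
GAMMA = 0.9
ROWS, COLS = 3, 3
START, GOAL, TRAPS = (0, 0), (2, 2), {(1, 1)}  # hypothetical layout, not Figure 1
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STATES = [(r, c) for r in range(ROWS) for c in range(COLS)]


def is_terminal(s):
    return s == GOAL or s in TRAPS


def step(s, a):
    """Deterministic move; bumping into a wall leaves the agent in place."""
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else s


def agent_reward(s_next):
    """Reward on entering a cell: +1 at the target, -1 at a trap, else 0."""
    if s_next == GOAL:
        return 1.0
    return -1.0 if s_next in TRAPS else 0.0


def value_iteration(choices, outcome, sweeps=50):
    """Solve V(s) = max_c [r + GAMMA * V(s')] with (s', r) = outcome(s, c)."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            if not is_terminal(s):
                V[s] = max(r + GAMMA * V[s2]
                           for s2, r in (outcome(s, c) for c in choices(s)))
    return V


# 1) Agent: ordinary value iteration and greedy policy, with no adversary.
def agent_outcome(s, a):
    s2 = step(s, a)
    return s2, agent_reward(s2)


V_agent = value_iteration(lambda s: ACTIONS, agent_outcome)


def q_agent(s, a):
    s2, r = agent_outcome(s, a)
    return r + GAMMA * V_agent[s2]


pi = {s: max(ACTIONS, key=lambda a: q_agent(s, a))
      for s in STATES if not is_terminal(s)}


# 2) Adversary: the agent's policy is fixed, so the adversary faces an MDP
#    whose "action" is the observation s_hat shown to the agent.
def neighborhood(s):
    """B(s): the true cell plus its in-grid four-neighbors (non-terminal only)."""
    cells = {s} | {step(s, a) for a in ACTIONS}
    return sorted(x for x in cells if not is_terminal(x))


def adversary_outcome(s, s_hat):
    s2 = step(s, pi[s_hat])        # the agent acts on the perturbed observation
    return s2, -agent_reward(s2)   # adversary reward = negative agent reward


V_adv = value_iteration(neighborhood, adversary_outcome)

print("agent value at start, no attack:", round(V_agent[START], 3))
print("adversary value at start:", round(V_adv[START], 3))
```

In this toy layout the unperturbed agent reaches the target in four moves, giving a start value of 0.9^3 = 0.729, whereas the adversary's start value is 0.9: by showing the agent a neighboring cell whose greedy action points toward the trap, it forces the agent into the trap on the second move. No function approximation is involved, which mirrors the point of Figure 1 that the weakness is in the policy itself.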

