ROBUST REINFORCEMENT LEARNING ON STATE OB-SERVATIONS WITH LEARNED OPTIMAL ADVERSARY

Abstract

We study the robustness of reinforcement learning (RL) with adversarially perturbed state observations, which aligns with the setting of many adversarial attacks to deep reinforcement learning (DRL) and is also important for rolling out real-world RL agent under unpredictable sensing noise. With a fixed agent policy, we demonstrate that an optimal adversary to perturb state observations can be found, which is guaranteed to obtain the worst case agent reward. For DRL settings, this leads to a novel empirical adversarial attack to RL agents via a learned adversary that is much stronger than previous ones. To enhance the robustness of an agent, we propose a framework of alternating training with learned adversaries (ATLA), which trains an adversary online together with the agent using policy gradient following the optimal adversarial attack framework. Additionally, inspired by the analysis of state-adversarial Markov decision process (SA-MDP), we show that past states and actions (history) can be useful for learning a robust agent, and we empirically find a LSTM based policy can be more robust under adversaries. Empirical evaluations on a few continuous control environments show that ATLA achieves state-of-the-art performance under strong adversaries. Our code is available at https://github.com/huanzhang12/ATLA_robust_RL.

1. INTRODUCTION

Modern deep reinforcement learning agents (Mnih et al., 2015; Levine et al., 2015; Lillicrap et al., 2015; Silver et al., 2016; Fujimoto et al., 2018) typically use neuron networks as function approximators. Since the discovery of adversarial examples in image classification tasks (Szegedy et al., 2013) , the vulnerabilities in DRL agents were first demonstrated in (Huang et al., 2017; Lin et al., 2017; Kos & Song, 2017) and further developed under more environments and different attack scenarios (Behzadan & Munir, 2017a; Pattanaik et al., 2018; Xiao et al., 2019) . These attacks commonly add imperceptible noises into the observations of states, e.g., the observed environment slightly differs from true environment. This raises concerns for using RL in safety-crucial applications such as autonomous driving (Sallab et al., 2017; Voyage, 2019) ; additionally, the discrepancy between ground-truth states and agent observations also contributes to the "reality gap" -an agent working well in simulated environments may fail in real environments due to noises in observations (Jakobi et al., 1995; Muratore et al., 2019) , as real-world sensing contains unavoidable noise (Brooks, 1992) . We classify the weakness of a DRL agent on the perturbations of state observations into two classes: the vulnerability in function approximators, which typically originates from the highly non-linear and blackbox nature of neural networks; and intrinsic weakness of policy: even perfect features for states are extracted, an agent can still make mistakes due to an intrinsic weakness in its policy. For example, in the deep Q networks (DQNs) for Atari games, a large convolutional neural network (CNN) is used for extracting features from input frames. To act correctly, the network must extract crucial features: e.g., for the game of Pong, the position and velocity of the ball, which can observed by visualizing convolutional layers (Hausknecht & Stone, 2015; Guo et al., 2014) . Many attacks to the DQN setting add imperceptible noises (Huang et al., 2017; Lin et al., 2017; Kos & Song, 2017; Behzadan & Munir, 2017a ) that exploit the vulnerability of deep neural networks so that they extract wrong features, as we have seen in adversarial examples of image classification tasks. On the other (c) A robust POMDP policy solved by SARSOP (Kurniawati et al., 2008) under the same adversary. This policy is history dependent (Section 3.2). Figure 1 : We show an agent in gridworld environment trained with no function approximators, and its optimal policy is intrinsically not robust to perturbations of state observations. The red square and blue circle are the starting point and target (reward +1) of the agent, respectively. The green triangles are traps, with reward -1 once encountered. The adversary is allowed to perturb the observation to adjacent states along four directions: up, down, left, and right. Adversary earns +1 at traps and -1 at the target. We set γ = 0.9 for both agent and adversary. This example shows that the vulnerability of a RL agent does not only come from the errors in function approximators such as DNNs. hand, the fragile function approximation is not the only source of the weakness of a RL agentin a finite-state Markov decision process (MDP), we can use tabular policy and value functions so there is no function approximation error. The agent can still be vulnerable to small perturbations on observations, e.g., perturbing the observation of a state to one of its four neighbors in a gridworldlike environment can prevent an agent from reaching its goal (Figure 1 ). To improve the robustness of RL, we need to take measures from both aspects -a more robust function approximator, and a policy aware of perturbations in observations. Techniques developed in enhancing the robustness of neural network (NN) classifiers can be applied to address the vulnerability in function approximators. Especially, for environments like Atari games with images as input and discrete actions as outputs, the policy network π θ behaves similarly to a classifier in test time. Thus, Fischer et al. (2019) ; Mirman et al. (2018a) utilized existing certified adversarial defense (Mirman et al., 2018b; Wong & Kolter, 2018; Gowal et al., 2018; Zhang et al., 2020a) approaches in supervised learning to enhance the robustness of DQN agents. Another successful approach (Zhang et al., 2020b) for both Atari and high-dimensional continuous control environment regularizes the smoothness of the learned policy such that max ŝ∈B(s) D(π θ (s), π θ (ŝ)) is small for some divergence D and B(s) is a neighborhood around s. This maximization can be solved using a gradient based method or convex relaxations of NNs (Salman et al., 2019; Zhang et al., 2018; Xu et al., 2020) , and then minimized by optimizing θ. Such an adversarial minimax regularization is in the same spirit as the ones used in some adversarial training approaches for (semi-)supervised learning, e.g., TRADES (Zhang et al., 2019) and VAT (Miyato et al., 2015) . However, regularizing the function approximators does not explicitly improve the intrinsic policy robustness. In this paper, we propose an orthogonal approach, alternating training with learned adversaries (ATLA), to enhance the robustness of DRL agents. We focus on dealing with the intrinsic weakness of the policy by learning an adversary online with the agent during training time, rather than directly regularizing function approximators. Our main contributions can be summarized as: • We follow the framework of state-adversarial Markov decision process (SA-MDP) and show how to learn an optimal adversary for perturbing observations. We demonstrate practical attacks under this formulation and obtain learned adversaries that are significantly stronger than previous ones. • We propose the alternating training with learned adversaries (ATLA) framework to improve the robustness of DRL agents. The difference between our approach and previous adversarial training approaches is that we use a stronger adversary, which is learned online together with the agent. • Our analysis on SA-MDP also shows that history can be important for learning a robust agent. We thus propose to use a LSTM based policy in the ATLA framework and find that it is more robust than policies parameterized as regular feedforward NNs. • We evaluate our approach empirically on four continuous control environments. We outperform explicit regularization based methods in a few environments, and our approach can also be directly combined with explicit regularizations on function approximators to achieve state-of-the-art results.

2. RELATED WORK

State-adversarial Markov decision process (SA-MDP) (Zhang et al., 2020b) characterizes the decision making problem under adversarial attacks on state observations. Most importantly, the true state in the environment is not perturbed by the adversary under this setting; for example, perturbing pixels in an Atari environment (Huang et al., 2017; Kos & Song, 2017; Lin et al., 2017; Behzadan & Munir, 2017a; Inkawhich et al., 2019) does not change the true location of an object in the game simulator. SA-MDP can characterize agent performance under natural or adversarial noise from sensor measurements. For example, GPS sensor readings on a car are naturally noisy, but the ground truth location of the car is not affected by the noise. Importantly, this setting is different from robust Markov decision process (RMDP) (Nilim & El Ghaoui, 2004; Iyengar, 2005) , where the worst case transition probabilities of the environment are considered. "Robust reinforcement learning" in some works (Mankowitz et al., 2018; 2019) refer to this different definition of robustness in RMDP, and should not be confused with our setting of robustness against perturbations on state observations. Several works proposed methods to learn an adversary online together with an agent. RARL (Pinto et al., 2017) proposed to train an agent and an adversary under the two-player Markov game (Littman, 1994) setting. The adversary can change the environment states through actions directly applied to environment. The goal of RARL is to improve the robustness against environment parameter changes, such as mass, length or friction. Gleave et al. (2019) discussed the learning of an adversary using reinforcement learning to attack a victim agent, by taking adversarial actions that changes the environment and consequentially change the observation of the victim agent. Both Pinto et al. (2017) ; Gleave et al. (2019) conduct their attack under on the two-player Markov game framework, rather than considering perturbations on state observations. Besides, Li et al. (2019) consider a similar Markov game setting in multi-agent RL environments. The difference between these works and ours can be clearly seen in the setting where the adversary is fixed -under the framework of (Pinto et al., 2017; Gleave et al., 2019) , the learning of agent is still a MDP, but in our setting, it becomes a harder POMDP problem (Section 3.2). Training DRL agents with perturbed state observations from adversaries have been investigated in a few works, sometimes referred to as adversarial training. Kos & Song (2017) ; Behzadan & Munir (2017b) used gradient based adversarial attacks to DQN agents and put adversarial frames into replay buffer. This approach is not very successful because for Atari environments the main source of weakness is likely to come from the function approximator, so an adversarial regularization framework such as (Zhang et al., 2020b; Qu et al., 2020) which directly controls the smoothness of the Q function is more effective. For lower dimensional continuous control tasks such as the MuJoCo environments, Mandlekar et al. (2017) ; Pattanaik et al. (2018) conducted FGSM and multistep gradient based attacks during training time; however, their main focus was on the robustness against environment parameter changes and only limited evaluation on the adversarial attack setting was conducted with relatively weak adversaries. Zhang et al. (2020b) systematically tested this approach under newly proposed strong attacks, and found that it cannot reliably improve robustness. These early adversarial training approaches typically use gradients from a critic function. They are usually relatively weak, and not sufficient to lead to a robust policy under stronger attacks. The robustness of RL has also been investigated from other perspectives. For example, Tessler et al. (2019) study MDPs under action perturbations; Tan et al. (2020) use adversarial training on action space to enhance agent robustness under action perturbations. Besides, policy teaching (Zhang & Parkes, 2008; Zhang et al., 2009; Ma et al., 2019) and policy poisoning (Rakhsha et al., 2020; Huang & Zhu, 2019) manipulate the reward or cost signal during agent training time to induce a desired agent policy. Essentially, policy teaching is a training time "attack" with perturbed rewards from the environments (which can be analogous to data poisoning attacks in supervised learning settings), while our goal is to obtain a robust agent against test time adversarial attacks. All these settings differ from the setting of perturbing state observations discussed in our paper.

3. METHODOLOGY

In this section, we first discuss the case where the agent policy is fixed, and then the case where the adversary is fixed in SA-MDPs. This allows us to propose an alternating training framework to improve robustness of RL agents under perturbations on state observations. Notations and Background We use S and A to represent the state space and the action space, respectively; P(S) defines the set of all possible probability measures on S. We define a Markov decision process (MDP) as (S, A, R, p, γ), where R : S × A × S → R and p : S × A → P(S) are two mappings represent the reward and transition probability. The transition probability at time step t can be written as p(s |s, a) = Pr(s t+1 = s |s t = s, a t = a). Reward function is defined as the expected reward R(s, a, s ) := E[r t |s t = s, a t = a, s t+1 = s ]. γ ∈ [0, 1] is the discounting factor. We denote a stationary policy as π : S → P(A) which is independent of history. We denote history h t at time t as {s 0 , a 0 , • • • , s t-1 , a t-1 , s t } and H as the set of all histories. A history-dependent policy is defined as π : H → P(A). A partially observable Markov decision process (Astrom, 1965) (POMDP) can be defined as a 7-tuple (S, A, O, Ω, R, p, γ) where O is a set of observations and Ω is a set of conditional observation probabilities p(o|s). Unlike MDPs, POMDPs typically require history-dependent optimal policies. To study the decision problem under adversaries on state observations, we use state-adversarial Markov decision process (SA-MDP) framework (Zhang et al., 2020b) . In SA-MDP, an adversary ν : S → P(S) is introduced to perturb the input state of an agent; however, the true environment state s is unchanged (Figure 2 ). Formally, an SA-MDP is a 6-tuple (S, A, B, R, p, γ) where B is a mapping from a state s ∈ S to a set of states B(s) ∈ S. The agent sees the perturbed state ŝ ∼ ν(•|s) and takes the action π(a|ŝ) accordingly. B limits the power of adversary: supp (ν(•|s)) ∈ B(s). The goal of SA-MDP is to solve an optimal policy π * under its optimal adversary ν * (π * ); an optimal adversary is defined as ν * (π) such that π achieves the lowest possible expected discounted return (or value) on all states. Zhang et al. (2020b) did not give an explicit algorithm to solve SA-MDP and found that a stationary optimal policy need not exist.

3.1. FINDING THE OPTIMAL ADVERSARY UNDER A FIXED POLICY

In this section, we discuss how to find an optimal adversary ν for a given policy π. An optimal adversary leads to the worst case performance under bounded perturbation set B, and is an absolute lower bound of the expected cumulative reward an agent can receive. It is similar to the concept of "minimal adversarial example" in supervised learning tasks. We first show how to solve the optimal adversary in MDP setting and then apply it to the DRL settings. A technical lemma (Lemma 1) from Zhang et al. (2020b) shows that, from the adversary's point of view, a fixed and stationary agent policy π and the environment dynamics can be essentially merged into an MDP with redefined dynamics and reward functions: Lemma 1 (Zhang et al. (2020b) ) Given an SA-MDP M = (S, A, R, B, p, γ) and a fixed and stationary policy π(•|•), there exists an MDP M = (S, Â, R, p, γ) such that the optimal policy of M is the optimal adversary ν for SA-MDP given the fixed π, where Â = S, and R(s, â, s ) := E[r|s, â, s ] = -a∈A π(a|â)p(s |s,a)R(s,a,s ) a∈A π(a|â)p(s |s,a) for s, s ∈ S and â ∈ B(s) ⊂ Â, C for s, s ∈ S and â / ∈ B(s). where C is a large negative constant, and p(s |s, â) = a∈A π(a|â)p(s |s, a) for s, s ∈ S and â ∈ Â. The intuition behind Lemma 1 is that the adversary's goal is to reduce the reward earned by the agent. Thus, when a reward r t is received by the agent at time step t, the adversary receives a negative reward of rt = -r t . To prevent the agent from taking actions outside of set B(s), a large negative reward C is assigned to these actions such that the optimal adversary cannot take them. For actions within B(s), we calculate R(s, â, s ) by its definition, R(s, â, s ) := E[r|s, â, s ] which yields the term in Lemma 1. The proof can be found in Appendix B of Zhang et al. (2020b) . After constructing the MDP M , it is possible to solve an optimal agent ν of M , which will be the optimal adversary on SA-MDP M given policy π. For MDPs, under mild regularity assumptions an optimal policy always exists (Puterman, 2014). In our case, the optimal policy on M corresponds to an optimal adversary in SA-MDP, which is the worst case perturbation for policy π. As an illustration, in Figure 1 , we show a GridWorld environment. The red square is the starting point. The blue circle and green triangles are the target and traps, respectively. When the agent hits the target, it earns reward +1 and the game stops and it earns reward -1 when it encounters a trap. We set γ = 0.9 for both agent and adversary. The adversary is allowed to perturb the observation to adjacent cells along four directions: up, down, left, and right. When there is no adversary, after running policy iteration, the agent can easily reach target and earn a reward of +1, as in Figure 1a . However, if we train the adversary based on Lemma 1 and apply it to the agent, we are able to make the agent repeatedly encounter a trap. This leads to -∞ reward for the agent and +∞ reward for the adversary as shown in Figure 1b . We now extend this Lemma 1 to the DRL setting. Since the learning of adversary is equivalent to solving an MDP, we parameterize the adversary as a neural network function and use any popular DRL algorithm to learn an "optimal" adversary. Here we quote the word "optimal" as we use function approximator to learn the agent so it's no longer optimal, but we emphasize that it follows the SA-MDP framework of solving an optimal adversary. No existing adversarial attacks follow such a theoretically guided framework. We show our algorithm in Algorithm 1. Instead of learning to produce ŝ ∈ B(s) directly, since B(s) is usually a small set nearby s (e.g., B = {s | s-s p ≤ }, our adversary learns a perturbation vector ∆, and we project s + ∆ to B(s). The first advantage of attacking a policy in this way is that it is strong -as we allow to optimize the adversary in an online loop of interactions with the agent policy and environment, and keep improving the adversary with a goal of receiving as less reward as possible. It is strong because it follows the theoretical framework of finding an optimal adversary, rather than using any heuristic to generate a perturbation. Empirically, in the cases demonstrated in Figure 3 , previous strong attacks (e.g., Robust Sarsa attack) can successfully fail an agent and make it stop moving and receive a small positive reward; our learned attack can trick the agent into moving toward the opposite direction of the goal and receive a large negative reward. We also find that this attack can further reduce the reward of robustly trained agents, like SA-PPO (Zhang et al., 2020b) . The second advantage of this attack is that it requires no gradient access to the policy itself; in fact, it treats the agent as part of the environment and only needs to run it in a blackbox. Previous attacks (e.g., Lin et al. (2017) ; Pattanaik et al. (2018) ; Xiao et al. ( 2019)) are mostly gradient based approach and need to access the values or gradients to a policy or value function. Even without access to gradients, the overall learning process is still just a MDP and we can apply any popular modern DRL methods to learn the adversary.

3.2. FINDING THE OPTIMAL POLICY UNDER A FIXED ADVERSARY

We now investigate SA-MDP when we fix the adversary ν and find an optimal policy. In Lemma 2, we show that this case SA-MDP becomes a POMDP: Lemma 2 (Optimal policy under fixed adversary) Given an SA-MDP M = (S, A, B, R, p, γ) and a fixed and stationary adversary ν(ŝ|s), there exists a POMDP M = (S, A, Ω, O, R, p, γ) s ← s 15: end for 16: return D such that the optimal policy of M is the optimal policy π for SA-MDP given the fixed ν, where Ω = s∈S {s |s ∈ supp(ν(•|s))}, O(o|s) = ν(ŝ|s) where Ω is the set of observations, and O defines the conditional observational probabilities (in our case it is conditioned only on s and does not depend on actions). To prove Lemma 2, we construct the a POMDP with the observations defined on the support of all ν(•|s), s ∈ S and the observation process is exactly the process of generating an adversarially perturbed state ŝ. This POMDP is functionally identical to the original SA-MDP when ν is fixed. This lemma unveils the connection between POMDP and SA-MDP: SA-MDP can be seen as a version of "robust" POMDP where the policy needs to be robust under a set of observational processes (adversaries). SA-MDP is different from robust POMDP (RPOMDP) (Osogami, 2015; Rasouli & Saghafian, 2018) , which optimizes for the worst case environment transitions. As a proof of concept, we use a modern POMDP solver, SARSOP (Kurniawati et al., 2008) to solve the GridWorld environment in Figure 1 to find a policy that can defeat the adversary. The POMDP solver produces a finite state controller (FSC) with 8 states (FSC is an efficient representation of history dependent policies). This FSC policy can almost eliminate the impact of the adversary and receive close to perfect reward, as shown in Figure 1c . Unfortunately, unlike MDPs, it is challenging to solve an optimal policy for POMDPs; state-of-theart solvers (Bai et al., 2014; Sunberg & Kochenderfer, 2017) can only work on relatively simple environments which are much smaller than those used in modern DRL. Thus, we do not aim to solve the optimal policy. We follow (Wierstra et al., 2007) to use recurrent policy gradient theorem on POMDPs and use LSTM as function approximators for the value and policy networks. We denote h t = {ŝ 0 , a 0 , ŝ1 , a 1 • • • , ŝt } containing all history of states (perturbed states ŝ in our setting) and actions. The policy π parameterized by θ takes an action a t given all observed history h t , and h t is typically encoded by a recurrent neural network (e.g., LSTM). The recurrent policy gradient theorem (Wierstra et al., 2007) shows that ∇ θ J ≈ 1 N N n=1 T t=0 ∇ θ log π θ (a n t |h n t )r n t ( ) where N is the number of sampled episodes, T is episode length (for notation similarity, we assume each episode has the same length), and h n t is the history of states for episode n up to time t, and r n t is the reward received for episode n at time t. We can then extend Eq. 2 to modern DRL algorithms such as proximal policy optimization (PPO), similarly as done in (Azizzadenesheli et al., 2018) , by using the following loss function: J(θ) ≈ 1 N N n=1 T t=0 min π θ (a n t |h n t ) π θold (a n t |h n t ) A h n t , clip( π θ (a n t |h n t ) π θold (a n t |h n t ) , 1 -, 1 + )A h n t (3) where A h n t is a baseline advantage function for episode n time step t, which is based on a LSTM value function. is the clipping threshold in PPO. The loss can be optimized via a gradient based optimizer and θ old is the old policy parameter before optimization iterations start. Although a LSTM or recurrent policy network has been used in the DRL setting in a few other works (Hausknecht & Stone, 2015; Azizzadenesheli et al., 2018) , our focus is to improve agent robustness rather than learning a policy purely for POMDPs. In our empirical evaluation, we will compare feedforward and LSTM policies under our ATLA framework.

3.3. ALTERNATING TRAINING WITH LEARNED ADVERSARIES (ATLA)

As we have discussed in Section 3.1, we can solve an optimal adversary given any fixed policy. In our ATLA framework, we train such an adversary online with the agent: we first keep the agent and optimize the adversary; the adversary is also parameterized as a neural network. Then we keep the adversary and optimize the agent. Both adversary and agent can be updated using a policy gradient algorithm such as PPO. We show our full algorithm in Algorithm 2.

Algorithm 2 Alternating Training with Learned Adversaries (ATLA)

Input: Environment E, number of iterations N iter , and batch size B. 1: Initialize the agent's actor network π(a|ŝ) with parameters θ. 2: Initialize the adversary's actor network ν(ŝ|s) with parameters φ. 3: for i = 1 to N iter do 4: for j = 1 to N π do 5: Run π θ with fixed ν φ to collect a set of trajectories D π := {(ŝ k,j t , a k,j t , r k,j t , ŝk,j t+1 )} end for 12: end for Our algorithm is designed to use a strong and learned adversary that tries to find intrinsic weakness of the policy, and to obtain a good reward the policy must learn to defeat such an adversary. In other words, it attempts to solve the SA-MDP problem directly rather than relying on explicit regularization on the function approximator like the approach in (Zhang et al., 2020b) . In our empirical evaluation, we show that such regularization can be unhelpful in some environments and harmful for performance when evaluating the agent without attacks. The difference between our approach and previous adversarial training approaches such as (Pattanaik et al., 2018 ) is that we use a stronger adversary, learned online with the agent. Our empirical evaluation finds that using such a learned "optimal" adversary in training time allows the agent to learn a robust policy generalized to different types of strong adversarial attacks during test time. Additionally, it is important to distinguish between the original state s and the perturbed state ŝ. We find that using s instead of ŝ to train the advantage function and policy of the agent leads to worse performance, as it does not follow the theoretical framework of SA-MDP.

4. EXPERIMENTS

"Optimal" attack on DRL agentsfoot_0 In section 3.1 we show that it is possible to cast the optimal adversary finding problem as an MDP problem. In practice, the environment dynamics are unknown but model-free RL methods can be used to approximately find this optimal adversary. In this section, we use PPO to train an adversary on four OpenAI Gym MuJoCo continuous control environments. Table 1 : Average episode rewards ± standard deviation over 50 episodes on PPO and SA-PPO agents. We report natural rewards (no attacks) and rewards under six adversarial attacks, including a simple random noise attack, the critic based attack in Pattanaik et al. (2018) ,MAD and RS attacks in Zhang et al. (2020b) , Snooping attack proposed in Inkawhich et al. (2019) , and the optimal attack proposed in this paper. In each row we bold the best (lowest) attack reward over all five attacks. "Optimal" attack is better than other attacks in all environments, sometimes by a large margin. (Zhang et al., 2020b) . Additionally, we include the blackbox Snooping attack (Inkawhich et al., 2019) . For all attacks we consider B(s) as a ∞ norm ball around s with radius , set similarly as in (Zhang et al., 2020b) . During testing, we run the agents without attacks as well as under attacks for 50 episodes and report the mean and standard deviation of episode rewards. In Table 1 our "optimal" attack achieves noticeably lower rewards than all the other five attacks. We illustrate a few examples of attacks in Figure 3 . For RS and "optimal" attacks, we report the best (lowest) attack reward obtained from different hyper-parameters. Evaluation of ATLA In this experiment, we study the effectiveness of our proposed ATLA method. Specifically, we use PPO as our policy optimizer. For policy networks, we have two different structures: the original fully connected (MLP) structure, and an LSTM structure which takes historical observations. The LSTMs are trained using backpropagation through time for up to 100 steps. In Table 2 we include the following methods for comparisons: • PPO (vanilla) and PPO (LSTM): PPO with a feedforward NN or LSTM as the policy network. • SA-PPO (Zhang et al., 2020b) : the state-of-the-art approach for improving the robustness of DRL in continuous control environments, using a smooth policy regularization on feedforward NNs solved by convex relaxations. • Adversarial training using critic attack (Pattanaik et al., 2018) : a previous work using critic based attack to generate adversarial observations in training time, and train a feedforward NN based agent with this relatively weak adversary. • ATLA-PPO (MLP) and ATLA-PPO (LSTM): Our proposed method trained with a feedforward NN (MLP) or LSTM as the policy network. The agent and adversary are trained using PPO with independent value and policy networks. For simplicity, we set N π = N ν = 1 in all settings. • ATLA-PPO (LSTM) +SA reg: Based on ATLA-PPO (LSTM), but with an extra adversarial smoothness constraint similar to those in SA-PPO. We use a 2-step stochastic gradient Langevin dynamics (SGLD) to solve the minimax loss, as convex relaxations of LSTMs are expensive. For each agent, we report its "natural reward" (episode reward without attacks) and best attack reward in Table 2 . To comprehensively evaluate the robustness of agents, the best attack reward is the lowest episode reward achieved by all six types attacks in Table 1 , including our new "optimal" attack (these attacks include hundreds of independent adversaries for attacking a single agent, see Appendix A.1 for more details). For reproducibility, for each setup we train 21 agents, attack all of them and report the one with median robustness. We include detailed hyperparameters in A.5. In Table 2 we can see that vanilla PPO with MLP or LSTM are not robust. For feedforward (MLP) agent policies, critic based adversarial training (Pattanaik et al., 2018) is not very effective under our suite of strong adversaries and is sometimes only slightly better than vanilla PPO. ATLA-PPO (MLP) outperforms SA-PPO on Hopper and Walker2d and is also competitive on HalfCheetah; for high dimensional environments like Ant, the robust function approximator regularization in SA-PPO is more effective. For LSTM agent policies, compared to vanilla PPO (LSTM) agents, ATLA-PPO (LSTM) can significantly improve agent robustness; a LSTM agent trained without a robust training procedure like ATLA cannot improve robustness. We find that LSTM agents tend to be more robust than their MLP counterparts, validating our findings in Section 3.2. ATLA-PPO (LSTM) is better than SA-PPO on Hopper and Walker2d. In all settings, especially for high dimensional environments like Ant, our ATLA approach that also includes State-Adversarial regularization (ATLA-PPO +SA Reg) outperforms all other baselines, as this combination improves both the intrinsic robustness of policy and the robustness of function approximator. A robust function approximator can be insufficient For some environments, SA-PPO method has its limitations -even using an increasingly larger regularization parameter κ (which controls how robust the function approximator needs to be), we still cannot reach the same performance as our ATLA agent (Figure 4 ). Additionally, when a large regularization is used, agent performance becomes much worse. In Figure 4 , under the largest κ = 1.0, the natural reward (1436 ± 96) is much lower than other agents reported in Table 2 .

5. CONCLUSION

In this paper, we first propose the optimal adversarial attack on state observations of RL agents, which is significantly stronger than many existing adversarial attacks. We then show the alternating training with learned adversaries (ATLA) framework to train an agent together with a learned optimal adversary to effectively improve agent robustness under attacks. We also show that a history dependent policy parameterized by a LSTM can be helpful for robustness. Our approach is orthogonal to existing regularization based techniques, and can be combined with state-adversarial regularization to achieve state-of-the-art robustness under strong adversarial attacks.

A APPENDIX A.1 FULL RESULTS OF ALL ENVIRONMENTS UNDER DIFFERENT TYPES OF ATTACKS

In Table 2 , we only include the best attack rewards (lowest rewards over all attacks). In Table 3 we list the rewards under each specific attack. Note that, Robust Sarsa (RS) attack and our "optimal" policy attack both have hyperparameters. For RS attack we use the same set of 30 different settings of hyperparameters as in (Zhang et al., 2020b) to train a robust value function to attack the network. The reported RS attack result for each agent is the strongest one over the 30 trained value functions. For Snooping based attack, we use the "imitator" attack proxy as it was the strongest reported in (Inkawhich et al., 2019) , and we attack every step of the agent. The imitator is a MLP or LSTM network according to agent policy network. We use the same loss KL divergence function as in the MAD attacks for this Snooping attack. We first collect state-action pairs for 100 episodes to train the "imitators", whose network structures are the same as the corresponding agents. In the test time, we first run MAD attack on it the "imitator" and then input the generated perturbed observation to the agent in a transfer attack fashion. For our "optimal" policy attack, the hyperparameters are PPO training parameters for the adversary (including the learning rate of the adversary policy network, learning rate of the adversary value network, the entropy regularization parameter and the ratio clip for PPO). We use a grid search of these hyperparameters to train an adversary that is as strong as possible, resulting in 100 to 200 adversaries produced for each agent. The reported optimal attack rewards is the lowest reward among all trained adversaries. Under this comprehensive adversarial evaluation, each agent is tested using hundreds of adversaries and the strongest adversary determines the true robustness of an agent. Table 3 : Average episode rewards ± standard deviation over 50 episodes on five baselines and SA-PPO (Zhang et al., 2020b) . We report natural episode rewards (no attacks) and episode rewards under six adversarial attacks, including a simple random noise attack, the critic based attack in (Pattanaik et al., 2018) , MAD and RS attacks in Zhang et al. (2020b) , Snooping attack proposed in Inkawhich et al. (2019) , and the optimal attack proposed in this paper. In each row we bold the best (lowest) attack reward over all five attacks. The row for the most robust method is highlighted. In Table 3 we only report the agent performance at the end of training. In this subsection, we evaluate our agent performance during 20%, 40%, 60% and 80% of total training epochs using Robust Sarsa (RS) attacks. The results are presented in Table 4 . The overall trend is that agents are getting stronger over time ("RS attack reward" is increasing), achieving better robustness in later checkpoints.

A.3 NETWORK STRUCTURE

For fully connected networks, we use the same network as in (Zhang et al., 2020b) , which contains 2 hidden layers with 64 hidden neurons each layer, for both policy and value networks, for both the agent and the adversary. For LSTM agents, we use a single layer LSTM with 64 hidden neurons, along with an input embedding layer projecting state dimension to 64 and an output layer projecting 64 to output dimension. For LSTM agents, when conducting the "optimal" attack, we also use a LSTM network for the adversary to ensure the adversary is powerful enough. A.4 HYPERPARAMETER FOR THE LEARNING-BASED "OPTIMAL" ATTACK Our "optimal" attacks uses policy gradient methods to learn the optimal adversary during agent testing, and each learning process involves the selection of hyperparameters. Specifically, the hyperparameters include the learning rates of the adversary's policy and value networks, the entropy coefficient, and the annealing of the learning rate. To reduce search space, for ATLA agents, the learning rates of the testing phase adversary's policy and value networks are chosen ranging from 0.3X to 3X of the rates of adversary's policy and value networks used in training. For other agents trained without an adversary, the learning rates of the testing phase adversary's policy and value networks are chosen ranging from 0.3X to 3X of the learning rates of the agent's policy and value networks. We tested both linearly annealed learning rate and non-annealing learning rate. The adversary's entropy coefficient is chosen form 0 and 0.003. The final results reported in all tables are the best (lowest) reward achieved by the "optimal" attacks among all hyperparameter configurations. Typically this includes around 100 to 200 different adversaries trained with different hyperparameters. This guarantees the strength of this attack and allows a comprehensive evaluation of the robustness of all agents. A.5 HYPERPARAMETERS FOR ATLA PERFORMANCE EVALUATION Hyperparameters for PPO (vanilla) For the Walker2d and Hopper environment, we use the same set of hyperparameters as in (Zhang et al., 2020b) ; the hyperparameters were originally from (Engstrom et al., 2020) and found using a grid search experiment. We found that this set of hyperparameters work well. For HalfCheetah and Ant environment, we use a grid search of hyperparameters, including the learning rate of the policy network, learning rate of the value network and the entropy bonus coefficient. For Hopper, Walker2d and HalfCheetah environments, we train for 2 million steps (2 million environment interactions). For Ant, we train for 10 million steps. Training for longer may slightly improve agent performance under no attacks, but has no impact for performance under strong adversarial attacks. Hyperparameters for PPO (LSTM) For PPO (LSTM), we conduct a smaller scale hyperparameter search. We search hyperparameter values that are close to the optimal ones found for the PPO vanilla agent. We train these LSTM agents for the same steps as those in vanilla PPO. Hyperparameters for SA-PPO We use the same value for all hyperparameters as in vanilla PPO except SA-PPO's extra κ for the strength of SA-PPO regularization. For κ, we choose from 1×10 -6 to 1. We train agents with each κ 21 times and choose the κ value whose median agent has the highest worst-case reward under all attacks. Hyperparameters for ATLA-PPO For ATLA-PPO, we have hyperparameters for both agent and adversary. We keep all agent hyperparameters the same as those in vanilla MLP/LSTM agents, except for the entropy bonus coefficient. We find that sometimes we need a larger entropy bonus co-Table 5 : Hyperparameters for all environments and settings. For vanilla environments, we use the hyperparameters from Zhang et al. (2020b) and Engstrom et al. (2020) if they are available for that environment (Hopper and Walker2d). Other environments' hyperparameter for the vanilla PPO model is found by a grid search. For SA-PPO and ATLA-PPO (MLP), the same set of hyperparameters as in the vanilla models are used, except that for SA-PPO we tune the parameter κ and for ATLA-PPO (MLP) we tune the entropy bonus coefficients as well as learning rates for the adversary. For LSTM models, we first tune the vanilla LSTM PPO models and find the best learning rates, keep using them in all LSTM based models. efficient in ATLA to allow sufficient exploration of the agent, as learning with an adversary is harder than learning in attack-free environments. For the adversary, we run a small-scale hyperparameter search on the learning rate of adversary policy and value networks, and the entropy bonus coefficient for the adversary. To reduce the number of hyperparameters for searching, we use values close to those of the agent. We set N ν = N π = 1 in all experiments and did not tune this hyperparameter. For ATLA, we train 5 million steps for Hopper, Walker and HalfCheetah and 10 million steps for Ant. We find that similar to the observations in (Madry et al., 2018) , training with an adversary typically requires more steps to converge, however in all our environments the training does reliably converge. Agent selection For each setup, we repeat the experiments using the same set of hyperparameters for 21 times due to the high performance variance in RL. We then attack all the agents using random, critic, MAD and RS attacks. We use the lowest reward among all attacks as a metric to rank those agents. Then, we select the agent with median robustness as our final agent. This final agemt is then attacked using the "optimal" attack to further reduce its reward. The numbers we report in Table 2 are not from the best runs, but the runs with median robustness. This is done to improve reproducibility as RL training process can have high variance.



Code for the optimal attack and ATLA available at https://github.com/huanzhang12/ATLA_robust_RL



(a) Path in unperturbed environment (found by policy iteration). Agent's reward = +1. Black arrows and numbers show actions and value function of the agent. (b) Path under the optimal adversary. Agent's reward = -∞. Red arrows and numbers show actions and value function of the optimal adversary (Section 3.1).

Figure 2: SA-MDP introduces an adversary on state observations in a MDP.

Figure 3: Our "Optimal" Attack and Robust Sarsa attack (a previous strong attack proposed in Zhang et al. (2020b)) on Ant and HalfCheetah environments. Previous strong attacks make the agent fail and receive a small positive reward (less than 1/10 of the reward without attack). Our attack is strong enough to trick the agent into moving to the opposite direction, receiving a large negative reward.

Figure 4: The performance under the strongest attack for SA-PPO Hopper with different regularization κ. Even we increase regularization, it cannot outperform our ATLA agents.

Algorithm 1 Learning an "optimal" adversary for perturbations on state observations Input: Policy π(•|s) under attack, number of iterations N iter , batch size B, perturbation set B(s)1: initialize adversary ν φ (•|s) parameterized by a neural network with parameters φ, 2: for i = 1 to N iter do

Average episode rewards ± standard deviation over 50 episodes on ATLA agents and baselines. We report natural rewards (no attacks) and the best (lowest) attack rewards among six types of adversarial attacks, including a simple random noise attack, the critic based attack in(Pattanaik et al., 2018), MAD and RS attacks inZhang et al. (2020b), Snooping attack proposed inInkawhich et al. (2019), and the optimal attack proposed in this paper. For each environment, we bold the most robust agent. Since both RS attack and our "optimal" attack are parameterized attacks, the "best attack" column represents the worst case agent performance under hundreds of adversaries. See Appendix A.1 for more details.

Natural and RS attack rewards of ATLA-PPO (LSTM)+ SA Reg checkpoints during training. We report Average rewards± standard deviation over 50 episodes.

