VULNERABILITY-AWARE POISONING MECHANISM FOR ONLINE RL WITH UNKNOWN DYNAMICS

Abstract

Poisoning attacks on reinforcement learning (RL) systems exploit vulnerabilities of RL algorithms and can cause the learning process to fail. However, prior works on poisoning RL usually either unrealistically assume the attacker knows the underlying Markov Decision Process (MDP), or directly apply poisoning methods from supervised learning to RL. In this work, we build a generic poisoning framework for online RL via a comprehensive investigation of heterogeneous poisoning models in RL. Without any prior knowledge of the MDP, we propose a strategic poisoning algorithm called Vulnerability-Aware Adversarial Critic Poison (VA2C-P), which works for on-policy deep RL agents, closing the gap that no poisoning method previously existed for policy-based RL agents. VA2C-P uses a novel metric, the stability radius in RL, which measures the vulnerability of RL algorithms. Experiments on multiple deep RL agents and multiple environments show that our poisoning algorithm successfully prevents agents from learning a good policy, or teaches the agents to converge to a target policy, with a limited attacking budget.

1. INTRODUCTION

Although reinforcement learning (RL), especially deep RL, has been successfully applied in various fields, the security of RL techniques against adversarial attacks is not yet well understood. In real-world scenarios, including high-stakes ones such as autonomous driving and healthcare systems, a bad decision may lead to a tragic outcome. Should we trust the decision made by an RL agent? How easy is it for an adversary to mislead the agent? These questions are crucial to ask before deploying RL techniques in many applications. In this paper, we focus on poisoning attacks, which occur during training and influence the learned policy. Since RL training is known to be very sample-hungry, the agent may have to constantly interact with the environment to collect data, which opens up many opportunities for an attacker to poison the collected training samples. Therefore, understanding poisoning mechanisms and studying the vulnerabilities of RL are crucial for guiding defense methods. However, existing works on adversarial attacks in RL mainly study test-time evasion attacks (Chen et al., 2019), where the attacker crafts adversarial inputs to fool a well-trained policy but does not cause any change to the policy itself. Motivated by the importance of understanding RL security in the training process and the scarcity of relevant literature, in this paper we investigate how to poison RL agents and how to characterize the vulnerability of deep RL algorithms. In general, RL is an "online" process: an agent rolls out experience from the environment with its current policy, uses the experience to improve its policy, then uses the new policy to roll out new experience, and so on. Poisoning in online RL is significantly different from poisoning in classic supervised learning (SL), even online SL, and is more difficult due to the following challenges. Challenge I - Future Data Unavailable in Online RL.
Poisoning approaches in SL (Muñoz-González et al., 2017; Wang & Chaudhuri, 2018) usually require access to the whole training dataset, so the attacker can decide the optimal poisoning strategy before learning starts. However, in online RL, the training data (trajectories) are generated by the agent while it is learning. Although the optimal poison should work in the long run, the attacker can only access and change the data of the current iteration, since the future data are not yet generated. Challenge II - Data Samples No Longer i.i.d. It is well known that in RL, data samples (state-action transitions) are not i.i.d., which makes learning challenging, since one should consider the long-term reward rather than the immediate result. We notice that non-i.i.d. data samples also make poisoning attacks challenging. For example, suppose an attacker wants to reduce the agent's total reward in the task shown in Figure 1: at state $s_1$, the attacker finds that $a_1$ is less rewarding than $a_0$; if the attacker only looks at the immediate reward, he will lure the agent into choosing $a_1$. However, following $a_1$ eventually leads the agent to $s_{10}$, which has a much higher reward. Challenge III - Unknown Dynamics of the Environment. Although Challenges I and II can be partially addressed by predicting future trajectories or steps, this requires prior knowledge of the dynamics of the underlying MDP. Many existing RL poisoning works (Rakhsha et al., 2020; Ma et al., 2019) assume the attacker has perfect knowledge of the MDP and then compute the optimal poisoning. However, in many real-world environments, knowing the dynamics of the MDP is difficult. Although the attacker could potentially interact with the environment to build an estimate of the environment model, the cost of interacting with the environment can be unrealistically high, for instance in market making (Spooner et al., 2018).
In this paper, we study a more realistic scenario where the attacker does not know the underlying dynamics of the MDP and cannot directly interact with the environment either. Thus, the attacker learns about the environment only from the agent's experience. We systematically investigate poisoning in RL by considering all the aforementioned RL-specific challenges. Previous works either do not address any of the challenges or only address some of them. Behzadan & Munir (2017) achieve policy induction attacks for deep Q networks (DQN). However, they treat the output actions of DQN similarly to labels in SL, and do not consider Challenge II, i.e., that the current action will influence future interactions. Ma et al. (2019) propose a poisoning attack for model-based RL, but they suppose the agent learns from a batch of given data, not considering Challenge I. Rakhsha et al. (2020) study poisoning for online RL, but they require perfect knowledge of the MDP dynamics, which is unrealistic as stated in Challenge III.

Summary of Contributions. (1) We propose a practical poisoning algorithm called Vulnerability-Aware Adversarial Critic Poison (VA2C-P) that works for deep policy gradient learners without any prior knowledge of the environment. To the best of our knowledge, VA2C-P is the first practical algorithm that poisons policy-based deep RL methods. (2) We introduce a novel metric, called stability radius, to characterize the stability of RL algorithms, measuring and comparing the vulnerabilities of RL algorithms in different scenarios. (3) We conduct a series of experiments on various environments and state-of-the-art deep policy-based RL algorithms, which demonstrate RL agents' vulnerability even to weak attackers with limited knowledge and attack budget.

2. RELATED WORK

The main focus of this paper is on poisoning RL, an area that has emerged in the past few years. We survey related works on adversarial attacks in SL and evasion attacks in RL in Appendix A, as they are out of the scope of this paper. Targeted Poisoning Attacks for RL. Most RL poisoning research works on targeted poisoning, also called policy teaching, where the attacker leads the agent to learn a pre-defined target policy. Policy teaching can be achieved by manipulating the rewards (Zhang & Parkes, 2008; Zhang et al., 2009) or the dynamics (Rakhsha et al., 2020) of the MDP. However, these methods require the attacker to not only have prior knowledge of the environment (e.g., the dynamics of the MDP), but also have the ability to alter the environment (e.g., change the transition probabilities), which is often unrealistic or difficult in practice. Poisoning RL with Omniscient Attackers. Most RL poisoning literature with guarantees (Rakhsha et al., 2020; Ma et al., 2019) assumes omniscient attackers, who not only know the learner's model but also the underlying MDP. However, as motivated in the introduction, the underlying MDP is usually either unknown or too complex in practice. Some works poison RL learners by changing the reward signals sent from the environment to the agent. For example, Ma et al. (2019) introduce a policy teaching framework for batch-learning model-based agents; Huang & Zhu (2019) propose a reward-poisoning attack model and provide a convergence analysis for Q-learning; Zhang et al. (2020b) present an adaptive reward poisoning method for Q-learning (which also extends to DQN) and analyze the safety thresholds of RL. These papers all assume the attacker knows not only the model of the agent but also the parameters of the underlying MDP, which may be possible in a tabular MDP, but is hard to realize in large environments and modern deep RL systems.
In contrast, in this paper we consider non-omniscient attackers who do not know the underlying MDP or environment. Non-omniscient attackers can be further divided into two categories: white-box attackers, who know the learner's model/parameters, and black-box attackers, who do not. Both eavesdrop on the interactions between the learner and the environment. Black-box Poisoning for Value-based Learners. Although there are many successful black-box evasion approaches (Xinghua et al., 2020; Inkawhich et al., 2020), black-box poisoning in RL is rare. Behzadan & Munir (2017) propose a black-box attacking method for a value-based learner (DQN), which does not require the attacker to know the learner's model or the underlying MDP. In their work, the attacker induces the DQN agent to output the target action by perturbing the state with the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) in every step. However, the data-correlation problem of RL (Challenge II) is not considered, and the FGSM attack does not work for policy-based methods due to their high stochasticity, as we show in our experiments. In this paper, we propose a new poisoning algorithm for policy-based deep RL agents, which achieves both non-targeted and targeted attacks. We do not require any prior knowledge of the environment, and our algorithm works not only when the attacker knows the learner's model (white-box), but also when the learner's model is hidden (black-box).

3.1. NOTATIONS AND PRELIMINARIES

In RL, an agent interacts with the environment by taking actions, observing states, and receiving rewards. The environment is modeled by a Markov Decision Process (MDP), denoted by a tuple $M = \langle S, A, P, R, \gamma, \mu \rangle$, where $S$ is the state space, $A$ is the action space, $P$ is the transition kernel, $R$ is the reward function, $\gamma \in (0,1)$ is the discount factor, and $\mu$ is the initial state distribution. A trajectory $\tau \sim \pi$ generated by policy $\pi$ is a sequence $s_1, a_1, r_1, s_2, a_2, \cdots$, where $s_1 \sim \mu$, $a_t \sim \pi(\cdot|s_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t)$ and $r_t = R(s_t, a_t)$. The goal of an RL agent is to find an optimal policy $\pi^*$ that maximizes the expected total reward $\eta$, defined as $\eta(\pi) = \mathbb{E}_{\tau\sim\pi}[r(\tau)] = \mathbb{E}_{s_1,a_1,\cdots \sim \mu,\pi,P,R}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t\right]$. We use an overhead check sign on a variable to denote that the variable is poisoned. For example, if the attacker perturbs a reward $r_t$, the poisoned reward is denoted as $\check{r}_t$; if a policy $\pi$ is updated with poisoned observations, it is denoted as $\check{\pi}$. Procedure of Online Learning. We consider a classical online policy-based RL setting, where the learner iteratively updates its policy $\pi$, parametrized by $\theta$, through $K$ iterations of interaction with the environment. For notational simplicity, we omit $\theta$ and use $\pi_k$ to denote $\pi_{\theta_k}$, the learner's policy at iteration $k$. The online learning process is described below.
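For concreteness, the expected total reward defined above sums discounted rewards along a trajectory. A minimal sketch of that computation (the helper name `discounted_return` is ours, not from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t=1}^{T} gamma^(t-1) * r_t for one finite trajectory."""
    total = 0.0
    for t, r in enumerate(rewards):  # t = 0 here corresponds to the paper's t = 1
        total += (gamma ** t) * r
    return total

# A 3-step trajectory with rewards r_1 = r_2 = r_3 = 1:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The quantity $\eta(\pi)$ is then the expectation of this value over trajectories drawn from $\mu$, $\pi$, $P$ and $R$.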

3.2. THE PROCEDURE OF ONLINE LEARNING AND POISONING

[Figure 2: the online learning-poisoning procedure, alternating the learner's policies $\pi_0, \pi_1, \pi_2, \cdots$, clean observations $O_k$, poisoned observations $\check{O}_k$, and environments $M_k$.]

At iteration $k = 1, \cdots, K$:

(1) The agent uses the current policy $\pi_k$ to roll out observation $O_k = (O^s_k, O^a_k, O^r_k)$ from environment $M_k$, where $O^s_k = [s_1, s_2, \cdots]$, $O^a_k = [a_1, a_2, \cdots]$, $O^r_k = [r_1, r_2, \cdots]$ are respectively the sequences of states, actions and rewards generated at iteration $k$.

(2) The agent updates its policy parameters $\theta$ with its algorithm $f$. Most policy-based algorithms perform on-policy updating, i.e., update the policy only with the current observation $O_k$. The on-policy update can then be formalized as $\pi_{k+1} = f(\pi_k, O_k) \approx \operatorname{argmax}_\pi J(\pi, \pi_k, O_k)$, where $J$ is an objective function defined by algorithm $f$, e.g., the expected total reward $\eta(\pi)$.

Procedure of Online Poisoning. A poisoning attacker influences the learner during training by perturbing the training data. In SL, the training data consist of features and labels, and the attacker poisons them before the learning starts. In RL, however, the training data are the trajectories a learner rolls out from the environment, i.e., the observation $O = (O^s, O^a, O^r)$. At iteration $k$, the attacker eavesdrops on the interaction between the learner and the environment, obtains the observation $O_k$, may poison it into $\check{O}_k$, and sends $\check{O}_k$ to the learner before the policy update. Procedure 2 in Appendix B illustrates how the online game between the learner and the attacker proceeds. Figure 2 visualizes this online learning-poisoning procedure, where learning and poisoning are intertwined and inter-dependent.

3.3. A UNIFIED FORMULATION FOR POISONING ONLINE POLICY-BASED RL

3.3.1. ATTACKER'S POISON AIM

As defined in Section 3.1, the observation is a collection of trajectories, consisting of the observed states $O^s$, the executed actions $O^a$ and the received rewards $O^r$. When poisoning the observation, the attacker may focus only on the states, only on the actions, or only on the rewards. We call the quantity being altered the poison aim of the attacker, denoted by $D \in \{O^s, O^a, O^r\}$. For example, $D = O^s$ means the attacker chooses to attack the states. Distinguishing different poison aims is important, since in real-world applications different poison aims correspond to different behaviors of the attacker. For example, in an RL-based recommender system, the RL agent recommends an item (i.e., an action) for a user (i.e., a state), and the user may or may not click on the recommended item (i.e., a reward). An adversary might manipulate the reward (poisoning $D = O^r$), e.g., by blocking the user's click from the RL agent or creating a fake click. An adversary might also alter the state (poisoning $D = O^s$), e.g., by raising a teenage user's age, which could result in inappropriate recommendations. An adversary might also change the action (poisoning $D = O^a$), e.g., by inserting a fake recommendation into the agent's list of recommendations. Under different scenarios, the feasibility of poisoning different aims may vary. Most existing works on poisoning RL only handle one type of poison aim: Zhang et al. (2020b) and Huang & Zhu (2019) propose to poison rewards, and Behzadan & Munir (2017) assume the attacker poisons the states. In this paper, we provide a general solution for any of the poison aims to satisfy the needs of different scenarios. Our proposed method also supports a "hybrid" poison aim, where the attacker can switch aims at different iterations, as discussed in Section 5.
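The three poison aims correspond to three different edits of the same rolled-out observation. A minimal sketch of this dispatch, with all names hypothetical and a hard-coded perturbation standing in for whatever the attack algorithm would actually choose:

```python
import copy

def poison_observation(obs, aim, perturb):
    """Return a poisoned copy of obs = {'states': ..., 'actions': ..., 'rewards': ...}.

    aim     -- one of 'states', 'actions', 'rewards' (the poison aim D)
    perturb -- function applied elementwise to the chosen sequence
    """
    poisoned = copy.deepcopy(obs)          # leave the clean observation intact
    poisoned[aim] = [perturb(x) for x in poisoned[aim]]
    return poisoned

obs = {'states': [0.1, 0.2], 'actions': [1, 0], 'rewards': [1.0, -1.0]}
# D = O^r: flip the sign of every reward (e.g. blocking/faking clicks)
checked = poison_observation(obs, 'rewards', lambda r: -r)
print(checked['rewards'])  # [-1.0, 1.0]
```

A hybrid-aim attacker would simply call `poison_observation` with a different `aim` at different iterations.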

3.3.2. A POISONING FRAMEWORK FOR RL

We focus on proposing a poisoning mechanism for the above challenging online learning scenario. We first formalize the poisoning attack at iteration $k$ as a sequential bilevel optimization problem, Problem (Q), and explain its details in the remainder of this section.

$$
\begin{aligned}
\min_{\check{D}_k, \cdots, \check{D}_K} \quad & \sum_{j=k}^{K} \lambda_j L_A(\check{\pi}_{j+1}) && \text{((a) attacker's weighted loss)} \\
\text{s.t.} \quad & \check{\pi}_{j+1} = \operatorname*{argmax}_\pi J(\pi, \check{\pi}_j, \check{O}_j \mid \check{D}_j), \quad \forall\, k \le j \le K && \text{((b) imitate the learner)} \\
& \textstyle\sum_{j=1}^{K} \mathbb{1}\{\check{D}_j \ne D_j\} \le C && \text{((c) limited budget)} \\
& U(D_j, \check{D}_j) \le \epsilon, \quad \forall\, 1 \le j \le K && \text{((d) limited power)}
\end{aligned}
\tag{Q}
$$

(a) Attacker's Weighted Loss. $L_A(\check{\pi})$ measures the attacker's loss w.r.t. a poisoned policy $\check{\pi}$. As the definition of poisoning implies, the attacker influences or misleads the learner's policy. $\lambda_{k:K}$ are the weights of future attacker losses, controlling how much the attacker values the poisoning results of different iterations. The goal of the attacker is either (1) non-targeted poisoning, which minimizes the expected total reward of the learner, i.e., $L_A = \eta(\check{\pi})$, or (2) targeted poisoning, which induces the learner to learn a pre-defined target policy, i.e., $L_A = \mathrm{distance}(\check{\pi}, \pi^{\dagger})$, where $\mathrm{distance}(\check{\pi}, \pi^{\dagger})$ can be any distance measure between a learned policy $\check{\pi}$ and a target policy $\pi^{\dagger}$. Note that the targeted poisoning objective can usually be computed directly given the target policy, while non-targeted poisoning has a "reward-minimizing" objective, which is the reverse of the learner's objective. Without any prior knowledge of the environment, non-targeted poisoning is usually more difficult, as the attacker first needs to learn "what is the worst way" (which is as difficult as the learning problem faced by an RL agent) and then lead the learner that way (which is as difficult as a targeted poisoning problem, assuming that leading an agent to different policies is roughly equally challenging). However, most existing poisoning works focus on targeted poisoning, which requires the attacker to know a pre-defined target policy.
Thus, in this paper we put more effort into solving the reward-minimizing poisoning problem, which can deprave the policy without any prior knowledge. (b) Imitate the Policy-Based Learner. To confidently mislead a learner, the attacker needs to predict how the learner will behave under the poison, which can be achieved by imitating the learner using the learner's observations. More specifically, at the $j$-th iteration, the attacker estimates the learner's policy as $\check{\pi}_j$, called the imitating policy. (c,d) Limited Budget and Limited Power. In practice, the ability of an attacker is usually restricted by some constraints. For the online poisoning problem, we consider the attacker's constraints in two forms: (1) (attack budget $C$) the total number of iterations the attacker can poison does not exceed $C$; (2) (attack power $\epsilon$) in one iteration, the total change $U(D_k, \check{D}_k)$ between $D_k$ and $\check{D}_k$ cannot be larger than $\epsilon$. The attack power controls the amount of perturbation, as commonly used in the adversarial learning literature. The attack budget constrains the frequency of attacks, similar to the constraint studied by Wang & Chaudhuri (2018). Problem (Q) is a generic formulation, covering a variety of poisoning models, and specifies the best poison an attacker can execute. However, directly solving Problem (Q) is prohibitive, since (1) the future observations $O_{k+1:K}$ are unknown when poisoning the $k$-th iteration, as the attacker has no knowledge of the underlying MDP; (2) the limited-budget constraint is analogous to $\ell_0$-norm regularization, which is generally NP-hard (Nguyen et al., 2019); and (3) minimizing the attacker's loss while maximizing the learner's gain is a non-convex minimax optimization, which is a complex problem (Perolat et al., 2015). In spite of these difficulties, we introduce a practical method to approximately and effectively solve Problem (Q) in Section 4.
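Constraints (c) and (d) of Problem (Q) can be enforced mechanically: project each proposed perturbation back into the $\epsilon$-ball around the clean data (limited power), and stop after $C$ poisoned iterations (limited budget). A sketch under the assumption that $U$ is the $L_\infty$ distance (all names are ours, not the paper's):

```python
def project_to_power_ball(clean, proposed, eps):
    """Clip each entry so that max_i |proposed_i - clean_i| <= eps (L-inf power)."""
    return [min(max(p, c - eps), c + eps) for c, p in zip(clean, proposed)]

class BudgetTracker:
    """Allow at most C poisoned iterations over the whole training run."""
    def __init__(self, budget):
        self.remaining = budget

    def try_spend(self):
        """Return True (and consume one unit) if budget remains, else False."""
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True

clean = [1.0, -2.0, 0.5]
poisoned = project_to_power_ball(clean, [3.0, -2.1, -5.0], eps=0.5)
print(poisoned)  # [1.5, -2.1, 0.0] -- every entry within 0.5 of the clean value
```

Other choices of $U$ (e.g. an $L_2$ ball) would only change the projection step.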

4. VA2C-P: POISON POLICY GRADIENT LEARNERS

In this section, we propose a practical and efficient poisoning algorithm called Vulnerability-Aware Adversarial Critic Poison (VA2C-P) for policy gradient learners. Without loss of generality, we assume the loss weights $\lambda_j = 1$ for all $j = 1, \cdots, K$. Main Idea. As discussed in Section 3.3, Problem (Q) is difficult mainly because of the unknown future observations and the limited-budget constraint. In other words, it is hard to determine exactly (1) what kind of attack benefits the future the most, and (2) which iterations are most worth attacking. Thus, we propose to break Problem (Q) into two decisions: when to attack, and how to attack. The "when to attack" decision allocates the limited budget to iterations which are more likely to be influenced by the attacker, and the "how to attack" decision utilizes the limited power to minimize the attacker's loss. We introduce two mechanisms in VA2C-P, vulnerability-awareness and the adversarial critic, to make these two decisions respectively.

4.1. DECISION 1: WHEN TO ATTACK -VULNERABILITY-AWARE

To answer the question of when to attack, we identify the iterations in which the learner's policy gets more depraved by the same level of attack power. Inspired by the notion of stability in learning theory, which measures how much a machine learning algorithm changes due to a small perturbation of the input data, we formally investigate the stability of an RL algorithm, which to the best of our knowledge is the first attempt in the existing literature. Stability of RL Algorithms. We first focus on a single update of an algorithm $f$. An update $\pi' = f(\pi, O)$ is stable if a limited-power poisoning attack does not cause any difference in the output policy $\pi'$; that is, the learning algorithm produces the same result regardless of the presence of the poison. More formally, we define the stability radius of one update in Definition 1.

Definition 1 (Stability Radius of One Update). For the update $\pi' = f(\pi, O)$ of an RL algorithm, with any poison aim $D$, the $\delta$-stability radius of the update is defined as the minimum poison power needed to cause a $\delta$ change in the policy (called $\delta$-policy-discrepancy):

$$\phi_{\delta,D}(f, \pi, O) = \inf \left\{ \epsilon \;:\; \exists\, \check{D} \text{ s.t. } U(D, \check{D}) \le \epsilon \text{ and } d_{\max}[\check{\pi}' \,\|\, \pi'] > \delta, \text{ where } \check{\pi}' = f(\pi, \check{O} \mid \check{D}) \right\},$$

where the policy discrepancy is $d_{\max}[\pi_1 \| \pi_2] = \max_s d\left[\pi_1(\cdot|s) \,\|\, \pi_2(\cdot|s)\right]$ and $d[\cdot\|\cdot]$ can be any measure of distribution distance.

Remarks. (1) The one-update stability radius is defined w.r.t. the algorithm $f$, the old policy $\pi$, the clean observation $O$ and the poison aim $D$. (2) A poison with power below $\phi_{\delta,D}(f, \pi, O)$ will not cause the policy distribution to change by more than $\delta$. (3) As shown by Proposition 2 in Appendix D.1, a poison with power below $\phi_{\delta,D}(f, \pi, O)$ will not make the policy value drop by more than $O\!\left(\delta^2 \gamma (1-\gamma)^{-2} \max_{s,a} |A^{\pi}(s,a)|\right)$, where $A^{\pi}$ is the advantage function, i.e., $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$.
The stability radius measures the minimal effort needed to make the poisoned next-policy $\check{\pi}'$ notably different from the clean next-policy $\pi'$ that the learner would obtain if no poison were applied. Assuming $\max_{s,a} |A^{\pi}(s,a)|$ does not change drastically for the learner's policy during training, an attack that causes a higher policy discrepancy between $\check{\pi}'$ and $\pi'$ can cause a larger drop in policy value. Therefore, the idea of the vulnerability-aware attack is to estimate the vulnerability of each update and attack the most vulnerable ones. More specifically, if the attacker finds that an $\epsilon$-powered attack results in a policy discrepancy larger than some threshold $\delta$, it can conclude that $\phi_{\delta,D}(f, \pi, O) \le \epsilon$, so the current update is relatively vulnerable. In practice, we have a fixed budget $C$ instead of $\delta$, so we perform the vulnerability check in an adaptive way: attacking the $C$ iterations in which $\epsilon$-powered attacks can trigger the highest policy discrepancies. Section 4.3 introduces an algorithm to realize this idea.
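A simplified sketch of this adaptive "when to attack" rule: poison iteration $k$ only if the policy discrepancy an $\epsilon$-powered attack would cause ranks within the top $C/K$ fraction of discrepancies observed so far. Class and method names are hypothetical, and this is only an approximation of the rule that the paper's Algorithm 1 specifies:

```python
import bisect

class VulnerabilityGate:
    """Adaptive 'when to attack' rule: keep a sorted history of policy
    discrepancies and attack only when the current one ranks in the top
    C/K fraction of everything seen so far (a simplified sketch)."""

    def __init__(self, budget, total_iters):
        self.ratio = budget / total_iters  # fraction of iterations to poison
        self.history = []                  # sorted past discrepancies

    def should_attack(self, discrepancy):
        bisect.insort(self.history, discrepancy)
        n = len(self.history)
        # rank of the current discrepancy among all seen so far (1 = largest)
        rank_from_top = n - bisect.bisect_left(self.history, discrepancy)
        return rank_from_top <= max(1, int(self.ratio * n))

gate = VulnerabilityGate(budget=2, total_iters=10)
decisions = [gate.should_attack(d) for d in [0.1, 0.5, 0.2, 0.9, 0.05]]
print(decisions)
```

Early iterations are accepted optimistically (the history is short); as the history grows, only discrepancies near the top survive the cut, approximating "attack the C most vulnerable updates".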

4.2. DECISION 2: HOW TO ATTACK -ADVERSARIAL CRITIC

Since the "when to attack" decision tackles the limited-budget constraint, we now turn our attention to minimizing the attacker's loss while satisfying the limited-power constraint. If the attacker has already decided to poison the $k$-th iteration due to its high vulnerability, then the decision of how to attack must be made before the learner uses the observation to update its policy. We relax the original Problem (Q) into Problem (P) as below:

$$
\begin{aligned}
\min_{\check{D}_k} \quad & L_A(\check{\pi}_{k+1}) \\
\text{s.t.} \quad & \check{\pi}_{k+1} = \operatorname*{argmax}_\pi J(\pi, \check{\pi}_k, \check{O}_k \mid \check{D}_k) \\
& U(D_k, \check{D}_k) \le \epsilon
\end{aligned}
\tag{P}
$$

Compared with Problem (Q), Problem (P) does not consider future losses, which would require the unavailable future observations. Instead, Problem (P) finds a greedy attack $\check{D}_k$ that minimizes the loss at the immediate next iteration. The solution to Problem (P) is always feasible for Problem (Q), although it might not be optimal; Appendix F.2 provides details about the optimality of Problem (Q). In practice, it is also challenging to estimate $L_A(\check{\pi})$. For targeted attacks, $L_A = \mathrm{distance}(\check{\pi}, \pi^{\dagger})$ is directly computable with a properly defined distance metric, but for non-targeted attacks, the loss $L_A = \eta(\check{\pi})$ cannot be computed directly, since $\check{\pi}$ is not the behavior policy. Although one can use importance sampling to evaluate $\eta(\check{\pi})$ with the current trajectories generated by the learner's policy, i.e., $\mathbb{E}_{\tau\sim\pi_{k-1}}\left[\frac{\check{\pi}_k(\tau)}{\pi_{k-1}(\tau)} r(\tau)\right]$, this estimator may suffer from high variance (Schulman et al., 2015b) when there are few trajectories. To address this challenge, we introduce another mechanism, the adversarial critic. Adversarial Critic. Under poisoning, the value network (if any) held by the learner usually fails to fit the correct value of its policy, since the learner does not observe the real trajectories generated by its policy. The attacker, however, observes the real trajectories before poisoning, and is able to learn the real values to make the attack stronger.
Inspired by the actor-critic method, we propose to let the attacker learn a value function (network) $\tilde{V}_\omega$ from the learner's observations, i.e., the attacker learns a critic of the learner's current policy $\pi$. The attacker can then use $\tilde{V}_\omega$ to design poisoned observations that direct the learner in a value-decreasing direction; we call this mechanism the Adversarial Critic. Using importance sampling, the attacker's loss becomes $\mathbb{E}_{s_t,a_t\sim\pi_k}\left[\frac{\pi(a_t|s_t)}{\pi_k(a_t|s_t)}\left(G(s_t,a_t) - \tilde{V}_\omega(s_t)\right)\right]$, where $G(s_t, a_t) = \sum_{i=t}^{T} \gamma^{i-t} r_i$ is the discounted future reward.
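The importance-sampled attacker loss above can be estimated from one rolled-out trajectory. A sketch with hypothetical names, where the critic $\tilde{V}_\omega$ and the two policies are abstracted as plain functions:

```python
def adversarial_critic_loss(traj, candidate_prob, behavior_prob, value_fn, gamma=0.99):
    """Estimate E_{s_t,a_t ~ pi_k}[ (pi(a|s)/pi_k(a|s)) * (G(s_t,a_t) - V_omega(s_t)) ].

    traj           -- list of (state, action, reward) tuples from the learner
    candidate_prob -- pi(a|s) under the candidate (poisoned) next-policy
    behavior_prob  -- pi_k(a|s) under the learner's current behavior policy
    value_fn       -- the attacker's critic V_omega(s), used as a baseline
    """
    T = len(traj)
    # Discounted reward-to-go: G(s_t, a_t) = sum_{i=t}^{T} gamma^{i-t} r_i
    G, running = [0.0] * T, 0.0
    for t in range(T - 1, -1, -1):
        running = traj[t][2] + gamma * running
        G[t] = running
    # Importance-weighted advantage w.r.t. the adversarial critic, averaged over steps
    loss = 0.0
    for t, (s, a, _) in enumerate(traj):
        ratio = candidate_prob(s, a) / behavior_prob(s, a)
        loss += ratio * (G[t] - value_fn(s))
    return loss / T

# Toy check: identical policies (ratio 1), zero baseline, gamma = 1.
traj = [(0, 0, 1.0), (1, 1, 1.0)]
loss = adversarial_critic_loss(traj, lambda s, a: 1.0, lambda s, a: 1.0,
                               lambda s: 0.0, gamma=1.0)
print(loss)  # G = [2.0, 1.0], so the mean is 1.5
```

For non-targeted poisoning the attacker perturbs the observation so as to push this quantity down; the baseline $\tilde{V}_\omega(s_t)$ reduces the variance of the estimate, exactly as a critic does in actor-critic learning.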

4.3. POISONING ALGORITHM VA2C-P.

Algorithm 1: Vulnerability-Aware Adversarial Critic Poison (sketch)
Input: total iterations of learning $K$; poisoning power $\epsilon$; poisoning budget $C$.
Initialize a list of policy discrepancies $\Psi = \emptyset$ and the number of poisoned iterations $c = 0$.
Initialize an adversarial critic network $\tilde{V}_\omega$ and an imitating policy network $\check{\pi}$.
For $k = 1, \cdots, K$: if $c > C$, stop poisoning; otherwise, find a candidate perturbation with power $\epsilon$, estimate the resulting policy discrepancy, and poison only if the discrepancy ranks high in $\Psi$.

In each iteration, the attacker first finds a good perturbation of the observation $O$ with the current attack power $\epsilon$, then estimates how much the output policy would change by computing the policy discrepancy between the clean next-policy and the poisoned next-policy. The attacker poisons only if this discrepancy ranks high among the historical policy discrepancies. In Line 11, we use projected gradient descent to solve Problem (P); computation details are given in Appendix E. Algorithm 1 covers both white-box and black-box settings, both of which maintain an imitating policy $\check{\pi}$ to keep track of the learner's potential status. As mentioned in Section 3.3, a white-box attacker knows the learner's current policy, so it can directly copy the learner's parameters into its imitating policy (Line 9) and predict the next-policy the learner would obtain under different observations. In contrast, a black-box attacker does not know the learner's policy, but it can update its imitating policy using the same observation the learner uses (Lines 18, 22). As claimed and verified by many black-box attacking methods (Behzadan & Munir, 2017), adversarial attacks are usually transferable: if an attack works on the imitating learner, it is also likely to work on the real learner. Note that, as assumed by Behzadan & Munir (2017), the black-box attacker knows which RL algorithm the learner is using (e.g., PPO, A2C), so it can compute its estimate of the next-policy in Lines 10 and 12.
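The projected-gradient step used to solve Problem (P) (Line 11) alternates a gradient step on the attacker's loss with a projection back into the $\epsilon$-ball. A generic sketch assuming an $L_\infty$ power constraint and a caller-supplied gradient of $L_A$ (names are ours; in practice the gradient comes from differentiating through the imitated policy update):

```python
def pgd_poison(clean, loss_grad, eps, lr=0.1, steps=20):
    """Projected gradient descent for Problem (P): step the poisoned
    observation down the attacker-loss gradient, then project it back
    into the eps-ball around the clean observation (L-inf assumed)."""
    poisoned = list(clean)
    for _ in range(steps):
        g = loss_grad(poisoned)                     # d L_A / d (poisoned obs)
        poisoned = [p - lr * gi for p, gi in zip(poisoned, g)]
        poisoned = [min(max(p, c - eps), c + eps)   # projection step
                    for p, c in zip(poisoned, clean)]
    return poisoned

# Toy check: minimizing L_A(d) = sum(d_i^2) drags the observation toward 0,
# but the projection keeps every entry within eps of the clean value.
clean = [1.0, -1.0]
result = pgd_poison(clean, lambda d: [2.0 * x for x in d], eps=0.4)
print(result)  # converges to [0.6, -0.6], the boundary of the eps-ball
```

The same loop applies to any poison aim: only the meaning of the perturbed vector and the loss gradient change.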

5. EXPERIMENTS

In this section, we evaluate the performance of VA2C-P by poisoning multiple algorithms in various environments. We demonstrate that VA2C-P can effectively reduce the total reward of a training agent, or force the agent to choose a specific policy, with limited power and budget. Moreover, VA2C-P works for heterogeneous poison aims, and works in both white-box and black-box settings. Experiment Setup. We choose 4 policy-gradient learning algorithms, namely Vanilla Policy Gradient (Sutton et al., 2000), A2C (Mnih et al., 2016), ACKTR (Wu et al., 2017) and PPO (Schulman et al., 2017), and 5 Gym environments (Brockman et al., 2016). Reward-minimizing Poisoning. Baselines. To the best of our knowledge, there is no existing poisoning algorithm against deep policy-gradient algorithms. Although some evasion methods (Inkawhich et al., 2020) also work for policy-gradient algorithms, evasion is substantially different from poisoning since it does not influence the policy. Therefore, to show the effectiveness of VA2C-P, we compare it with 3 baselines: (1) normal learning with no poison; (2) a random attacker, which randomly chooses $C$ iterations and perturbs the reward in an arbitrary direction by $\epsilon$; and (3) a simplified version of our algorithm, called Adversarial Critic Poison (AC-P), which decides "how to attack" in the same way as VA2C-P, but chooses "when to attack" randomly. Performance. We first show the reward-minimizing performance of VA2C-P on all three types of poison aims, assuming the attacker knows the learner's model (white-box attack). Figure 3 shows the rewards of various learners under different poisoning methods, with different ratios of the budget $C$ to the total number of iterations $K$. Compared with random poisoning, our proposed VA2C-P and its simplified version AC-P make the reward drop much more significantly, demonstrating the effectiveness of the "how to attack" decisions made by the adversarial critic.
VA2C-P further outperforms AC-P in almost all cases, which implies that our "when to attack" decisions based on vulnerability-awareness work well in practice. An interesting observation is that random poisoning not only fails to work well in many cases, but sometimes even facilitates the learner (Figure 3h). This phenomenon is mainly due to the uncertainty of the environment, as pointed out by Challenges I and II in Section 1; thus, a good poisoning strategy is important. Black-box Poisoning. Figure 3 demonstrates the performance of VA2C-P when the attacker knows the learner's model. When the learner's model is not available (black-box attack), VA2C-P also works, as shown in Figure 4a: black-box poisoning is still effective, although weaker than white-box poisoning due to the lack of knowledge. Hybrid-aim Poisoning. The experiments shown in Figure 3 assume that the attacker poisons one of the 3 poison aims, $O^s$, $O^a$ or $O^r$. But if the attacker happens to have access to all of these poison aims, it can do better by performing a "hybrid" poisoning: at each iteration, it evaluates the vulnerability of the algorithm w.r.t. each poison aim with the corresponding attack power, then chooses the aim with the highest vulnerability. Figure 4b shows that adaptive attacking with a hybrid poison aim can significantly outperform attacking with a fixed single poison aim. Targeted Poisoning. Baselines. Although targeted RL poisoning is studied by many works (Huang & Zhu, 2019; Rakhsha et al., 2020), few of them work for deep RL. Since no existing work focuses on deep policy-based methods, we transfer the FGSM-based targeted poisoning method proposed for DQN by Behzadan & Munir (2017) to attack policy-based learners as our baseline. This attacking method adds, for each new state, a perturbation such that the perturbed state is pushed across the decision boundary and towards the target action.
See Appendix G.2 for details. Performance. Figures 4c and 4d respectively show the targeted-attack performance on CartPole and LunarLander, where the target policies are both "always go to the right". The y-axes show the proportion of target actions among all actions taken by the learner. VA2C-P successfully leads the learner to the target policy using all types of poison aims with relatively small attack power. In contrast, FGSM fails to make the learner select the target action in most cases, even with much larger attack power. The main reason FGSM works for DQN (Behzadan & Munir, 2017) but not for policy-gradient methods is that policy-based agents generally learn stochastic policies: even if FGSM perturbs the state such that the output probability of the target action is 0.51, the agent still chooses other actions with probability 0.49. Therefore, FGSM cannot easily perform targeted poisoning against policy-based learners. More experiment results are in Appendix G.3. Note that our proposed poisoning method can also be extended to off-policy learners, as discussed in Appendix G.4.
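For reference, the FGSM baseline discussed above perturbs each state by $\epsilon$ in the sign direction of the gradient of the target action's (log-)probability w.r.t. the state. A sketch with the gradient assumed to come from the attacker's autodiff framework (function names are hypothetical):

```python
def fgsm_targeted_state(state, grad_logp_target, eps):
    """One FGSM step pushing the state toward the target action: move each
    state coordinate by eps in the direction that increases the policy's
    (log-)probability of the target action (sign of the supplied gradient)."""
    sign = lambda g: (g > 0) - (g < 0)  # -1, 0, or +1
    return [s + eps * sign(g) for s, g in zip(state, grad_logp_target)]

# Toy state with a supplied gradient: coordinates move by +/- eps.
perturbed = fgsm_targeted_state([0.2, -0.3], grad_logp_target=[1.5, -0.2], eps=0.05)
print(perturbed)
```

As the experiments indicate, nudging a stochastic policy's target-action probability just past 0.5 this way still leaves a large chance of other actions being sampled, which is why this baseline underperforms against policy-gradient learners.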

Appendix: Vulnerability-Aware Poisoning Mechanism for Online RL with Unknown Dynamics

A ADDITIONAL RELATED WORKS

Adversarial Attacks in SL. There have been many works studying adversarial attacks and defenses over the past decade (Huang et al., 2011; Steinhardt et al., 2017). Poisoning, as an important type of adversarial attack, has been well-studied in the field of machine learning (Biggio et al., 2012; Mei & Zhu, 2015), including deep learning (Muñoz-González et al., 2017; Shafahi et al., 2018). Wang & Chaudhuri (2018) consider online learning, which is similar to our setting. However, they focus on supervised learning, where the attacker knows the entire data stream, while in our RL setting the whole training data stream is not available to the attacker (as Challenge I states). Evasion Attacks in RL. Recently, adversarial RL has been attracting more and more attention, especially test-time evasion attacks, as summarized by Chen et al. (2019). Most works consider adversarial perturbations on observations, similar to adversarial examples in supervised learning. Huang et al. (2017) first show that neural network policies are vulnerable to evasion attacks on states, by generating state perturbations with FGSM. Lin et al. (2017) consider the data-correlation problem in RL and an attacker with limited ability, and propose a strategic attack method which perturbs input states only under certain conditions. In addition, Gleave et al. (2019) show that in multi-agent RL, choosing an adversarial policy can also negatively affect the victim agent, which is similar to perturbing a state into a novel one. There are plenty of follow-up works showing the vulnerability of deep RL policies, even against a limited-power black-box attacker (Xinghua et al., 2020; Inkawhich et al., 2020).

B A POISONING FRAMEWORK FOR RL

In this section, we establish a generic framework for poisoning in online RL, systematically characterizing its challenges and difficulties from multiple perspectives: the objective of poisoning, the various poison aims, and the attacker's knowledge. Our in-depth comparison with SL allows a thorough understanding of the additional vulnerability of online RL systems relative to the well-understood SL systems. Our framework also provides a clear context in which to position prior works in the literature and to compare our work with existing methods. Compared to the poisoning framework described by Huang & Zhu (2019), we provide a way to unify these poison aims in one attack model. We consider the online learning scenario, where the RL agent (the learner) does not know the dynamics or rewards of the underlying MDP M with state space S, action space A, transition dynamics P, rewards R and discount factor γ.

Settings and Notations

In online RL, the learner interacts with the environment and collects observations. The learner's algorithm, denoted by f, iteratively searches for a policy π parametrized by θ through K interactions with the environment. Before learning starts, the learner initializes a policy π_1. At each iteration k, the learner uses its previous policy π_{k-1} to roll out observations O_k from the MDP M. O_k is a concatenation of multiple trajectories, denoted as O_k = (s_k, a_k, r_k), where s_k = [s_1, s_2, ...], a_k = [a_1, a_2, ...], and r_k = [r_1, r_2, ...] are respectively the sequences of states, actions, and rewards in iteration k. Then, with the observations O_k, the learner updates its policy by attempting to solve argmax_π J(π, π_{k-1}, O_k), where J is the objective function. The policy generated by the learner's algorithm, π_k = f(π_{k-1}, O_k), does not necessarily achieve the maximum of the objective function. In this paper, an overhead check sign ˇ on a variable always denotes that the variable is poisoned; for example, if the attacker changes a reward r_t, the poisoned reward is denoted ř_t. Under poisoning, the learner updates its policy as π_{k+1} = f(π_k, Ǒ_k) ≈ argmax_π J(π, π_k, Ǒ_k). Poison Objective. We use L_A to denote the loss function of the poisoning attack, which the attacker attempts to minimize. The form of L_A is determined by the attacker's goal, which falls into one of two categories: non-targeted and targeted poisoning. In non-targeted poisoning, the attacker poisons a policy π into π̌ to minimize the learner's expected rewards; therefore the poison objective L_A is to minimize the learner's value η(π̌). In targeted poisoning, the attacker "teaches" the agent to learn a pre-defined target policy π†; therefore the poison objective L_A is defined as distance(π̌, π†).
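The rollout notation above can be made concrete with a minimal sketch of the data-collection step. All names here (`rollout`, `env_reset`, `env_step`, the toy chain MDP) are hypothetical stand-ins for illustration, not the paper's implementation:

```python
import random

def rollout(policy, env_step, env_reset, horizon=10):
    """Roll out one trajectory O = (states, actions, rewards) with the current policy."""
    s = env_reset()
    states, actions, rewards = [], [], []
    for _ in range(horizon):
        a = policy(s)                 # action produced by the learner's policy
        s_next, r = env_step(s, a)    # state and reward returned by the environment
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    return states, actions, rewards

# Toy 1-D chain MDP: moving right (+1) is rewarded.
def env_reset(): return 0
def env_step(s, a): return s + a, float(a > 0)

def policy(s):                        # a fixed stochastic policy standing in for pi_{k-1}
    return random.choice([-1, 1])

random.seed(0)
states, actions, rewards = rollout(policy, env_step, env_reset)
```

The learner would pass such (s_k, a_k, r_k) sequences to its update f; the attacker's opportunity is that these sequences cross the learner-environment boundary at every iteration.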
Most existing RL poisoning research focuses on targeted poisoning (Ma et al., 2019; Rakhsha et al., 2020), while non-targeted poisoning, although discussed by Huang & Zhu (2019), remains relatively untouched. Poison Aims. To influence the behaviors of the learner, an attacker could inject poison at multiple locations of the learner's learning process, as detailed in Figure 5. Part of the reason poisoning in RL is more challenging than in SL is that it involves more poison aims, some of which adapt with the environment, increasing the uncertainty. Poison Aim III - Poison Executor E_a. The executor of the learner could be poisoned. For example, an attacker applies a force to the agent so that the intended action "north" becomes "northeast". Pinto et al. (2017) train a robust RL agent against an executor poisoner. We denote this type of poisoning E_a-poisoning. We show in Appendix C that E_a-poisoning is equivalent to directly changing a stored in the observation O = (s, a, r, d).


Attacker's Knowledge. At the k-th iteration, what an attacker can do depends on its current knowledge set, denoted K_k. K_k may contain the underlying MDP M, the learner's algorithm f, the learner's previous policy models θ_{1:k-1}, and the previous and current observations O_{1:k}. An omniscient attacker knows everything, i.e., K^(O)_k = {M, f, θ_{1:k-1}, O_{1:k}}. Most guaranteed policy-teaching literature (Rakhsha et al., 2020; Ma et al., 2019) assumes an omniscient attacker. However, as motivated in the introduction, it is often unrealistic to know the underlying environment exactly. We discuss two more realistic settings where the attacker has only limited knowledge. A monitoring attacker has some information but no knowledge of the underlying MDP M, i.e., K^(M)_k = {f, θ_{1:k-1}, O_{1:k}}. This is especially relevant in applications where the learner's information is not secure (or even open), or where an attacker hacks the learner to steal information. A monitoring attacker is similar to a white-box attacker in supervised learning. A tapping attacker has very limited knowledge and knows the observations only, i.e., K^(T)_k = {O_{1:k}}. This is widely applicable since tapping the communication between the learner and the environment is easy. A tapping attacker is analogous to a black-box attacker in supervised learning. Behzadan & Munir (2017) consider a tapping attacker, which observes and manipulates the states but does not know the learner's parameters.

C TARGET TYPES OF ATTACKING: MORE DETAILED EXPLANATION

Figure 6: Illustrations of the three poison aims: (a) poison observation (O_r, O_s); (b) poison MDP (M_R, M_P); (c) poison executor (E_a).

C.1 POISON OBSERVATION

Consider a deep reinforcement learning algorithm which, at every iteration, uses its old policy to generate a series of observations and updates its policy with the collected observations. The observations are stored in a temporary buffer in the form of (s, a, r, d), where s, r, d are returned by the environment and a is produced by the policy itself. An attacker could sit in the middle between the learner and the environment and falsify the s, r returned by the environment before the learner receives them. (It is also possible to alter the terminal-state flags d, but such changes are relatively easy for the learner to detect, and their influence on the learner is not as large as that of s and r, so we do not discuss this attack type.) Alternatively, the attacker may hack the observation buffer to change s, r. In both cases, poisoning s and r are called O_s-poisoning and O_r-poisoning respectively. Note that when hacking the observation buffer, the attacker also has the option to manipulate a in O. We do not include this kind of attack in observation poisoning because, unlike states and rewards, the actions are taken by the learner, so it is easy for the learner to detect changes to the action sequence stored in its buffer. Moreover, we show in Section C.3 that changing a in O is similar to executor poisoning.
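A man-in-the-middle O_s-/O_r-poisoner can be sketched as a thin wrapper around the environment's step function. This is a toy illustration under assumed names (`ObservationPoisoner`, `clean_step`); a real attacker would choose the perturbations by optimization rather than by fixed shifts:

```python
class ObservationPoisoner:
    """Sits between environment and learner; falsifies s and r (O_s-/O_r-poisoning)
    before the learner stores (s, a, r, d) in its buffer."""
    def __init__(self, env_step, state_shift=0.0, reward_shift=0.0):
        self.env_step = env_step
        self.state_shift = state_shift
        self.reward_shift = reward_shift

    def step(self, s, a):
        s_next, r, done = self.env_step(s, a)
        # falsify the state and reward on the wire; terminal flag d left untouched
        return s_next + self.state_shift, r + self.reward_shift, done

def clean_step(s, a):            # toy deterministic environment
    return s + a, 1.0, False

mitm = ObservationPoisoner(clean_step, state_shift=0.5, reward_shift=-1.0)
s_next, r, d = mitm.step(0.0, 1.0)   # clean transition would be (1.0, 1.0, False)
```

The learner sees only the wrapper's output, so from its perspective the falsified (s, r) are indistinguishable from genuine feedback.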

C.2 POISON MDP

MDP poisoning can be considered "changing the reality" for the learner. For example, if an attacker decides to increase the reward for state-action pair (s, a) by ∆ at iteration k, it changes the environment's parameter R_k(s, a) to Ř_k(s, a) = R_k(s, a) + ∆. As a result, whenever the learner visits (s, a) in the current iteration, it receives Ř_k(s, a) as its reward. A more intuitive example is depicted in Figure 7, where we want to train a mouse to find the cheese in a maze. If an adversary adds another piece of cheese to the training environment, the mouse is likely to be misled. At test time the "malicious cheese" is removed, but the mouse still seeks the malicious cheese instead of the original one. Compared with observation poisoning and executor poisoning, MDP poisoning is more powerful and more difficult to defend against. An example of MDP poisoning is given by Rakhsha et al. (2020), where the attacker can force the learner to learn a target policy with guarantees.
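For a tabular MDP, reward poisoning amounts to editing the reward table itself, so that every later visit to (s, a) returns the poisoned value. The sketch below uses hypothetical names (`poison_reward_table`, the "decoy corner" entries) to mirror the mouse-and-cheese example:

```python
# Tabular reward function R[(s, a)]; the attacker edits R itself, so the change
# persists for every subsequent visit (unlike a one-shot observation falsification).
R = {("maze_exit", "enter"): 1.0, ("decoy_corner", "enter"): 0.0}

def poison_reward_table(R, s, a, delta):
    """Return a poisoned copy with R(s, a) increased by delta."""
    poisoned = dict(R)
    poisoned[(s, a)] = poisoned[(s, a)] + delta
    return poisoned

# Plant the "malicious cheese": the decoy corner now looks more rewarding than the exit.
R_poisoned = poison_reward_table(R, "decoy_corner", "enter", 2.0)
```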

C.3 POISON EXECUTOR

In the online RL process, the learner learns by taking actions and receiving feedback on those actions. If observation poisoning is viewed as changing the feedback, then executor poisoning can be viewed as perturbing the actions taken by the learner. For example, for an autonomous-driving agent, one can add a slight force to the steering wheel, so that when the agent takes "steer left 90 degrees", what actually happens is "steer left 100 degrees". In this way, the trained policy is biased. We now show that the aforementioned attack of changing a in O can be converted to executor poisoning. We describe the following two scenarios for these two poisoning attacks and show they have equal effects. For simplicity, we consider only one step of the interaction; the argument extends straightforwardly to multiple steps. (1) Poisoning a in O. For a one-step experience, the learner observes state s, takes action a_i, receives reward r_i = R(s, a_i), and observes a new state s_i ∼ P(·|s, a_i); the tuple (s, a_i, r_i, s_i) is then stored in the observation buffer. Now the attacker may change the action a_i to a_j, so the tuple becomes (s, a_j, r_i, s_i). As a result, the learner regards r_i, s_i and the following observations as caused by a_j instead of a_i. (2) Poisoning the Executor. In the same environment, suppose the learner observes state s and takes action a_j; the attacker then conducts executor poisoning and changes a_j to a_i, so the reward and next state become r_i = R(s, a_i) and s_i ∼ P(·|s, a_i). The learner, unaware of the falsified action, stores (s, a_j, r_i, s_i) in the buffer. It thus takes r_i, s_i and the future observations as caused by a_j and updates its policy accordingly. Based on these descriptions, we can see that changing a_i to a_j in the buffer is equivalent to changing a_j to a_i at the executor.
If we assume the policy always assigns non-zero probability to every action, then any manipulation of a in O can achieve the same effect through an attack on the executor. Hence, given that the former attack can be trivially defended against, we only discuss executor poisoning in our poisoning framework.
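The two-scenario equivalence above can be checked directly on a deterministic toy environment. The function names are hypothetical; the point is only that both attacks produce the same stored tuple, in which a_j is credited with the outcome of a_i:

```python
def env(s, a):
    """Deterministic toy environment: r = R(s, a), s' determined by (s, a)."""
    return 10 * a, s + a

# (1) Buffer poisoning: learner takes a_i, attacker rewrites the stored action to a_j.
def buffer_poisoning(s, a_i, a_j):
    r_i, s_i = env(s, a_i)
    return (s, a_j, r_i, s_i)      # learner credits a_j with a_i's outcome

# (2) Executor poisoning: learner intends a_j, attacker makes the executor apply a_i.
def executor_poisoning(s, a_j, a_i):
    r_i, s_i = env(s, a_i)
    return (s, a_j, r_i, s_i)      # learner again credits a_j with a_i's outcome

# Both attacks yield the identical buffered tuple.
assert buffer_poisoning(0, 1, 2) == executor_poisoning(0, 2, 1)
```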

D ROBUSTNESS AND STABILITY OF RL ALGORITHMS D.1 MEASURING VULNERABILITY OF RL ALGORITHMS

Definition 1 in Section 4 measures the stability of one update based on policy discrepancy, but it does not directly say whether the rewards change drastically under attack. Proposition 2 provides a guarantee on the performance of the poisoned policy π̌′ compared with the un-poisoned new policy π′.

Proposition 2. For an update of a stochastic policy π′ = f(π, O), if its δ-stability radius is ε under the total variation measure, then any poisoning effort smaller than ε on D causes the expected total reward drop η(π′) − η(π̌′) to be no more than

$$\frac{4\delta^2\gamma\,\max_{s,a}|A_{\pi'}(s,a)|}{(1-\gamma)^2} + 2\delta\sum_{s\in S} g_{\pi'}(s)\,\max_a |A_{\pi'}(s,a)| \qquad (2)$$

where γ is the discount factor, g_{π′}(s) := Σ_{t=0}^∞ γ^t P(s_t = s | π′) is the discounted visitation frequency, and A is the advantage function, i.e., A_π(s, a) = Q_π(s, a) − V_π(s). Proposition 2 shows that if the attack power is within the stability radius, the reward of the poisoned policy is not influenced much, which also explains the motivation behind our proposed vulnerability-aware attack. The proof is in Appendix D.4. With Definition 1, the one-update stability measure, we can formally define the stability radius of an RL algorithm w.r.t. an MDP.

Definition 3 (Stability Radius w.r.t. an MDP). The δ-stability radius of an algorithm f in an MDP M is defined as the minimal stability radius over all observations drawn from the MDP and all possible policies in the policy space Π. If f is on-policy, then φ(f, M) = min_{π∈Π, O∼π} φ(f, π, O); if f is off-policy, then φ(f, M) = min_{π̄∈Π, O∼π̄} φ(f, π, O), where π̄ is the behavior policy.
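The bound in Proposition 2 can be evaluated numerically once δ, γ, and the advantage magnitudes are known. The sketch below plugs illustrative numbers into equation (2); the helper names and the example values are assumptions for illustration only:

```python
def total_variation(p, q):
    """D_TV between two discrete action distributions given as probability lists."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def prop2_bound(delta, gamma, max_adv, weighted_adv_sum):
    """Numeric form of equation (2): an upper bound on eta(pi') - eta(pi_check')
    when the poisoning stays within the delta-stability radius.
    weighted_adv_sum stands in for sum_s g(s) * max_a |A(s, a)|."""
    return 4 * delta**2 * gamma * max_adv / (1 - gamma) ** 2 + 2 * delta * weighted_adv_sum

d = total_variation([0.6, 0.4], [0.5, 0.5])                      # = 0.1
bound = prop2_bound(delta=0.1, gamma=0.9, max_adv=1.0, weighted_adv_sum=1.0)
```

For these toy numbers the first (quadratic-in-δ) term contributes 3.6 and the second 0.2, illustrating that for small δ and large γ the (1 − γ)^{-2} term dominates the bound.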

D.2 ROBUSTNESS RADIUS

Different from poisoning, test-time evasion (adversarial examples) misleads the agent by manipulating the states only, since the agent no longer learns from interactions and feedback. Note that although Gleave et al. (2019) propose an attack called an "adversarial policy", the perturbation does not happen in the policy but still happens in the input states (observations) of the agent. To study how robust a trained policy is, we define the robustness radius with regard to both a single state and the whole environment.

Definition 4 (Robustness Radius of a Policy w.r.t. a State). The robustness radius of a deterministic policy π at a state s is defined as the minimal perturbation of s that changes the output action, i.e.,

$$\rho(\pi, s) = \inf_{\epsilon} \{\epsilon : \exists\, \check{s} \in S \cap B_\epsilon(s) \text{ s.t. } \pi(s) \neq \pi(\check{s})\}$$

Similarly, for any 0 < δ < 1, the δ-robustness radius of a stochastic policy π at a state s is defined as the minimal perturbation of s that makes the output action distribution disagree with the original π(·|s) by more than δ, i.e.,

$$\rho_\delta(\pi, s) = \inf_{\epsilon} \{\epsilon : \exists\, \check{s} \in S \cap B_\epsilon(s) \text{ s.t. } d(\pi(\cdot|s)\,\|\,\pi(\cdot|\check{s})) > \delta\}$$

where d(·||·) can be any distance measure between two distributions. Remark. If we regard the policy as a classifier that "classifies" a state to an action, then the robustness radius of a policy defined above is analogous to the robustness radius of classifiers defined by Wang et al. (2017), extended to stochastic predictions.

Definition 5 (Robustness Radius w.r.t. an MDP). The (δ-)robustness radius of a policy π in an MDP M is defined as the minimal robustness radius over all states, i.e., ρ(π, M) = min_{s∈S} ρ(π, s).

Remarks. (1) A deterministic policy is robust against any state perturbation smaller than ρ(π, M). (2) For a stochastic policy, if its δ-robustness radius in M is ε, then any state perturbation within ε causes the expected total reward to drop by no more than (2δγ/(1 − γ)^2 + 2δ) max_{s,a} |R(s, a)|, as proven by Theorem 5 in Zhang et al. (2020a).
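For a one-dimensional state space, the robustness radius of Definition 4 can be estimated by bisection over the perturbation magnitude. This is a simplified sketch under strong assumptions (1-D state, deterministic threshold policy, only the two signed perturbation directions probed); the names are illustrative:

```python
def robustness_radius(policy, s, eps_max=1.0, tol=1e-4):
    """Bisection estimate of rho(pi, s) for a deterministic policy on a 1-D state:
    the smallest eps for which some perturbation within [-eps, eps] flips the action."""
    def changed(eps):
        return policy(s + eps) != policy(s) or policy(s - eps) != policy(s)
    if not changed(eps_max):
        return eps_max                   # robust within the whole probed range
    lo, hi = 0.0, eps_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if changed(mid):
            hi = mid
        else:
            lo = mid
    return hi

threshold_policy = lambda s: int(s > 0.5)      # "classifier" view of a policy
radius = robustness_radius(threshold_policy, s=0.2)   # true radius is 0.3
```

In higher dimensions the inner `changed` test would need an adversarial search over the ε-ball rather than two signed probes, which is exactly why certifying robustness radii is hard in general.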

D.3 VULNERABILITY COMPARISON: DIFFERENCE BETWEEN SL AND RL

To shed some light on understanding adversarial attacks in RL, we compare SL and RL in terms of their vulnerability to poisoning and adversarial examples. At test time, a policy network receives states as input and returns the probabilities of choosing each action as output; a value network receives states (or state-action pairs) as input and returns the corresponding value as output. Thus, test-time RL systems are very similar to SL systems, as one can view policy networks as classification networks and value networks as regression networks. However, the key difference between evasion in RL and evasion in SL is that data samples are not independent in RL. A single adversarial example in an SL test dataset may cause at most one misclassification, whereas an adversarial example in RL may cause a drastic change in the gained rewards (e.g., by leading the agent to a "devastating" or "absorbing" state). At training time, SL systems and RL systems are significantly different, as Figure 5 shows. Even when a supervised learner learns from data streams in an online manner, the training data are independent of the learner's classifier. In contrast, in RL the distribution of training samples changes as the learner updates its policy. Poisoning attacks against an SL system can alter the decision boundary so that the learner makes wrong decisions for certain data samples. For an RL system, poisoning attacks can (1) alter the decision boundary so that the learner chooses bad actions in certain states, and also (2) change the subsequent observations and interactions due to a different selection of actions. In summary, an adversarial attacker may cause more damage to RL systems than to SL systems with the same power and budget. But this does not suggest that attacking RL systems is easier than attacking SL systems.
As every coin has two sides, the high uncertainty of the environment may help an attacker reduce the learner's reward, but may also lead the learner to gain higher reward in the future (as shown in Section 5). Therefore, it is more challenging to attack RL systems than SL systems successfully with a specific goal.

D.4 PROOF OF PROPOSITION 2

Proof. According to the definition of the δ-stability radius, for any poisoning effort within ε, the poisoned policy satisfies D^max_TV[π′ || π̌′] ≤ δ (taking total variation D_TV as the distance measure between policy distributions). We are interested in the difference of expected rewards η(π′) − η(π̌′). Define

$$L_{\pi'}(\check{\pi}') = \eta(\pi') + \sum_{s\in S} g_{\pi'}(s) \sum_{a\in A} \check{\pi}'(a|s)\, A_{\pi'}(s, a),$$

where g is the discounted state visitation frequency, i.e., g_{π′}(s) := P(s_0 = s | π′) + γP(s_1 = s | π′) + γ²P(s_2 = s | π′) + ⋯. Since D^max_TV[π′ || π̌′] ≤ δ, following Theorem 1 of Schulman et al. (2015a), we get

$$|\eta(\check{\pi}') - L_{\pi'}(\check{\pi}')| \le \frac{4\delta^2\gamma\,\max_{s,a}|A_{\pi'}(s,a)|}{(1-\gamma)^2}.$$

So we have

$$\Big|\eta(\check{\pi}') - \eta(\pi') - \sum_{s\in S} g_{\pi'}(s)\sum_{a\in A}\check{\pi}'(a|s)A_{\pi'}(s,a)\Big| \le \frac{4\delta^2\gamma\,\max_{s,a}|A_{\pi'}(s,a)|}{(1-\gamma)^2},$$

which can be transformed into

$$\eta(\pi') - \eta(\check{\pi}') \le \frac{4\delta^2\gamma\,\max_{s,a}|A_{\pi'}(s,a)|}{(1-\gamma)^2} - \sum_{s\in S} g_{\pi'}(s)\sum_{a\in A}\check{\pi}'(a|s)A_{\pi'}(s,a).$$

We upper bound the term −Σ_{s∈S} g_{π′}(s) Σ_{a∈A} π̌′(a|s) A_{π′}(s, a) as follows. Since Σ_a π′(a|s) A_{π′}(s, a) = 0 for every s, we have

$$-\sum_a \check{\pi}'(a|s)A_{\pi'}(s,a) = -\sum_a \big(\check{\pi}'(a|s)-\pi'(a|s)\big)A_{\pi'}(s,a) \le 2\, D_{TV}\big(\pi'(\cdot|s)\,\|\,\check{\pi}'(\cdot|s)\big)\max_a |A_{\pi'}(s,a)| \le 2\delta \max_a |A_{\pi'}(s,a)|.$$

Combining the above results, we obtain

$$\eta(\pi') - \eta(\check{\pi}') \le \frac{4\delta^2\gamma\,\max_{s,a}|A_{\pi'}(s,a)|}{(1-\gamma)^2} + 2\delta \sum_{s\in S} g_{\pi'}(s)\max_a |A_{\pi'}(s,a)|.$$

E ALGORITHM

Assume the learner's policy π is parametrized by θ.

How to solve the projected gradient descent. To solve (P), assuming ∂θ_k/∂ř exists, one can use projected gradient descent to update ř via the chain rule:

$$\frac{\partial \eta_{\pi_{\theta_{k-1}}}(\pi_{\theta_k})}{\partial \check{r}} = \frac{\partial \eta_{\pi_{\theta_{k-1}}}(\pi_{\theta_k})}{\partial \theta_k}\,\frac{\partial \theta_k}{\partial \check{r}}.$$

For Vanilla Policy Gradient (VPG), whose update rule is θ_k = θ_{k-1} + α∇_{θ_{k-1}} η(π_{θ_{k-1}}, r), where

$$\nabla_{\theta_{k-1}} \eta(\pi_{\theta_{k-1}}, r) = \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\nabla_{\theta_{k-1}}\log \pi_{\theta_{k-1}}(a^{(i)}_t|s^{(i)}_t)\right)\left(\sum_{t=1}^{T} r^{(i)}_t\right),$$

we can derive

$$\nabla_{\theta}\,\eta_{\pi_{\theta_{k-1}}}(\pi_{\theta_k}) \approx \frac{1}{N}\sum_{i=1}^{N}\prod_{t=1}^{T}\frac{\pi_{\theta_k}(a^{(i)}_t|s^{(i)}_t)}{\pi_{\theta_{k-1}}(a^{(i)}_t|s^{(i)}_t)}\left(\sum_{t=1}^{T}\nabla_{\theta_k}\log\pi_{\theta_k}(a^{(i)}_t|s^{(i)}_t)\right)\left(\sum_{t=1}^{T}\gamma^{t-1} r^{(i)}_t\right) \qquad (12)$$

and

$$[\nabla_{\check{r}}\,\theta_k]_t = \sum_{j=1}^{t}\nabla_{\theta_{k-1}}\log\pi_{\theta_{k-1}}(a_j|s_j)\,\gamma^{t-j}.$$

Although ∂θ_k/∂ř has a closed-form expression for simple learners like VPG, analytically computing how the poisoned reward influences the model is challenging for more complicated learners like PPO, whose update rule is an argmax function. Therefore, we use the Direct Gradient Method proposed by Yang et al. (2017) to approximate the gradient by

$$\frac{\partial \eta_{\pi_{\theta_{k-1}}}(\pi_{\theta_k})}{\partial \check{r}} \approx \frac{\eta_{\pi_{\theta_{k-1}}}\big(f(\pi_{\theta_{k-1}}, \check{r} + \Delta)\big) - \eta_{\pi_{\theta_{k-1}}}\big(f(\pi_{\theta_{k-1}}, \check{r})\big)}{\Delta}.$$

To show the concrete poisoning process, we assume D = O_r. Algorithm 3 then gives the detailed procedure of VA2C-P with a non-targeted goal and a white-box attacker. Its main steps are: update the adversarial critic Ṽ_ω by minimizing Σ_{i,t} (Ṽ_ω(s^{(i)}_t) − Σ_{t′=t}^{T} γ^{t′−t} r^{(i)}_{t′})²; imitate the learner's update with the clean rewards, (θ′_k, η_c) ← Update(θ_{k-1}, r); initialize ř as the original r in O_k and set η_0 = η_c; for j = 1, …, J and each i = 1, …, N, t = 1, …, T, copy r′ ← ř, add a small value ∆ to [r′]^{(i)}_t, and re-run the imitated update to estimate ∂η/∂ř^{(i)}_t by finite differences, breaking early once the objective no longer improves; finally, compute ψ_k = (1/NT) Σ_{s^{(i)}_t} d(π_{θ′_k} || π_{θ̌_k}), add ψ_k to Ψ, and send the poisoned observation only if ψ_k is larger than the (C − c)/(K − k)-th largest element of Ψ. Here the Update subroutine performs the learner's policy update using importance ratios π_{θ′}(a^{(i)}_t|s^{(i)}_t)/π_θ(a^{(i)}_t|s^{(i)}_t) and advantage estimates Σ_{t′=t}^{T} γ^{t′−t} r^{(i)}_{t′} − V_ω(s^{(i)}_t), and returns (θ′, η). For Targeted Poisoning: in line 16, instead of computing ∂η/∂ř^{(i)}_t, we compute ∂ dist(θ′, θ†)/∂ř^{(i)}_t. For Black-box Poisoning:
In line 7, instead of reading the learner's policy model θ_{k-1}, the attacker trains a policy with the same algorithm as the learner: θ̃_{k-1} ← f(θ̃_{k-2}, O_{k-1}).
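The Direct Gradient approximation above is just a finite-difference estimate of ∂η/∂ř, obtained by perturbing one reward entry at a time and re-running the (imitated) update. The sketch below uses a toy stand-in for η(f(π, ř)) under assumed names, since the real quantity requires re-running the learner's update:

```python
def direct_gradient(eta_of_update, r, delta=1e-3):
    """Finite-difference (Direct Gradient) estimate of d eta / d r_t when the
    learner's update f has no usable closed form: perturb one reward entry at a
    time and re-evaluate the imitated update."""
    base = eta_of_update(r)
    grad = []
    for t in range(len(r)):
        r_pert = list(r)
        r_pert[t] += delta
        grad.append((eta_of_update(r_pert) - base) / delta)
    return grad

# Toy stand-in for eta(f(pi, r)): the discounted return of r itself.
eta = lambda r: sum(0.9 ** t * rt for t, rt in enumerate(r))
g = direct_gradient(eta, [1.0, 0.0, 2.0])     # approx [1, 0.9, 0.81]
```

Each coordinate costs one extra imitated update, which is the main source of the computational burden acknowledged in the conclusion.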

F THEORETICAL INTERPRETATION OF THE BI-LEVEL OPTIMIZATION PROBLEM

In this section, we discuss the problem relaxation made in Section 4.2. For notation simplicity, we focus on the case D = O r .

F.1 PROBLEM FORMS

Suppose the attacker has budget C = K. Then, following the format of Problem (Q), we define the original O_r-poisoning optimization problem for poisoning at the k-th iteration as Problem (P*). For notational simplicity, we use the policy parameter θ to denote the policy π_θ, so that η(θ) := η(π_θ).

$$\operatorname*{arg\,min}_{\check{r}_k,\dots,\check{r}_K}\ \sum_{j=k}^{K} \eta(\theta_j) \quad \text{s.t. } \theta_j = f(\theta_{j-1}, \check{r}_j),\ \ \|\check{r}_j - r_j\| \le \epsilon,\ \ \forall j = k,\dots,K \qquad (P^*)$$

Note that we only optimize over ř_k to ř_K, since the previous k − 1 decisions have already been made by the k-th iteration. Although an omniscient attacker is able to predict all the observations and solve Problem (P*) directly, for the non-omniscient attacker that we focus on, Problem (P*) is not solvable because the observations in iterations k + 1 to K are unknown. Hence, as discussed in Section 4, we relax (P*), a multi-variable optimization problem, into (K − k + 1) sequential single-variable optimization problems:

$$\operatorname*{arg\,min}_{\check{r}_k}\ \eta(\theta_k) \quad \text{s.t. } \theta_k = f(\theta_{k-1}, \check{r}_k),\ \ \|\check{r}_k - r_k\| \le \epsilon \qquad (P_k)$$

$$\operatorname*{arg\,min}_{\check{r}_{k+1}}\ \eta(\theta_{k+1}) \quad \text{s.t. } \theta_{k+1} = f(\theta_k, \check{r}_{k+1}),\ \ \|\check{r}_{k+1} - r_{k+1}\| \le \epsilon \qquad (P_{k+1})$$

and so on, up to

$$\operatorname*{arg\,min}_{\check{r}_K}\ \eta(\theta_K) \quad \text{s.t. } \theta_K = f(\theta_{K-1}, \check{r}_K),\ \ \|\check{r}_K - r_K\| \le \epsilon \qquad (P_K)$$

where r_j ∼ θ_{j-1} for all j = k, …, K. Note that r_k has already been generated and is known for all the above problems. When solving Problem (P_j), r_j and θ_{j-1} are also known (r_j is determined by θ_{j-1}).
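The sequential relaxation can be pictured as a greedy loop: solve (P_j) with the observations available at iteration j, commit the poison, and move on. The sketch below solves each single-variable problem by grid search over bounded perturbations; this is a deliberately simplified stand-in (VA2C-P uses projected gradient steps instead), and all names are hypothetical:

```python
def sequential_poisoning(f, eta, theta0, rewards, eps, candidates=(-1.0, 0.0, 1.0)):
    """Greedy relaxation of (P*): at each iteration j, solve the single-variable
    problem (P_j) by grid search over perturbations within [-eps, eps], then commit
    the choice before seeing the next iteration's reward."""
    theta = theta0
    poisoned = []
    for r in rewards:
        best = min((r + eps * c for c in candidates),
                   key=lambda rr: eta(f(theta, rr)))   # minimize eta(theta_j)
        theta = f(theta, best)                         # commit: theta_j = f(theta_{j-1}, r_check_j)
        poisoned.append(best)
    return poisoned, theta

f = lambda theta, r: theta + r       # toy scalar "update rule"
eta = lambda theta: theta            # toy value: lower theta = worse learner
poisoned, theta_K = sequential_poisoning(f, eta, 0.0, [1.0, 1.0], eps=0.5)
```

With this toy linear update, the greedy solution simply pushes every reward to its lower boundary, which here coincides with the jointly optimal attack; Section F.2 discusses when the relaxation is and is not optimal in general.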

F.2 TIGHTNESS OF PROBLEM RELAXATION

Problem (P*) is a (K − k + 1)-variable optimization problem with (K − k + 1) equality constraints and (K − k + 1) inequality constraints, while Problems (P_k), …, (P_K) are (K − k + 1) single-variable optimization problems, each with one equality constraint and one inequality constraint. The two sets of problems would be trivially equivalent if θ_k, …, θ_K and the constraints were independent. However, due to the online learning process (Figure 6), θ_{k+1} and r_{k+1} both depend on θ_k and ř_k, which makes the relaxation not necessarily optimal. We call the optimal solution to Problem (P*) (ř*_k, …, ř*_K), and the optimal solutions to Problems (P_k) through (P_K) ř#_k, …, ř#_K. For simplicity, we assume the environment and the policies are all deterministic, so that the observed reward r_j is a deterministic function of θ_{j-1}. We claim that the relaxation does not change feasibility, as stated in Proposition 6.

Proposition 6. (ř#_k, …, ř#_K) is a feasible solution to Problem (P*).

Proof. Note that r_k is known and the same for both (P*) and (P_k). (1) ř#_k is feasible for (P*) because it satisfies ||ř#_k − r_k|| ≤ ε. (2) For Problem (P*), with θ_k = f(θ_{k-1}, ř#_k), r_{k+1} is the same as the r_{k+1} in Problem (P_{k+1}). Since the optimal solution to (P_{k+1}), ř#_{k+1}, satisfies ||ř#_{k+1} − r_{k+1}|| ≤ ε, ř#_{k+1} is also feasible for (P*). (3) By induction, {ř#_j}_{j=k}^{K} satisfy all the constraints in (P*) simultaneously, so (ř#_k, …, ř#_K) is a feasible solution to Problem (P*). ∎

Note that if the environment or the policy is stochastic, then r_j is a random variable sampled from some distribution defined by θ_{j-1}. In this case, the constraints for Problem (P*) should be Pr(||ř_j − r_j|| ≤ ε) ≥ t for all j = k, …, K, where t ∈ [0, 1] is a threshold probability.
Then, with an appropriate t, Proposition 6 also holds. So far we have shown that the relaxation is feasible, so the attacker will not plan an infeasible poisoning with VA2C-P. Next, we discuss the optimality of the relaxation, assuming f and η are differentiable. A necessary condition for the sequential solutions to be jointly optimal is that, for all j = k, …, K,

$$\frac{\partial \eta(\theta_j)}{\partial \theta_j}\frac{\partial \theta_j}{\partial \check{r}_j}\Big|_{\check{r}_j = \check{r}^{\#}_j} = \sum_{j'=j+1}^{K} \frac{\partial \eta(\theta_{j'})}{\partial \theta_{j'}}\frac{\partial \theta_{j'}}{\partial \theta_{j'-1}}\cdots\frac{\partial \theta_{j+1}}{\partial \theta_j}\frac{\partial \theta_j}{\partial \check{r}_j}\Big|_{\check{r}_j = \check{r}^{\#}_j}$$

Proof. Consider the case K = 2 for simplicity; the result extends naturally to larger K. At iteration 1, Problem (P*) becomes

$$\operatorname*{arg\,min}_{\check{r}_1, \check{r}_2}\ \eta(\theta_1) + \eta(\theta_2) \quad \text{s.t. } \theta_1 = f(\theta_0, \check{r}_1),\ \theta_2 = f(\theta_1, \check{r}_2),\ \|\check{r}_1 - r_1\| \le \epsilon,\ \|\check{r}_2 - r_2\| \le \epsilon \qquad (P0)$$

where θ_0 and r_1 are known. The relaxed problems are

$$\operatorname*{arg\,min}_{\check{r}_1}\ \eta(\theta_1) \quad \text{s.t. } \theta_1 = f(\theta_0, \check{r}_1),\ \|\check{r}_1 - r_1\| \le \epsilon \qquad (P1)$$

where θ_0 and r_1 are known, and

$$\operatorname*{arg\,min}_{\check{r}_2}\ \eta(\theta_2) \quad \text{s.t. } \theta_2 = f(\theta_1, \check{r}_2),\ \|\check{r}_2 - r_2\| \le \epsilon \qquad (P2)$$

where θ_1 and r_2 are determined by ř_1, the solution to (P1). Suppose ř#_1 and ř#_2 are the optimal solutions to (P1) and (P2); we discuss the necessary condition for ř#_1 and ř#_2 to be optimal for (P0). Introducing a slack variable y_1, we can rewrite (P1) as

$$\operatorname*{arg\,min}_{\check{r}_1}\ \eta(f(\theta_0, \check{r}_1)) \quad \text{s.t. } \|\check{r}_1 - r_1\|^2 - \epsilon^2 + y_1^2 = 0,$$

whose Lagrangian is

$$L(\check{r}_1, \lambda_1, y_1) = \eta(f(\theta_0, \check{r}_1)) + \lambda_1\big(\|\check{r}_1 - r_1\|^2 - \epsilon^2 + y_1^2\big).$$

The necessary conditions for ř_1 to be optimal are

$$\frac{\partial \eta(f(\theta_0, \check{r}_1))}{\partial \check{r}_1} + \frac{\partial\, \lambda_1\big(\|\check{r}_1 - r_1\|^2 - \epsilon^2 + y_1^2\big)}{\partial \check{r}_1} = 0 \qquad (17)$$
$$\|\check{r}_1 - r_1\|^2 - \epsilon^2 + y_1^2 = 0 \qquad (18)$$
$$\lambda_1 y_1 = 0 \qquad (19)$$

for some λ_1 and y_1.
Similarly, for Problem (P0) the Lagrangian is

$$L(\check{r}_1, \check{r}_2, \lambda'_1, \lambda'_2, y'_1, y'_2) = \eta(f(\theta_0, \check{r}_1)) + \lambda'_1\big(\|\check{r}_1 - r_1\|^2 - \epsilon^2 + (y'_1)^2\big) + \eta(f(\theta_1, \check{r}_2)) + \lambda'_2\big(\|\check{r}_2 - r_2\|^2 - \epsilon^2 + (y'_2)^2\big) \qquad (25)$$

and the necessary conditions for (ř_1, ř_2) to be optimal are

$$\frac{\partial \eta(f(\theta_0, \check{r}_1))}{\partial \check{r}_1} + \frac{\partial \eta(f(\theta_1, \check{r}_2))}{\partial \check{r}_1} + \frac{\partial\, \lambda'_1\big(\|\check{r}_1 - r_1\|^2 - \epsilon^2 + (y'_1)^2\big)}{\partial \check{r}_1} + \frac{\partial\, \lambda'_2\big(\|\check{r}_2 - r_2\|^2 - \epsilon^2 + (y'_2)^2\big)}{\partial \check{r}_1} = 0 \qquad (26)$$
$$\frac{\partial \eta(f(\theta_1, \check{r}_2))}{\partial \check{r}_2} + \frac{\partial\, \lambda'_2\big(\|\check{r}_2 - r_2\|^2 - \epsilon^2 + (y'_2)^2\big)}{\partial \check{r}_2} = 0 \qquad (27)$$
$$\|\check{r}_1 - r_1\|^2 - \epsilon^2 + (y'_1)^2 = 0 \qquad (28)$$
$$\|\check{r}_2 - r_2\|^2 - \epsilon^2 + (y'_2)^2 = 0 \qquad (29)$$
$$\lambda'_1 y'_1 = 0 \qquad (30)$$
$$\lambda'_2 y'_2 = 0 \qquad (31)$$

for some λ'_1, λ'_2, y'_1, y'_2 (not necessarily equal to λ_1, λ_2, y_1, y_2). Expanding these conditions shows that, for some scalar ζ, the gradient of η(θ_2) w.r.t. ř_1 must equal ζ times the gradient of η(θ_1) w.r.t. ř_1 at ř#_1 (equation 34), which is the necessary condition for ř#_1 and ř#_2 to be optimal for Problem (P0). Intuitively, this condition requires the gradient of η(θ_1) w.r.t. ř#_1 (the RHS) to be aligned with the gradient of η(θ_2) w.r.t. ř#_1 (the LHS), without considering the influence of ř_1 on ř_2. That is, although ř_1 influences θ_1, and θ_1 influences both θ_2 and ř_2 (because ř_2 is generated by π_{θ_1}), the LHS does not include ∂η/∂θ_2 · ∂θ_2/∂ř_2 · ∂ř_2/∂θ_1 · ∂θ_1/∂ř_1, which makes equation 34 computable in many cases. However, in the setting of our VA2C-P (a monitoring attacker for online RL), the attacker at iteration k = 1 does not know the observed reward r_2 of iteration k = 2, which prevents it from conducting the optimal attack. Still, equation 34 can help the attacker verify whether a past poison is likely to have been optimal for the iterations so far. For example, if the learner uses VPG, then at iteration k = 2 the attacker can test whether its previous poison ř_1 did a good job of minimizing η(θ_1) + η(θ_2) by evaluating whether (I + ∇²_{θ_1}η(θ_1))∇_{θ_2}η(θ_2) equals ∇_{θ_1}η(θ_1).

G EXPERIMENT SETTINGS AND ADDITIONAL RESULTS

G.1 DETAILED EXPERIMENT SETTINGS

Network Architecture. For all learners, we use a two-layer policy network with Tanh as the activation function, where each layer has 64 nodes. PPO, A2C and ACKTR also have an additional critic network of the same size. We implement VPG and PPO with PyTorch, and the implementations of A2C and ACKTR are modified from the project by Kostrikov (2018).

Hyper-parameters. In all experiments, the discount factor γ is set to 0.99. We run VPG and PPO for 1000 episodes on every environment and update the policy after every episode. For A2C and ACKTR, we use 16 processes to collect observations simultaneously and update the policy every 5 steps (i.e., each observation O has 80 (s, a, r) tuples); learning lasts for 80000 steps in total.

Distance Measure for Perturbation. The definition of the total effort function U(·) plays an important role in understanding the attack power. Since states, actions and rewards have different forms and scales, we define U differently for the various poison aims. Also, since O is a concatenation of state-action-reward tuples whose length can vary across iterations, we normalize by the length of the observation. For D = O_s, we define the total effort as U(O_s, Ǒ_s) = (1/|O_s|) Σ_{s∈O_s} ||s − š||_2. For D = O_a with a continuous action space, we define the total effort as U(O_a, Ǒ_a) = (1/|O_a|) Σ_{a∈O_a} ||a − ǎ||_2; with a discrete action space, we define it as U(O_a, Ǒ_a) = (1/|O_a|) Σ_{a∈O_a} 1(a ≠ ǎ). For example, in CartPole the action is either 0 or 1, so an attacker with ε = 0.1 can flip up to 10% of the actions in one iteration. For D = O_r, we define the total effort as U(O_r, Ǒ_r) = (1/|O_r|) ||r − ř||.
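These effort measures translate directly into code. A minimal sketch (function names are illustrative; the reward measure assumes the Euclidean norm, which the text leaves unspecified):

```python
import math

def effort_states(S, S_check):
    """U for O_s: mean L2 distance between clean and poisoned states."""
    return sum(math.dist(s, sc) for s, sc in zip(S, S_check)) / len(S)

def effort_discrete_actions(A, A_check):
    """U for discrete O_a: fraction of flipped actions, 1(a != a_check) averaged."""
    return sum(a != ac for a, ac in zip(A, A_check)) / len(A)

def effort_rewards(R, R_check):
    """U for O_r: norm of the reward perturbation, normalized by |O_r|
    (Euclidean norm assumed here)."""
    return math.sqrt(sum((r - rc) ** 2 for r, rc in zip(R, R_check))) / len(R)

# e.g., flipping 1 of 4 CartPole actions gives an effort of 0.25
flip_effort = effort_discrete_actions([0, 1, 0, 1], [0, 1, 1, 1])
```

An attacker with power ε would project its candidate perturbation back onto the set {Ǒ : U(O, Ǒ) ≤ ε} after each gradient step.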
In the supplementary materials, we provide the code and instructions, as well as demo videos of poisoning A2C in the Hopper environment, where one can see that under the same budget constraints, random poisoning has nearly no influence on the agent's behaviors, while our proposed VA2C-P successfully prevents the agent from hopping forward.

G.2 FGSM-BASED POISONING

The procedure of the FGSM-based targeted poisoning is as follows. We transfer the method proposed by Behzadan & Munir (2017) from attacking DQN to attacking policy-based algorithms. Although Behzadan & Munir (2017) assume a black-box setting, we use a white-box attack here (the attacker knows the learner's policy parameters) in order to make FGSM a stronger baseline. For step t:
Step 1: the learner observes s_t and takes action a_t; the environment returns reward r_t and state s_{t+1};
Step 2: the attacker queries the target policy and gets a_adv = π†(s_{t+1});
Step 3: the attacker poisons s_{t+1} by š_{t+1} = s_{t+1} + ε × sign(∇_{s_{t+1}} π(a_adv | s_{t+1}));
Step 4: the attacker sends š_{t+1} to the learner as s_{t+1}.
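Step 3 can be sketched in a self-contained way. The baseline computes ∇_{s} π(a_adv | s) by backpropagation through the policy network; to stay dependency-free, this sketch estimates that gradient by central differences on a toy logistic "policy" (all names and the toy policy are assumptions for illustration):

```python
import math

def fgsm_targeted(prob_target, s, eps, h=1e-5):
    """FGSM-style targeted perturbation: move the state one signed-gradient step
    toward raising pi(a_adv | s). The gradient w.r.t. each state coordinate is
    estimated here by central differences (a backprop stand-in)."""
    def grad(i):
        up, dn = list(s), list(s)
        up[i] += h
        dn[i] -= h
        return (prob_target(up) - prob_target(dn)) / (2 * h)
    sign = lambda x: (x > 0) - (x < 0)
    return [si + eps * sign(grad(i)) for i, si in enumerate(s)]

# Toy policy: probability of the target action grows with s[0] - s[1].
prob = lambda s: 1 / (1 + math.exp(-(s[0] - s[1])))
s_poisoned = fgsm_targeted(prob, [0.0, 0.0], eps=0.1)
```

Because the step only nudges π(a_adv | s) upward, a stochastic policy may still sample other actions with substantial probability, which is exactly the failure mode discussed in the main text.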

G.3 ADDITIONAL EXPERIMENT RESULTS

Figure 8 shows additional results on various environments and RL algorithms, including both non-targeted poisoning and targeted poisoning.



CONCLUSION AND DISCUSSION

In this paper, we propose VA2C-P, the first generic poisoning algorithm for deep policy-gradient online RL methods, which incorporates heterogeneous poisoning models and does not require any prior knowledge of the environment. Although the effectiveness of VA2C-P is verified on a wide range of RL algorithms and environments, we acknowledge that poisoning RL in practice is still challenging, due to the relatively high computational burden and the uncertainty of the environments. These challenges are also exciting opportunities for future work.

Footnotes:
- In this paper, we assume the attacker poisons the observations (trajectories), which is the most universal setting in practice. Appendix B extends our problem formulation to a more general case where the attacker can change the underlying MDP.
- There are many choices of U(·, ·). For example, the total effort w.r.t. O_s-poisoning can be the average ℓ_p-distance between the poisoned and unpoisoned states in O_s and Ǒ_s.
- In order to characterize the stability and robustness of RL algorithms in a principled way, we provide more measures of the vulnerability of RL at both training time and test time; see Appendix D.2 for details.
- For algorithms with experience replay, the update can be extended as π_{k+1} = f(π_k, O_{1:k}).
- There are many ways to define the distance between two policies, for instance KL-divergence for stochastic policies (Schulman et al., 2015a), and average mismatch for deterministic policies (Rakhsha et al., 2020).



Figure 1: An example of difficult poisoning.

Figure 2: Online poisoning-learning process.

Algorithm 1: Vulnerability-Aware Adversarial Critic Poison (VA2C-P)
Obtain the observations (trajectories) O_k collected by the learner
Update the adversarial critic Ṽ_ω with observation O_k
if White-box then
    Copy the learner's policy parameters to the attacker's imitating policy: π̃ ← π_k
Compute the clean next-policy π ← f(π̃, O_k)
Solve Problem (P) with power ε and critic Ṽ_ω; poison O_k into Ǒ_k
Compute the poisoned next-policy π̌ ← f(π̃, Ǒ_k)
Estimate the policy discrepancy ψ_k between π and π̌
Add ψ_k to the list of policy discrepancies Ψ
if ψ_k ≥ the (C−c)/(K−k)-th largest element in Ψ then
    Send the learner the poisoned observation Ǒ_k
    if Black-box then
        Update the imitating policy as the poisoned next-policy: π̃ ← π̌
else
    Send the clean observation O_k
    if Black-box then
        Update the imitating policy as the clean next-policy: π̃ ← π

Algorithm 1 illustrates how an attacker can poison an online RL learner with our proposed VA2C-P, corresponding to Line 4 in Procedure 2 shown in Appendix B. More implementation details are in Algorithm 3 in Appendix E.
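The budget-scheduling test in Algorithm 1 can be sketched as follows. One reading of the "(C−c)/(K−k)-th largest" rule treats ρ = (C−c)/(K−k) as the fraction of observed discrepancies that should trigger an attack, so the remaining budget is spent at the required rate; the function name, the ceiling-based rank, and the greedy fallback for an empty history are our own choices.

```python
import math

def should_attack(psi_k, past_psis, C, c, K, k):
    """Decide whether to spend one unit of the remaining budget C - c.
    Attack when psi_k falls within the top rho fraction of observed
    policy discrepancies, where rho = (C - c) / (K - k)."""
    if c >= C:
        return False          # budget exhausted
    if not past_psis:
        return True           # no history yet; attack greedily
    rho = (C - c) / (K - k)
    # rank of the threshold element (1-indexed, clamped to the list size)
    rank = max(1, min(len(past_psis), math.ceil(rho * len(past_psis))))
    threshold = sorted(past_psis, reverse=True)[rank - 1]
    return psi_k >= threshold

# Example: 4 poisonings left over 10 remaining iterations -> rho = 0.4,
# so only discrepancies in roughly the top 40% trigger an attack.
history = [0.1, 0.5, 0.3, 0.9, 0.2]
attack_high = should_attack(0.6, history, C=5, c=1, K=20, k=10)   # above threshold
attack_low = should_attack(0.15, history, C=5, c=1, K=20, k=10)   # below threshold
```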

Figure 3: Comparison of mean per-episode reward gained by VPG, PPO, A2C, ACKTR on various environments, under no poisoning, random poisoning, AC-P and VA2C-P.

Figure 4: (a) VA2C-P also works under black-box setting; (b) hybrid-aim poisoning could be better than single-aim poisoning; (c)(d) VA2C-P successfully forces the agent to choose the target policy.

(…; Inkawhich et al., 2020). Adversarial attacks can also be utilized to train robust policies (Pinto et al., 2017; Pattanaik et al., 2018). There is also a line of work focusing on adversarial examples in path-finding problems (Xiang et al., 2018; Bai et al., 2018). These evasion attacks are created to fool a well-trained policy, but they cannot change the policy itself at training time as poisoning does.

Published as a conference paper at ICLR 2021

Procedure 2: Flow of Online RL Poisoning
Learner initializes its initial policy π_θ0
for k = 1, …, K do
    Learner rolls out observations (trajectories) O_k in environment M with current policy π_k
    Attacker may poison the observation O_k into Ǒ_k
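The flow of Procedure 2 can be sketched as the following training-loop skeleton, with the learner's rollout, the learner's update f, and the attacker's poisoning left as pluggable callables; all names and the toy instantiation below are illustrative, not the paper's implementation.

```python
def online_rl_with_poisoning(rollout, update, poison, K, policy):
    """Skeleton of Procedure 2: each iteration, the learner rolls out
    observations with its current policy, the attacker may poison them,
    and the learner updates on whatever observations it receives."""
    for k in range(1, K + 1):
        O_k = rollout(policy)          # learner collects trajectories
        O_k = poison(O_k, k)           # attacker may return O_k unchanged
        policy = update(policy, O_k)   # policy update pi_{k+1} = f(pi_k, O_k)
    return policy

# Toy instantiation: "policy" is a scalar, rollout echoes it shifted by 1,
# the update averages, and the attacker flips the sign of every observation.
final = online_rl_with_poisoning(
    rollout=lambda p: p + 1.0,
    update=lambda p, o: 0.5 * (p + o),
    poison=lambda o, k: -o,
    K=3,
    policy=0.0,
)
```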

Figure 5: Different poison aims of poisoning in supervised learning and reinforcement learning.

Poison Aim I - Poison Observation (O_r, O_s). The attacker could manipulate the observation of the learner, i.e., change O into Ǒ. This may happen when the attacker is able to intercept the communication between the learner and the environment, similar to the man-in-the-middle attack in cryptography. The attacker could target the rewards, called O_r-poisoning, studied by Huang & Zhu (2019); or the states, called O_s-poisoning, investigated by Behzadan & Munir (2017).

Poison Aim II - Poison MDP (M_R, M_P). An attacker could directly change the MDP (environment) that the learner is interacting with, i.e., change M into M̌. For example, a seller could influence the behaviors of customers by changing the prices of products. The poisoning of the MDP could be injected at the reward model R or the transition dynamics P, respectively denoted as M_R-poisoning (studied by Ma et al. (2019)) and M_P-poisoning (studied by Rakhsha et al. (2020)). The analogy of poisoning the MDP in SL is to manipulate the underlying distribution of the training data.

Figure 6: The online poisoning vs. learning process. Blue solid lines denote the learning processes, while red dashed lines denote the poisoning processes. In iteration k, the learner uses its previous policy π_{k-1} to roll out observations O_k from the current MDP M_k, then updates its model and policy by π_k = f(π_{k-1}, O_k) = argmax_π J(π, π_{k-1}, O_k). The attacker may (a) poison observations after they are generated, (b) poison the MDP before the learner generates observations, or (c) poison the policy when it is used to generate observations. In all three scenarios, O_k is influenced by the attack, and is thus denoted as Ǒ_k.

Figure 7: An intuitive example of MDP poisoning.

… A_π(s, a) + Σ_{a∈A} π'(a|s) A_π(s, a) − Σ_{a∈A} π(a|s) A_π(s, a)
= Σ_{s∈S} g_π(s) [ Σ_{a∈A} A_π(s, a) (π'(a|s) − π(a|s)) − Σ_{a∈A} π(a|s) A_π(s, a) ]

Algorithm 3: Non-targeted White-box VA2C-P with O_r-Poisoning
Input: total iterations of learning K; poisoning power ε; poisoning budget C; attacker's learning rate β; maximum computing iterations J; distribution distance measure d
Initialize policy discrepancies as an empty list Ψ = ∅
Initialize the number of already poisoned iterations c = 0
Initialize value network V_ω
for k = 1, …, K do
    if c > C then
        Break
    Get the current observation O_k and the learner's policy model θ_{k-1}
    Fit the value function: ω ← argmin_ω Σ_{τ_i∈O_k} Σ_{t=1}^{T} (V_ω(s …

    if ψ_k ≥ the (C−c)/(K−k)-th largest element in Ψ then
        Attack: replace r with ř in O_k and send it back to the learner
        c ← c + 1
Procedure Update(θ, r):
    Perform an update with the learner's algorithm: θ ← f(θ, r)
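The value-fitting step ω ← argmin_ω Σ (V_ω(s_t) − R_t)² in Algorithm 3 can be sketched, for a linear value function, as an ordinary least-squares fit of states to empirical returns; the linear feature map and all names here are our own simplification of the paper's value network.

```python
import numpy as np

def fit_linear_value(states, returns):
    """Fit V_omega(s) = omega^T [s; 1] to empirical returns R_t by least
    squares -- a minimal stand-in for the critic update
    omega <- argmin_omega sum_t (V_omega(s_t) - R_t)^2."""
    X = np.hstack([states, np.ones((len(states), 1))])  # append bias feature
    omega, *_ = np.linalg.lstsq(X, returns, rcond=None)
    return omega

def value(omega, s):
    """Evaluate the fitted value function at state s."""
    return omega[:-1] @ s + omega[-1]

# Returns generated by an exactly linear target V*(s) = 2*s0 - s1 + 3,
# so the least-squares fit should recover it up to numerical error.
rng = np.random.default_rng(1)
S = rng.normal(size=(50, 2))
R = 2.0 * S[:, 0] - 1.0 * S[:, 1] + 3.0
omega = fit_linear_value(S, R)
```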

Figure 9: Poisoning the off-policy algorithm SAC with VA2C-P. D = O_r, C/K = 1, ε = 0.6.

Then, the attacker predicts the next-policy π̃_{j+1} under the poisoned observation, based on the learner's update rule argmax_π J(π, π̃_j, Ǒ_j | Ď_j), where D ∈ {O_s, O_a, O_r} stands for the poison aim, and Ǒ | Ď denotes that O is poisoned into Ǒ given that poison aim D is poisoned into Ď. However, the imitating policy π̃ may or may not be the same as the actual learner's policy, depending on the attacker's knowledge. As introduced in Section 2, we deal with both white-box and black-box attackers, neither of whom knows the environment M. A white-box attacker knows the current and past observations O_{1:k}, the learner's algorithm f, and the policy π, so it can directly copy the policy: π̃_j = π_j for all j. A black-box attacker knows the current and past observations O_{1:k}, but does not know the learner's policy π. In this case, the attacker has to estimate π at every iteration; Section 4.3 describes how to estimate π̃.

environments with increasing difficulty levels: CartPole, LunarLander, Hopper, Walker, and HalfCheetah. All results are averaged over 10 random seeds. The total effort U is calculated by the normalized ℓ2-distance. See Appendix G.1 for the expression of U, as well as more hyper-parameter settings.

Assumption 1: the learner's update function f(θ, ř) is differentiable w.r.t. ř and θ.
Assumption 2: the environment and the policies are all deterministic, and the reward r generated by a policy π_θ is differentiable w.r.t. θ (i.e., R(s, π_θ(s)) is differentiable w.r.t. θ).
If Assumptions 1 and 2 hold, the necessary condition for (ř#_k, …, ř#_K) being an optimal solution to Problem (P*) is, for all

, for Problem (P2), the necessary conditions of optimality are … If ř#_1 and ř#_2 are the optimal solutions to Problems (P1) and (P2), they satisfy Equations (17) ∼ (22).

From Equations (26) and (27), we get

ACKNOWLEDGMENTS

This work is supported by National Science Foundation IIS-1850220 CRII Award 030742-00001 and DOD-DARPA-Defense Advanced Research Projects Agency Guaranteeing AI Robustness against Deception (GARD), and Adobe, Capital One and JP Morgan faculty fellowships.

G.4 EXTENSION TO OFF-POLICY LEARNERS

Although we focus on on-policy policy-gradient learners in this paper, our poisoning method is also applicable to off-policy learners, which update their policies using mini-batches sampled from all historical observations (trajectories). If the adversary can manipulate the mini-batch the learner samples at every step, our proposed poisoning process works as usual. We implement this idea and test it on one of the state-of-the-art off-policy learning methods, SAC (Haarnoja et al., 2018); the results are shown in Figure 9, where VA2C-P significantly reduces the reward gained by the learner. Otherwise, if the adversary cannot see which mini-batch the learner samples but has access to the replay buffer, it can still alter or insert samples to influence learning.
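For the off-policy case above, the mini-batch interception can be sketched as a thin wrapper around the sampled batch that applies bounded O_r-style reward perturbations; the dictionary batch format and the sign-flip heuristic are our own illustrative choices, not the paper's exact poisoning objective.

```python
import numpy as np

def poison_minibatch_rewards(batch, eps):
    """Perturb each reward in a sampled mini-batch by at most eps
    (an l_inf power constraint), here crudely pushing every reward
    toward zero as a simple non-targeted heuristic."""
    poisoned = dict(batch)                      # leave other fields intact
    r = np.asarray(batch["rewards"], dtype=float)
    poisoned["rewards"] = r - eps * np.sign(r)  # bounded perturbation
    return poisoned

batch = {"states": [[0.1], [0.2]], "rewards": [1.0, -0.5]}
poisoned = poison_minibatch_rewards(batch, eps=0.6)
# rewards become [0.4, 0.1]; each perturbation's magnitude is at most eps
```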

