PALM: PREFERENCE-BASED ADVERSARIAL MANIPULATION AGAINST DEEP REINFORCEMENT LEARNING

Abstract

Deep reinforcement learning (DRL) methods are vulnerable to adversarial attacks such as perturbing observations with imperceptible noise. To improve the robustness of DRL agents, it is important to study their vulnerability to adversarial attacks that induce the extreme behaviors an adversary desires. Preference-based RL (PbRL) aims to learn desired behaviors from human preferences. In this paper, we propose PALM, a preference-based adversarial manipulation method against DRL agents, which uses human preferences to perform targeted attacks with the assistance of an intention policy and a weighting function. The intention policy is trained within the PbRL framework and guides the adversarial policy, mitigating the restrictions that the victim policy imposes on exploration; the weighting function learns to assign weights to states so as to improve the performance of the adversarial policy. Theoretical analysis shows that PALM converges to critical points under some mild conditions. Empirical results on several manipulation tasks from Meta-world show that PALM outperforms state-of-the-art adversarial attack methods in the targeted setting. Additionally, we demonstrate the vulnerability of offline RL agents by fooling them into behaving as humans desire on several Mujoco tasks. Our code and videos are available at https://sites.google.com/view/palm-adversarial-attack.

1. INTRODUCTION

Adversarial examples for image classifiers have prompted a new field that studies the vulnerability of deep neural networks (DNNs). Recent research demonstrates that reinforcement learning (RL) agents parameterized by DNNs are also vulnerable to adversarial attacks (Huang et al., 2017; Pattanaik et al., 2018; Zhang et al., 2020; 2021; Sun et al., 2022). Adversaries generate imperceptible perturbations to the observations of the victim agent, causing it to fail at its original task. While adversarial attacks are a crucial tool for evaluating the vulnerability of agents, targeted attacks in RL have received little attention. Recently, embodied intelligence (Gupta et al., 2021; Liu et al., 2022; Ahn et al., 2022; Reed et al., 2022; Fan et al., 2022) has come to be regarded as a meaningful way to improve the cognitive ability of artificial intelligence. These embodied agents show powerful capabilities while possibly exposing vulnerabilities. We therefore ask: how can one manipulate an agent to perform desired behaviors, and are embodied agents robust to adversarial manipulation? To achieve targeted adversarial attacks, one straightforward way is to design a dedicated reward for the adversary. However, specifying a precise reward function can be challenging. For example, in board games such as Go or chess, it is difficult to craft a reward function that measures the quality of each move. In the preference-based RL framework, a human only needs to provide binary preference labels over pairs of agent trajectories (Christiano et al., 2017). Compared to reward engineering, preference-based RL is therefore an easier way to learn policies, since it relies only on human preferences.
Meanwhile, recent research on preference-based RL shows an excellent capacity to learn novel behaviors from few preference labels (Lee et al., 2021a; Park et al., 2022; Liang et al., 2022) and significantly improves feedback efficiency. Motivated by this, we consider using preference-based RL to perform targeted adversarial attacks, for two reasons. On the one hand, it is difficult to define the desired behaviors explicitly for a targeted attack, but humans can implicitly inject their intentions by providing binary preferences. On the other hand, preference-based RL is a data-efficient manipulation method, because a few preference labels are enough to learn well-behaved policies. As shown in Figure 1, PALM can recover the reward function of the desired behaviors through the preference-based RL framework and manipulate agents to perform the behaviors humans desire. In this paper, we propose the Preference-based AdversariaL Manipulation (PALM) algorithm, which performs targeted attacks based on human preferences. PALM includes an adversary that perturbs the observations of the victim. To achieve targeted attacks and better exploration, we introduce an intention policy, learned from human preferences, as the learning target of the adversary. Additionally, we use a weighting function to assist adversary learning by re-weighting state examples. In summary, our contributions are three-fold. First, to the best of our knowledge, we propose the first targeted adversarial attack method against DRL agents via preference-based RL. Second, we theoretically analyze PALM and provide a convergence guarantee under some mild conditions. Third, we design two scenarios, and experiments on Meta-world demonstrate that PALM outperforms the baselines by a large margin. Empirical results demonstrate that both online and offline RL agents are vulnerable to our proposed adversarial attacks.

2. RELATED WORK

Many previous works on adversarial attacks study the vulnerability of DRL agents. Huang et al. (2017) compute adversarial perturbations with FGSM (Goodfellow et al., 2015) to mislead the victim policy into not choosing the optimal action. Pattanaik et al. (2018) present an approach that leads the victim to select the worst action according to the victim's Q-function. Gleave et al. (2020) conduct adversarial attacks in a two-player Markov game instead of perturbing the agent's observations. Zhang et al. (2020) propose the state-adversarial MDP (SA-MDP) and develop two adversarial attack methods named Robust Sarsa (RS) and Maximal Action Difference (MAD). SA-RL (Zhang et al., 2021) directly optimizes the adversary policy to perturb states in an end-to-end RL formulation. PA-AD (Sun et al., 2022) designs an RL-based "director" to find the optimal policy-perturbing direction and constructs an optimization-based "actor" to craft perturbed states according to the given direction. Untargeted adversarial attack methods focus on making the victim policy fail, whereas our approach emphasizes manipulating the victim policy; that is, under attack the victim's behaviors are consistent with the preferences of the manipulator. Another line of work (Pinto et al., 2017; Mandlekar et al., 2017; Pattanaik et al., 2018) uses adversarial examples to improve the robustness of policies, which is outside the scope of this paper.

A few prior works focus on targeted attacks on RL agents. Lin et al. (2017) first propose a targeted adversarial attack method against DRL agents, which attacks the agent so that it reaches a targeted state. Buddareddygari et al. (2022) also present a strategy that misleads the agent towards a specific state by placing an object in the environment. The hijacking attack (Boloor et al., 2020) is proposed to make agents perform targeted actions in autonomous driving systems. Hussenot et al. (2019) provide a new perspective that attacks the agent so that it imitates a target policy. Our method differs in that PALM manipulates the victim to behave as humans desire and builds on preference-based RL. Xiao et al. (2019) propose the first adversarial attack method against a real-world visual navigation robot. Lee et al. (2021b) investigate targeted adversarial attacks against the action space of the agent. Our method differs in that PALM leverages preference-based RL to avoid reward engineering and learns an intention policy to tackle the restricted exploration problem, so that PALM can attack the victim policy into performing behaviors far from its original ones.

[Figure caption: In the inner level, the adversary is optimized to approach the intention policy, which is learned via preference-based RL. In the outer level, the weighting function is updated to maximize the performance of the adversary, as evaluated by the outer loss.]

Training agents with human feedback has been investigated in several works. Preference-based RL provides an effective way to use human preferences for agent learning. Christiano et al. (2017) propose a basic learning framework for preference-based RL. To further improve feedback efficiency, Ibarz et al. (2018) additionally use expert demonstrations to initialize the policy, besides learning the reward model from human preferences. However, these earlier methods need a large amount of human feedback, which is usually impractical. Many recent works tackle this problem. Lee et al. (2021a) present a feedback-efficient preference-based RL algorithm, which benefits from unsupervised exploration and reward relabeling. Park et al. (2022) further improve feedback efficiency through semi-supervised reward learning and data augmentation, while Liang et al. (2022) propose an intrinsic reward to enhance exploration. To the best of our knowledge, our method is the first to achieve a targeted adversarial attack against DRL agents through preference-based RL.

3. PROBLEM SETUP

The Victim Policy. In RL, agent learning can be modeled as a finite-horizon Markov Decision Process (MDP), defined as a tuple $(S, A, R, P, \gamma)$. $S$ and $A$ denote the state and action spaces, respectively. $R : S \times A \times S \to \mathbb{R}$ is the reward function and $\gamma \in (0, 1)$ is the discount factor. $P : S \times A \times S \to [0, 1]$ denotes the transition dynamics, which gives the probability of transitioning to $s'$ given state $s$ and action $a$. We denote the stationary policy by $\pi_\nu : S \to \mathcal{P}(A)$, where $\nu$ are the parameters of the victim. We assume that the victim policy is fixed and represented by a function approximator.

The Adversarial Policy. To study adversary learning with human preferences, we formulate it as a rewarded state-adversarial Markov Decision Process (RSA-MDP). Formally, an RSA-MDP is a tuple $(S, A, B, R, P, \gamma)$. The adversary $\pi_\alpha : S \to \mathcal{P}(S)$ perturbs the states before the victim observes them, where $\alpha$ are the parameters of the adversary. Specifically, the adversary perturbs the state $s$ into $\tilde{s}$, which is restricted to $B(s)$ (i.e., $\tilde{s} \in B(s)$). $B(s)$ is defined as a small set $\{\tilde{s} \in S : \|s - \tilde{s}\|_p \le \epsilon\}$, which limits the attack power of the adversary, and $\epsilon$ is the attack budget. Since directly generating $\tilde{s} \in B(s)$ is hard, the adversary learns to produce a Gaussian noise $\Delta$ with $\ell_\infty(\Delta)$ less than 1, and we obtain the perturbed state through $\tilde{s} = s + \Delta \cdot \epsilon$. The victim takes actions according to the observed $\tilde{s}$, while the true states of the environment are not changed. The perturbed policy is denoted by $\pi_{\nu \circ \alpha}$. Different from SA-MDP (Zhang et al., 2020), RSA-MDP introduces $R$, which is consistent with human preferences. The goal in an RSA-MDP is to find the optimal adversary $\pi^*_\alpha$, which makes the victim achieve the maximum expected return over all states. Lemma 1 shows that finding the optimal adversary in an RSA-MDP is equivalent to finding the optimal policy in the MDP $\hat{M} = (S, \hat{A}, R, \hat{P}, \gamma)$, where $\hat{A} = S$ and $\hat{P}$ is the transition dynamics of the adversary.
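The perturbation step described above can be sketched in a few lines. This is only an illustration: the function name `perturb_state` is hypothetical, and clipping the raw noise to $[-1, 1]$ is an assumption made here to enforce the $\ell_\infty$ bound on $\Delta$; the text does not say how the bound is enforced.

```python
import numpy as np

def perturb_state(state, raw_noise, epsilon):
    """Return a perturbed observation inside the l_inf ball of radius epsilon.

    `raw_noise` stands in for the adversary's (e.g. Gaussian) output.
    """
    # Assumption: clip the noise so that ||delta||_inf <= 1.
    delta = np.clip(raw_noise, -1.0, 1.0)
    # s_tilde = s + delta * epsilon; the true environment state is untouched.
    return state + delta * epsilon

rng = np.random.default_rng(0)
s = rng.normal(size=4)
s_tilde = perturb_state(s, rng.normal(size=4) * 3.0, epsilon=0.1)
```

Only the observation passed to the victim changes; the environment keeps evolving from the unperturbed state `s`.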

4. METHOD

In this section, we introduce our method PALM, which leverages preference-based RL to achieve targeted attacks against DRL agents. The core idea of PALM is twofold. On the one hand, PALM learns an intention policy as the learning target of the adversarial policy, in order to tackle the restricted exploration problem. On the other hand, PALM takes advantage of the feedback-efficient preference-based RL method PEBBLE (Lee et al., 2021a) to avoid reward engineering. In addition, we introduce a weighting function to improve the performance of the adversary and formulate PALM as a bi-level optimization algorithm. The framework of PALM is shown in Figure 2, and the detailed procedure is summarized in Appendix A.

4.1. LEARNING INTENTION POLICY

PALM aims to find the optimal adversary that manipulates the victim's behaviors to be consistent with human intentions. However, because the victim policy is pre-trained to complete a specific task, directly learning an adversary suffers from the exploration difficulty caused by the restriction of the victim policy, which makes it hard to find the desired adversarial policy efficiently. Therefore, we introduce an intention policy $\pi_\theta$, which has an unrestricted exploration space, to guide adversarial policy training. To achieve targeted attacks and avoid reward engineering, we inject human intentions into the intention policy via the preference-based RL framework, which is shown in Figure 3. In preference-based RL, the agent has no access to the ground-truth reward function. To learn a reward function, humans provide preference labels between two trajectories of the agent, and the reward function $r_\psi$ learns to align with these preferences (Christiano et al., 2017). Formally, a segment $\sigma$ of length $k$ is a sequence of states and actions $\{s_{t+1}, a_{t+1}, \dots, s_{t+k}, a_{t+k}\}$. A human expert is required to give a label $y$ for a pair of segments $(\sigma^0, \sigma^1)$ to indicate which segment is preferred, where $y \in \{(0, 1), (1, 0), (0.5, 0.5)\}$. Following the Bradley-Terry model (Bradley & Terry, 1952), a preference predictor is constructed in (1):

$$P_\psi[\sigma^0 \succ \sigma^1] = \frac{\exp \sum_t r_\psi(s^0_t, a^0_t)}{\exp \sum_t r_\psi(s^0_t, a^0_t) + \exp \sum_t r_\psi(s^1_t, a^1_t)}, \tag{1}$$

where $\sigma^0 \succ \sigma^1$ denotes that $\sigma^0$ is preferred to $\sigma^1$. This predictor indicates that the probability that a segment is preferred is proportional to its exponentiated return. Then, the reward function is optimized by aligning the predicted preference labels with human preferences through the cross-entropy loss:

$$L(\psi) = -\mathbb{E}_{(\sigma^0, \sigma^1, y) \sim D} \Big[ y(0) \log P_\psi[\sigma^0 \succ \sigma^1] + y(1) \log P_\psi[\sigma^1 \succ \sigma^0] \Big], \tag{2}$$

where $D$ is a dataset of triplets $(\sigma^0, \sigma^1, y)$ consisting of segment pairs and human preference labels.
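As an illustration of the preference predictor and the cross-entropy loss, the following NumPy sketch evaluates both for a single pair of segments. The function names are hypothetical, the per-step rewards are passed in directly instead of coming from a learned network $r_\psi$, and the small constant `eps` is added here only to keep the logarithm finite.

```python
import numpy as np

def preference_prob(rewards_0, rewards_1):
    """P[sigma^0 > sigma^1] under the Bradley-Terry model, from per-step rewards."""
    ret_0, ret_1 = np.sum(rewards_0), np.sum(rewards_1)
    # Algebraically equal to exp(ret_0) / (exp(ret_0) + exp(ret_1)).
    return 1.0 / (1.0 + np.exp(ret_1 - ret_0))

def preference_loss(rewards_0, rewards_1, label, eps=1e-12):
    """Cross-entropy between a label y = (y(0), y(1)) and the predicted preference."""
    p0 = preference_prob(rewards_0, rewards_1)
    return -(label[0] * np.log(p0 + eps) + label[1] * np.log(1.0 - p0 + eps))

# Segment 0 has the larger predicted return, so it is predicted as preferred.
p = preference_prob([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
loss_agree = preference_loss([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], (1.0, 0.0))
loss_disagree = preference_loss([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], (0.0, 1.0))
```

A label that agrees with the predicted ordering gives a smaller loss than one that contradicts it, which is what drives $r_\psi$ towards the human preferences during training.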
By minimizing (2), we obtain a reward function estimator $r_\psi$, which provides estimated rewards for agent learning with any RL algorithm. Following PEBBLE (Lee et al., 2021a), we use the off-policy actor-critic method SAC (Haarnoja et al., 2018) to learn a well-performing policy. Specifically, the Q-function $Q_\phi$ is optimized by minimizing the Bellman residual:

$$J_Q(\phi) = \mathbb{E}_{\tau_t \sim \mathcal{B}} \Big[ \big( Q_\phi(s_t, a_t) - r_t - \gamma \bar{V}(s_{t+1}) \big)^2 \Big], \tag{3}$$

where $\bar{V}(s_t) = \mathbb{E}_{a_t \sim \pi_\theta} \big[ Q_{\bar{\phi}}(s_t, a_t) - \mu \log \pi_\theta(a_t | s_t) \big]$, $\tau_t = (s_t, a_t, r_t, s_{t+1})$ is the transition at time step $t$ sampled from the replay buffer $\mathcal{B}$, and $\bar{\phi}$ is the parameter of the target soft Q-function. The policy $\pi_\theta$ is updated by minimizing (4):

$$J_\pi(\theta) = \mathbb{E}_{s_t \sim \mathcal{B}, a_t \sim \pi_\theta} \big[ \mu \log \pi_\theta(a_t | s_t) - Q_\phi(s_t, a_t) \big], \tag{4}$$

where $\mu$ is the temperature parameter. By learning an intention policy, PALM tackles the restricted exploration problem and provides an attack target for the subsequent adversary training.
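The two SAC losses can be written out for a batch of already-evaluated quantities, which makes the role of the temperature and the target network explicit. This is a minimal sketch under stated assumptions: the function names are hypothetical, the expectation over actions is replaced by one sampled action per state, and the Q-values and log-probabilities are plain arrays, not network outputs.

```python
import numpy as np

def soft_value(q_target_next, log_prob_next, mu):
    """Single-sample estimate of the soft value: Q_target(s', a') - mu * log pi(a'|s')."""
    return q_target_next - mu * log_prob_next

def critic_loss(q, reward, q_target_next, log_prob_next, gamma, mu):
    """Squared soft Bellman residual, averaged over the batch (equation (3))."""
    target = reward + gamma * soft_value(q_target_next, log_prob_next, mu)
    return np.mean((q - target) ** 2)

def actor_loss(q_new, log_prob_new, mu):
    """Policy objective mu * log pi(a|s) - Q(s, a), averaged over the batch (equation (4))."""
    return np.mean(mu * log_prob_new - q_new)

# One transition: soft value 2.0 + 0.5 = 2.5, target 1.0 + 0.9 * 2.5 = 3.25.
example_critic = critic_loss(np.array([3.25]), np.array([1.0]), np.array([2.0]),
                             np.array([-1.0]), gamma=0.9, mu=0.5)
example_actor = actor_loss(np.array([2.0]), np.array([-1.0]), mu=0.5)
```

In the setting of this section, `reward` would be the estimated reward $r_\psi$ and not an environment reward.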

4.2. LEARNING ADVERSARIAL POLICY AND WEIGHTING FUNCTION

To make the victim policy perform the behaviors humans desire, PALM learns the adversary by minimizing the KL divergence between the perturbed policy $\pi_{\nu \circ \alpha}$ and the intention policy $\pi_\theta$. However, different states may differ in how important they are for inducing the victim policy towards the target. To stabilize the training process and improve the performance of the adversary, we introduce a weighting function $h_\omega$ to re-weight states in adversary training. We formulate PALM as a bi-level optimization algorithm, which alternately updates the adversarial policy $\pi_\alpha$ and the weighting function $h_\omega$ through inner and outer optimization. In the inner level, PALM optimizes $\alpha$ with the importance weights output by the weighting function $h_\omega$; in the outer level, it optimizes $\omega$ according to the performance of the adversary. Intuitively, the adversary learns to approach the intention policy in the inner level, while the weighting function learns to improve the adversary by evaluating its performance through a meta-level loss. The whole objective of PALM is:

$$\min_\omega J_\pi(\alpha(\omega)), \quad \text{s.t. } \alpha(\omega) = \arg\min_\alpha L_{att}(\alpha; \omega, \theta). \tag{5}$$

Inner-level Optimization: Training adversarial policy $\pi_\alpha$. In the inner-level optimization, given the intention policy $\pi_\theta$ and the weighting function $h_\omega$, we seek the optimal adversarial policy by minimizing the re-weighted KL divergence between $\pi_{\nu \circ \alpha}$ and $\pi_\theta$ in (6):

$$L_{att}(\alpha; \omega, \theta) = \mathbb{E}_{s \sim \mathcal{B}} \big[ h_\omega(s) \, D_{KL}\big( \pi_{\nu \circ \alpha}(s) \,\|\, \pi_\theta(s) \big) \big], \tag{6}$$

where $h_\omega(s)$ is the importance weight output by the weighting function $h_\omega$. Intuitively, the adversarial policy is optimized to bring the perturbed policy close to the intention policy, while $h_\omega$ assigns different weights to states of different importance. With the joint assistance of the intention policy and the weighting function, PALM learns the adversarial policy efficiently.

Outer-level Optimization: Training weighting function $h_\omega$.
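The re-weighted KL objective of the inner level can be illustrated for a batch of states. The sketch below assumes that both the perturbed policy and the intention policy are diagonal Gaussians (a common choice with SAC, but not stated in the text), and the function names are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between diagonal Gaussians, summed over action dimensions."""
    var_p, var_q = std_p ** 2, std_q ** 2
    kl = np.log(std_q / std_p) + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q) - 0.5
    return kl.sum(axis=-1)

def attack_loss(weights, mu_pert, std_pert, mu_int, std_int):
    """Batch estimate of L_att: mean over states of h(s) * KL(perturbed || intention)."""
    return np.mean(weights * gaussian_kl(mu_pert, std_pert, mu_int, std_int))

# Two states, one action dimension, unit variances:
# per-state KL is 0.5 * (mean gap)^2, i.e. 0.5 and 2.0.
ones = np.ones((2, 1))
loss = attack_loss(np.array([1.0, 0.5]),
                   np.zeros((2, 1)), ones,
                   np.array([[1.0], [2.0]]), ones)
```

With weights 1.0 and 0.5 the two states contribute 0.5 and 1.0, so the state that is further from the intention policy still dominates, but less than it would with uniform weights.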
For the outer-level optimization, we need a precise weighting function that balances the state distribution and assigns proper weights to drive adversary learning. The weighting function is trained to distinguish the importance of states by evaluating the performance of the perturbed policy. Specifically, the perturbed policy $\pi_{\nu \circ \alpha}$ is evaluated using the policy loss in (7), which is adapted from the policy loss in (4):

$$J_\pi(\alpha(\omega)) = \mathbb{E}_{s_t \sim \mathcal{B}, a_t \sim \pi_{\nu \circ \alpha(\omega)}} \big[ \mu \log \pi_{\nu \circ \alpha(\omega)}(a_t | s_t) - Q_\phi(s_t, a_t) \big], \tag{7}$$

where $\alpha(\omega)$ denotes that $\alpha$ implicitly depends on $\omega$. Therefore, PALM calculates the implicit derivative of $J_\pi(\alpha(\omega))$ with respect to $\omega$ and finds the optimal $\omega^*$ by optimizing (7). To make this feasible, we approximate $\arg\min_\alpha$ with a one-step gradient update. Equation (8) gives this one-step estimate $\hat{\alpha}(\omega)$ and builds a connection between $\alpha$ and $\omega$:

$$\hat{\alpha}(\omega) \approx \alpha_t - \eta_t \nabla_\alpha L_{att}(\alpha; \omega, \theta) \big|_{\alpha_t}. \tag{8}$$

According to the chain rule, the gradient of the outer loss with respect to $\omega$ can be expressed as:

$$\nabla_\omega J_\pi(\hat{\alpha}(\omega)) \big|_{\omega_t} = \nabla_{\hat{\alpha}} J_\pi(\hat{\alpha}(\omega)) \big|_{\hat{\alpha}_t} \nabla_\omega \hat{\alpha}_t(\omega) \big|_{\omega_t} = \sum_s f(s) \cdot \nabla_\omega h(s) \big|_{\omega_t}, \tag{9}$$

where $f(s) = -\eta_t \cdot \big( \nabla_{\hat{\alpha}} J_\pi(\hat{\alpha}(\omega)) \big)^\top \nabla_\alpha D_{KL}\big( \pi_{\nu \circ \alpha}(s) \,\|\, \pi_\theta(s) \big)$; the detailed derivation can be found in Appendix B. The key to obtaining this meta-gradient is building and computing the relationship between $\alpha$ and $\omega$. Having obtained the implicit derivative, PALM updates the parameters of the weighting function by gradient descent with the outer learning rate. In addition, we theoretically analyze the convergence of PALM in Theorems 1 and 2. In Theorem 1, we establish the convergence rate of the outer loss, i.e., the gradient of the outer loss with respect to $\omega$ converges to zero. Thus PALM learns a more powerful adversary using the importance weights output by the optimal weighting function. In Theorem 2, we prove the convergence of the inner loss.
The inner loss of the PALM algorithm converges to critical points under some mild conditions, so that the parameters of the adversary converge to a critical point of the inner loss. Theorems and proofs can be found in Appendix D.
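The meta-gradient in (9) can be checked numerically on a toy problem. The sketch below is not the PALM implementation: the per-state KL term is replaced by a quadratic $\tfrac{1}{2}\|\alpha - c_s\|^2$, the outer loss by a quadratic distance to a hypothetical `target`, the weighting function is tabular ($h_\omega(s) = \omega_s$, so $\nabla_\omega h(s)$ is a unit vector), and the expectation over states is written as a sum. All names are assumptions made for illustration.

```python
import numpy as np

eta = 0.1                                    # inner learning rate eta_t
alpha = np.array([1.0, -1.0])                # current adversary parameters alpha_t
centers = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 3.0]])  # one toy state per row
omega = np.array([0.5, 1.0, 0.2])            # tabular weights h_omega(s) = omega[s]
target = np.array([0.5, 0.5])

def grad_d(alpha, s):
    """Gradient of the stand-in divergence 0.5 * ||alpha - c_s||^2."""
    return alpha - centers[s]

def outer_loss(a):
    return 0.5 * np.sum((a - target) ** 2)

def one_step(alpha, omega):
    """One-step approximation of the inner arg min, as in equation (8)."""
    grad_inner = sum(omega[s] * grad_d(alpha, s) for s in range(len(omega)))
    return alpha - eta * grad_inner

def meta_gradient(alpha, omega):
    """Entry s is f(s) = -eta * grad J(alpha_hat)^T grad d_s(alpha), as in equation (9)."""
    grad_outer = one_step(alpha, omega) - target
    return np.array([-eta * grad_outer @ grad_d(alpha, s) for s in range(len(omega))])

def finite_difference(alpha, omega, step=1e-6):
    """Central differences of the outer loss with respect to each weight."""
    grad = np.zeros_like(omega)
    for s in range(len(omega)):
        e = np.zeros_like(omega)
        e[s] = step
        grad[s] = (outer_loss(one_step(alpha, omega + e))
                   - outer_loss(one_step(alpha, omega - e))) / (2 * step)
    return grad

analytic = meta_gradient(alpha, omega)
numeric = finite_difference(alpha, omega)
```

The analytic and finite-difference gradients agree, which illustrates that differentiating through the one-step update is what couples the outer loss to the weights.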

5. EXPERIMENTS

In this section, we evaluate our method on several simulated robotic manipulation tasks from Meta-world (Yu et al., 2020) and continuous locomotion tasks from Mujoco (Todorov et al., 2012). Our experiments consist of two phases. In the first phase, we verify the efficacy of the proposed method in two scenarios: navigation and opposite behaviors. In the second phase, we show the capability of our approach by fooling a popular offline RL method, Decision Transformer (Chen et al., 2021), into performing specific behaviors. A detailed description of the tasks used in the experiments is provided in Appendix F.

5.1. SETUP

Baselines. We include the existing evasion attack methods for comparison to study the effectiveness of our approach.
• Random attack: a naive baseline that samples random perturbed observations from a uniform distribution.
• SA-RL (Zhang et al., 2021): this method learns an adversarial policy in an end-to-end RL formulation.
• PA-AD (Sun et al., 2022): this method combines an RL-based "director" and a non-RL "actor" to find state perturbations, and is the state-of-the-art adversarial attack algorithm against DRL.
• PALM: our proposed method, which jointly learns the adversarial policy and the weighting function with the guidance of the intention policy.

Implementation Settings. We compare PALM with existing adversarial attack methods, which attack the victim to reduce its cumulative reward rather than to manipulate it. For a fair comparison, we make simple adjustments to SA-RL and PA-AD to suit our setting. In their original versions, both methods use the negative of the reward obtained by the victim to train an adversary. We replace it with the same estimated reward function $r_\psi$ that our method uses, which means they also learn from human preferences. Following the settings in PEBBLE (Lee et al., 2021a), we use a scripted teacher that provides ground-truth preference labels. More details on the scripted teacher and preference collection can be found in Appendix E. For the implementation of SA-RL and PA-AD, we use the released official codebases. For a fair comparison, all methods learned via preference-based RL are given the same number of preference labels. In the navigation scenario, we use 9000 labels for all tasks. In the opposite behaviors scenario, we use 1000 labels for Window Close, 3000 for Drawer Close, 5000 for Faucet Open, Faucet Close and Window Open, and 7000 for Drawer Open, Door Lock and Door Unlock.
Also, to reduce the impact of preference-based RL, we additionally include oracle versions of SA-RL and PA-AD, which use the ground-truth rewards of the targeted task. We use the same experimental settings (i.e., hyper-parameters and neural networks) for reward learning in all methods. We quantitatively evaluate all methods by comparing the final success rate of manipulation, which is defined by Meta-world (Yu et al., 2020) for the opposite behaviors scenario and which we design ourselves for the navigation scenario. As in most existing research (Zhang et al., 2020; 2021; Sun et al., 2022), we consider state attacks under the $\ell_\infty$ norm in our experiments, and we report the mean and standard deviation across ten runs for all experiments. We also provide detailed hyper-parameter settings and implementation details in Appendix F.
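A scripted teacher of the kind mentioned above can be sketched as follows. This is an assumption-laden illustration and not the procedure of Appendix E: it assumes that the teacher prefers the segment with the larger ground-truth return and returns the neutral label on ties, and the function name `scripted_label` is hypothetical. The label convention follows Section 4.1, where $y = (1, 0)$ means that $\sigma^0$ is preferred.

```python
import numpy as np

def scripted_label(true_rewards_0, true_rewards_1):
    """Return a label in {(1, 0), (0, 1), (0.5, 0.5)} for a pair of segments.

    Assumption: preference is decided by the ground-truth segment return.
    """
    ret_0 = float(np.sum(true_rewards_0))
    ret_1 = float(np.sum(true_rewards_1))
    if ret_0 > ret_1:
        return (1.0, 0.0)   # segment 0 preferred
    if ret_1 > ret_0:
        return (0.0, 1.0)   # segment 1 preferred
    return (0.5, 0.5)       # equally preferable

label = scripted_label([1.0, 2.0], [0.0, 1.0])
```

Using such a teacher removes human noise from the comparison, so differences between methods reflect the attack algorithms and not the quality of the labels.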

5.2. MANIPULATION ON DRL AGENTS

We study the effectiveness of our method compared to adversarial attack algorithms, which are adapted to our setting with minimal changes. Specifically, we construct two different scenarios on various simulated robotic manipulation tasks. Each victim agent is well-trained for a specific manipulation task.

Scenarios on Navigation.

In this scenario, we expect the robotic arm to reach target coordinates instead of completing the original task. Figure 4 shows the training curves of the baselines and our method on eight manipulation tasks. When learning from preference labels, PALM surpasses the baselines by a large margin. To eliminate the influence of preference-based RL and further demonstrate the advantages of PALM, we additionally train the baseline methods with the ground-truth reward function and denote them as "oracle". We notice that the performance of SA-RL (oracle) improves greatly on several tasks over the preference-based version. However, PALM still outperforms SA-RL with oracle rewards on most tasks. These results demonstrate that PALM enables the agent to efficiently learn an adversarial policy from human preferences. We also observe that PA-AD is incapable of mastering manipulation, even when using the ground-truth rewards.

[Figure: success rate versus environment steps.]

Scenarios on Opposite Behaviors.

Robotic manipulation has considerable practical value in the real world. Therefore, we design this scenario to quantitatively evaluate the vulnerability of agents that have mastered various manipulation skills. Specifically, we expect each victim to complete the opposite task under the attack of the manipulator. For example, a victim that has mastered the skill of opening windows will close windows under a targeted attack. As shown in Figure 5, PALM performs very well and shows clear advantages over the baseline methods on all tasks. This result again indicates that PALM is effective for a wide range of tasks and can efficiently learn an adversarial policy from human preferences.

[Figure: success rate versus environment steps; legend: PALM, PA-AD (oracle), PA-AD, SA-RL (oracle), SA-RL, Random.]

5.3. MANIPULATION ON THE POPULAR OFFLINE RL AGENTS

In this experiment, we show the vulnerability of offline RL agents and demonstrate that PALM can fool them into performing the behaviors humans desire. For the implementation, we choose models available online as victims, which are trained with the official implementation on D4RL. We choose two tasks, Cheetah and Walker, and use expert-level Decision Transformer agents as the victims. As shown in Figure 6, Decision Transformer exhibits exploitable weaknesses and is misled into performing the behavior humans desire instead of the original task. Specifically, under adversarial manipulation, the Cheetah agent runs backwards quickly in Figure 6a and does a 90-degree push-up in Figure 6c. The Walker agent balances on one foot in Figure 6b and dances with one leg lifted in Figure 6d. The results show that PALM can manipulate these victims into behaviors consistent with human preferences and that embodied agents are extremely vulnerable to these well-trained adversaries. We hope this experiment can inspire future work on the robustness of offline RL agents and embodied AI.

Contribution of Each Component. We conduct additional experiments to investigate the effect of each component of PALM, on Drawer Open and Drawer Close for the navigation scenario and on Faucet Open and Faucet Close for the opposite behaviors scenario. PALM contains three critical components: the weighting function $h_\omega$, the intention policy $\pi_\theta$, and the combined policy. Table 1 shows that the intention policy plays an essential role in PALM. As shown in Figure 7, the intention policy mitigates the exploration difficulty caused by the restriction of the victim policy and improves the exploration ability of PALM, leading to a better adversary. We also observe that the combined policy balances the discrepancy between $\pi_\theta$ and $\pi_{\nu \circ \alpha}$ on the state distribution and improves the adversary's performance.
In addition, formulating adversary learning as a bi-level optimization lets us train the weighting function to distinguish state importance at low cost, which further improves the asymptotic performance of PALM. These empirical results show that the key ingredients of PALM complement each other and all contribute to its success. To verify the restricted exploration problem, we visualize the exploration space of PALM with and without the intention policy. Figure 7 shows that the intention policy significantly improves the exploration ability of PALM.

5.4. ABLATION STUDY

Effects of the Weighting Function. To further understand the role of the weighting function proposed in Section 4, we analyze and visualize experimental data from multiple perspectives. Before PALM converges, we uniformly sample five perturbed policies whose performance increases sequentially. For each policy, we roll out 100 trajectories and obtain the trajectory weight vectors from the weighting function. Using t-SNE (van der Maaten & Hinton, 2008), we visualize the weight vectors of the different policies in Figure 8a. The figure shows clear boundaries between the trajectory weights of the different policies, suggesting that the weighting function can distinguish trajectories of different quality. In Figure 8b, a darker color indicates trajectories with higher manipulation success rates. The result shows that the weighting function gives higher weights to better trajectories, which improves the performance of the adversarial policy. To further illustrate the effect of the weighting function, we present a heat map of the weight distribution in 2D coordinates and annotate part of the trajectories of the perturbed policy. As Figure 8c shows, the weighting function assigns higher scores to the states surrounding the trajectories of the perturbed policy, especially in the early stage before the target point is reached. Extensive experiments analyzing the impact of the feedback amount and the attack budget on the performance of PALM are reported in Appendix G.

6. CONCLUSION

In this paper, we propose PALM, a preference-based adversarial attack approach against DRL, which can mislead the victim into performing the behaviors desired by an adversary. PALM involves an adversary that adds imperceptible perturbations to the observations of the victim, an intention policy learned through preference-based RL for better exploration, and a weighting function that identifies essential states for an efficient adversarial attack. We analyze the convergence of PALM and prove that PALM converges to critical points under some mild conditions. Empirically, we design two scenarios on several manipulation tasks of Meta-world, and the results demonstrate that PALM outperforms the baselines in the targeted adversarial setting. We further show the vulnerability of embodied agents by attacking Decision Transformer on some Mujoco tasks. For future work, we consider (1) further improving attack efficiency by making better use of human preferences and (2) extending the observation space of the victim to high-dimensional inputs, such as images and natural language.

ETHICS STATEMENT

Preference-based RL provides an effective way to train agents without a carefully designed reward function. However, learning from human preferences means that humans need to provide labeled data, which inevitably carries biases that introduce systematic error. There are possible negative impacts if malicious people attack other policies using our method. However, our approach also makes other researchers aware of the vulnerability of policies, which benefits AI safety.

REPRODUCIBILITY STATEMENT

The details of the experimental settings are provided in Section 5. We provide detailed proofs of the theoretical analysis in Appendix D. A more detailed description and the implementation settings can be found in Appendix F. We also give the link to our source code and videos in the abstract.

A THE FULL PROCEDURE OF PALM

The Combined Policy. Although $\pi_\theta$ guides adversarial policy learning, the discrepancy between $\pi_\theta$ and $\pi_{\nu \circ \alpha}$ on the state distribution leads to inefficiency. To handle this issue, we construct a behavior policy $\pi$ to collect transitions in our practical implementation. Inspired by branched rollouts (Janner et al., 2019), we combine the intention policy $\pi_\theta$ and the perturbed policy $\pi_{\nu \circ \alpha}$ as $\pi^{1:h} = \pi^{1:h}_{\nu \circ \alpha}$ and $\pi^{h+1:H} = \pi^{h+1:H}_\theta$, where $h \sim U(0, H)$ and $H$ is the task horizon. The combined policy $\pi$ collects data and stores it in the replay buffer during learning. We provide the detailed procedure of our proposed method in Algorithm 1. PALM is implemented on top of the popular preference-based RL algorithm PEBBLE (Lee et al., 2021a).
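The construction of the combined policy can be sketched as a rollout that switches policies at a random step. The sketch makes two assumptions that the text does not settle: $U(0, H)$ is treated as a discrete uniform distribution over $\{0, \dots, H\}$, and the environment is abstracted as a hypothetical `env_step(state, action)` function. The policy arguments are plain callables standing in for $\pi_{\nu \circ \alpha}$ and $\pi_\theta$.

```python
import numpy as np

def combined_rollout(perturbed_policy, intention_policy, env_step, s0, horizon, rng):
    """Collect one episode: perturbed policy for steps 1..h, intention policy afterwards."""
    # Assumption: h is drawn once per episode, uniformly from {0, ..., H}.
    h = int(rng.integers(0, horizon + 1))
    state, transitions = s0, []
    for t in range(1, horizon + 1):
        policy = perturbed_policy if t <= h else intention_policy
        action = policy(state)
        next_state = env_step(state, action)
        transitions.append((state, action, next_state))
        state = next_state
    return h, transitions

rng = np.random.default_rng(0)
h, traj = combined_rollout(lambda s: "perturbed", lambda s: "intention",
                           lambda s, a: s + 1, s0=0, horizon=10, rng=rng)
```

Starting each episode with the perturbed policy means the intention policy takes over from states that the attacked victim actually reaches, which is how the combination narrows the gap between the two state distributions.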

Algorithm 1 PALM

Input: a fixed victim policy $\pi_\nu$, frequency of human feedback $K$, outer loss updating frequency $M$, task horizon $H$
 1: Initialize parameters of $Q_\phi$, $\pi_\theta$, $r_\psi$, $\pi_\alpha$ and $h_\omega$
 2: Initialize $\mathcal{B}$ and $\pi_\theta$ with unsupervised exploration
 3: Initialize preference data set $D \leftarrow \emptyset$
 4: for each iteration do
 5:     if episode is done then          ▷ Construct the combined policy $\pi$
 6:         $h \sim U(0, H)$
 7:         $\pi^{1:h} = \pi^{1:h}_{\nu \circ \alpha}$ and $\pi^{h+1:H} = \pi^{h+1:H}_\theta$
 8:     end if
 9:     Take action $a_t \sim \pi$ and collect $s_{t+1}$
10:     Store transition into dataset $\mathcal{B} \leftarrow \mathcal{B} \cup \{(s_t, a_t, r_\psi(s_t), s_{t+1})\}$
11:     if iteration % $K$ == 0 then
12:         for each query step do          ▷ Query preference
13:             Sample pair of trajectories $(\sigma^0, \sigma^1)$
14:             Query preference $y$ from manipulator
15:             Store preference data into dataset $D \leftarrow D \cup \{(\sigma^0, \sigma^1, y)\}$
16:         end for
17:         for each gradient step do          ▷ Update reward model
18:             Sample batch $\{(\sigma^0, \sigma^1, y)_i\}_{i=1}^n$ from $D$
19:             Optimize (2) to update $r_\psi$
20:         end for
21:     end if
22:     for each gradient step do
        ⋮
30:         Update $Q_\phi$ and $\pi_\theta$ according to (3) and (4), respectively.
31: end for
Output: adversarial policy $\pi_\alpha$

B DERIVATION OF THE GRADIENT OF THE OUTER-LEVEL LOSS

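In this section, we present the detailed derivation of the gradient of the outer loss $J_\pi$ with respect to the parameters $\omega$ of the weighting function. According to the chain rule, we can derive that

$$
\begin{aligned}
\nabla_\omega J_\pi(\alpha(\omega))\Big|_{\omega_t}
&= \frac{\partial J_\pi(\alpha(\omega))}{\partial \alpha(\omega)}\bigg|_{\alpha_t} \frac{\partial \alpha_t(\omega)}{\partial \omega}\bigg|_{\omega_t} \\
&= \frac{\partial J_\pi(\alpha(\omega))}{\partial \alpha(\omega)}\bigg|_{\alpha_t} \frac{\partial \alpha_t(\omega)}{\partial h(s;\omega)}\bigg|_{\omega_t} \frac{\partial h(s;\omega)}{\partial \omega}\bigg|_{\omega_t} \\
&= -\eta_t \frac{\partial J_\pi(\alpha(\omega))}{\partial \alpha(\omega)}\bigg|_{\alpha_t} \sum_{s\sim\mathcal{B}} \frac{\partial D_{KL}\left(\pi_{\nu\circ\alpha}(s) \,\|\, \pi_\theta(s)\right)}{\partial \alpha}\bigg|_{\alpha_t} \frac{\partial h(s;\omega)}{\partial \omega}\bigg|_{\omega_t} \\
&= -\eta_t \sum_{s\sim\mathcal{B}} \left( \frac{\partial J_\pi(\alpha(\omega))}{\partial \alpha(\omega)}\bigg|_{\alpha_t}^{\top} \frac{\partial D_{KL}\left(\pi_{\nu\circ\alpha}(s) \,\|\, \pi_\theta(s)\right)}{\partial \alpha}\bigg|_{\alpha_t} \right) \frac{\partial h(s;\omega)}{\partial \omega}\bigg|_{\omega_t}.
\end{aligned}
$$

For brevity of expression, we let

$$
f(s) = \frac{\partial J_\pi(\alpha(\omega))}{\partial \alpha(\omega)}\bigg|_{\alpha_t}^{\top} \frac{\partial D_{KL}\left(\pi_{\nu\circ\alpha}(s) \,\|\, \pi_\theta(s)\right)}{\partial \alpha}\bigg|_{\alpha_t}.
$$

The gradient of the outer-level optimization loss with respect to the parameters $\omega$ is then

$$
\nabla_\omega J_\pi(\alpha(\omega))\Big|_{\omega_t} = -\eta_t \sum_{s\sim\mathcal{B}} f(s) \cdot \frac{\partial h(s;\omega)}{\partial \omega}\bigg|_{\omega_t}.
$$

C CONNECTION BETWEEN RSA-MDP AND MDP

Lemma 1. Given an RSA-MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{B}, \mathcal{R}, \mathcal{P}, \gamma)$ and a fixed victim policy $\pi_\nu$, there exists an MDP $\hat{\mathcal{M}} = (\mathcal{S}, \hat{\mathcal{A}}, \hat{\mathcal{R}}, \hat{\mathcal{P}}, \gamma)$ such that the optimal policy of $\hat{\mathcal{M}}$ is equivalent to the optimal adversary $\pi_\alpha$ in the RSA-MDP given a fixed victim, where $\hat{\mathcal{A}} = \mathcal{S}$ and

$$
\hat{\mathcal{P}}(s' \mid s, \hat{a}) = \sum_{a \in \mathcal{A}} \pi_\nu(a \mid \hat{a})\, \mathcal{P}(s' \mid s, a) \quad \text{for } s, s' \in \mathcal{S} \text{ and } \hat{a} \in \hat{\mathcal{A}}.
$$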
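The induced transition kernel in Lemma 1 can be illustrated in a small tabular setting. The sketch below is only an illustration under assumptions: states and actions are finite and indexed by integers, and the nested-list layouts of `P` and `victim` are hypothetical conventions chosen for this example.

```python
from typing import List

Kernel = List[List[List[float]]]   # P[s][a][s_next] = P(s_next | s, a)
VictimTable = List[List[float]]    # victim[s_hat][a] = pi_nu(a | s_hat)


def induced_transition(P: Kernel, victim: VictimTable) -> Kernel:
    """Return P_hat with P_hat[s][s_hat][s_next] = sum_a victim[s_hat][a] * P[s][a][s_next].

    The adversary's action s_hat is the (perturbed) state shown to the victim,
    so the adversary's action space has the same size as the state space.
    """
    n_states = len(P)
    n_actions = len(P[0])
    return [
        [
            [
                sum(victim[s_hat][a] * P[s][a][s_next] for a in range(n_actions))
                for s_next in range(n_states)
            ]
            for s_hat in range(n_states)
        ]
        for s in range(n_states)
    ]
```

Each row `P_hat[s][s_hat]` is a convex combination of the rows `P[s][a]`, so it remains a valid probability distribution over next states whenever `P` and `victim` are row-stochastic.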

D THEORETICAL ANALYSIS AND PROOFS

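D.1 THEOREM 1: CONVERGENCE RATE OF THE OUTER LOSS

Lemma 2. (Lemma 1.2.3 in Nesterov (1998)) If the function $f(x)$ is Lipschitz smooth on $\mathbb{R}^n$ with constant $L$, then for all $x, y \in \mathbb{R}^n$ we have

$$
\left| f(y) - f(x) - f'(x)^{\top}(y - x) \right| \leq \frac{L}{2} \|y - x\|^2.
$$

Proof. For all $x, y \in \mathbb{R}^n$, we have

$$
\begin{aligned}
f(y) &= f(x) + \int_0^1 f'(x + \tau(y - x))^{\top}(y - x)\, d\tau \\
&= f(x) + f'(x)^{\top}(y - x) + \int_0^1 \left[ f'(x + \tau(y - x)) - f'(x) \right]^{\top}(y - x)\, d\tau.
\end{aligned}
$$

Then we can derive that

$$
\begin{aligned}
\left| f(y) - f(x) - f'(x)^{\top}(y - x) \right|
&= \left| \int_0^1 \left[ f'(x + \tau(y - x)) - f'(x) \right]^{\top}(y - x)\, d\tau \right| \\
&\leq \int_0^1 \left| \left[ f'(x + \tau(y - x)) - f'(x) \right]^{\top}(y - x) \right| d\tau \\
&\leq \int_0^1 \left\| f'(x + \tau(y - x)) - f'(x) \right\| \cdot \|y - x\|\, d\tau \\
&\leq \int_0^1 \tau L \|y - x\|^2\, d\tau = \frac{L}{2} \|y - x\|^2,
\end{aligned}
$$

where the first inequality holds because $\left| \int_a^b f(x)\, dx \right| \leq \int_a^b |f(x)|\, dx$, the second inequality follows from the Cauchy-Schwarz inequality, and the last inequality follows from the definition of Lipschitz smoothness.

Theorem 1. Suppose $J_\pi$ is Lipschitz-smooth with constant $L$, and the gradients of $J_\pi$ and $\mathcal{L}_{att}$ are bounded by $\rho$. Let the number of training iterations be $T$ and the inner-level optimization learning rate be $\eta_t = \min\{1, \frac{c_1}{T}\}$ for some constant $c_1 > 0$ where $\frac{c_1}{T} < 1$. Let the outer-level optimization learning rate be $\beta_t = \min\{\frac{1}{L}, \frac{c_2}{\sqrt{T}}\}$ for some constant $c_2 > 0$ where $c_2 \leq \frac{\sqrt{T}}{L}$, and $\sum_{t=1}^{\infty} \beta_t \leq \infty$, $\sum_{t=1}^{\infty} \beta_t^2 \leq \infty$. Then the convergence rate of $J_\pi$ achieves

$$
\min_{1 \leq t \leq T} \mathbb{E}\left[ \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\|^2 \right] \leq O\left( \frac{1}{\sqrt{T}} \right).
$$

Proof. First,

$$
\begin{aligned}
J_\pi(\alpha_{t+2}(\omega_{t+1})) - J_\pi(\alpha_{t+1}(\omega_t))
&= \left\{ J_\pi(\alpha_{t+2}(\omega_{t+1})) - J_\pi(\alpha_{t+1}(\omega_{t+1})) \right\} \\
&\quad + \left\{ J_\pi(\alpha_{t+1}(\omega_{t+1})) - J_\pi(\alpha_{t+1}(\omega_t)) \right\}.
\end{aligned}
\tag{17}
$$

Then we separately derive the two terms of (17).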
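For the first term,

$$
\begin{aligned}
& J_\pi(\alpha_{t+2}(\omega_{t+1})) - J_\pi(\alpha_{t+1}(\omega_{t+1})) \\
&\leq \nabla_\alpha J_\pi(\alpha_{t+1}(\omega_{t+1}))^{\top} \left( \alpha_{t+2}(\omega_{t+1}) - \alpha_{t+1}(\omega_{t+1}) \right) + \frac{L}{2} \left\| \alpha_{t+2}(\omega_{t+1}) - \alpha_{t+1}(\omega_{t+1}) \right\|^2 \\
&\leq \left\| \nabla_\alpha J_\pi(\alpha_{t+1}(\omega_{t+1})) \right\| \cdot \left\| \alpha_{t+2}(\omega_{t+1}) - \alpha_{t+1}(\omega_{t+1}) \right\| + \frac{L}{2} \left\| \alpha_{t+2}(\omega_{t+1}) - \alpha_{t+1}(\omega_{t+1}) \right\|^2 \\
&\leq \rho \cdot \left\| -\eta_{t+1} \nabla_\alpha \mathcal{L}_{att}(\alpha_{t+1}) \right\| + \frac{L}{2} \left\| -\eta_{t+1} \nabla_\alpha \mathcal{L}_{att}(\alpha_{t+1}) \right\|^2 \\
&\leq \eta_{t+1} \rho^2 + \frac{L}{2} \eta_{t+1}^2 \rho^2,
\end{aligned}
$$

where $\alpha_{t+2}(\omega_{t+1}) - \alpha_{t+1}(\omega_{t+1}) = -\eta_{t+1} \nabla_\alpha \mathcal{L}_{att}(\alpha_{t+1})$, the first inequality follows from Lemma 2, the second inequality follows from the Cauchy-Schwarz inequality, the third inequality holds because $\|\nabla_\alpha J_\pi(\alpha_{t+1}(\omega_{t+1}))\| \leq \rho$, and the last inequality holds because $\|\nabla_\alpha \mathcal{L}_{att}(\alpha_{t+1})\| \leq \rho$.

It can be proved that the gradient of $J_\pi$ with respect to $\omega$ is Lipschitz continuous, and we assume the Lipschitz constant is $L$. Therefore, for the second term,

$$
\begin{aligned}
& J_\pi(\alpha_{t+1}(\omega_{t+1})) - J_\pi(\alpha_{t+1}(\omega_t)) \\
&\leq \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t))^{\top} (\omega_{t+1} - \omega_t) + \frac{L}{2} \left\| \omega_{t+1} - \omega_t \right\|^2 \\
&= -\beta_t \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t))^{\top} \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) + \frac{L \beta_t^2}{2} \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\|^2 \\
&= -\left( \beta_t - \frac{L \beta_t^2}{2} \right) \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\|^2,
\end{aligned}
$$

where $\omega_{t+1} - \omega_t = -\beta_t \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t))$, and the first inequality follows from Lemma 2. Therefore, (17) becomes

$$
J_\pi(\alpha_{t+2}(\omega_{t+1})) - J_\pi(\alpha_{t+1}(\omega_t)) \leq \eta_{t+1} \rho^2 + \frac{L}{2} \eta_{t+1}^2 \rho^2 - \left( \beta_t - \frac{L \beta_t^2}{2} \right) \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\|^2.
\tag{20}
$$

Rearranging the terms of (20), we obtain

$$
\left( \beta_t - \frac{L \beta_t^2}{2} \right) \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\|^2 \leq J_\pi(\alpha_{t+1}(\omega_t)) - J_\pi(\alpha_{t+2}(\omega_{t+1})) + \eta_{t+1} \rho^2 + \frac{L}{2} \eta_{t+1}^2 \rho^2.
\tag{21}
$$

Then, we sum up both sides of (21) from $t = 1$ to $T$:

$$
\begin{aligned}
\sum_{t=1}^{T} \left( \beta_t - \frac{L \beta_t^2}{2} \right) \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\|^2
&\leq J_\pi(\alpha_2(\omega_1)) - J_\pi(\alpha_{T+2}(\omega_{T+1})) + \sum_{t=1}^{T} \left( \eta_{t+1} \rho^2 + \frac{L}{2} \eta_{t+1}^2 \rho^2 \right) \\
&\leq J_\pi(\alpha_2(\omega_1)) + \sum_{t=1}^{T} \left( \eta_{t+1} \rho^2 + \frac{L}{2} \eta_{t+1}^2 \rho^2 \right).
\end{aligned}
\tag{22}
$$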
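Therefore,

$$
\begin{aligned}
\min_{1 \leq t \leq T} \mathbb{E}\left[ \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\|^2 \right]
&\leq \frac{\sum_{t=1}^{T} \left( \beta_t - \frac{L \beta_t^2}{2} \right) \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\|^2}{\sum_{t=1}^{T} \left( \beta_t - \frac{L \beta_t^2}{2} \right)} \\
&\leq \frac{1}{\sum_{t=1}^{T} (2\beta_t - L\beta_t^2)} \left[ 2 J_\pi(\alpha_2(\omega_1)) + \sum_{t=1}^{T} \left( 2\eta_{t+1} \rho^2 + L \eta_{t+1}^2 \rho^2 \right) \right] \\
&\leq \frac{1}{\sum_{t=1}^{T} \beta_t} \left[ 2 J_\pi(\alpha_2(\omega_1)) + \sum_{t=1}^{T} \eta_{t+1} \rho^2 (2 + L \eta_{t+1}) \right] \\
&\leq \frac{1}{T \beta_t} \left[ 2 J_\pi(\alpha_2(\omega_1)) + T \eta_{t+1} \rho^2 (2 + L) \right] \\
&= \frac{2 J_\pi(\alpha_2(\omega_1))}{T \beta_t} + \frac{\eta_{t+1} \rho^2 (2 + L)}{\beta_t} \\
&= \frac{2 J_\pi(\alpha_2(\omega_1))}{T} \max\left\{ L, \frac{\sqrt{T}}{c_2} \right\} + \min\left\{ 1, \frac{c_1}{T} \right\} \max\left\{ L, \frac{\sqrt{T}}{c_2} \right\} \rho^2 (2 + L) \\
&\leq \frac{2 J_\pi(\alpha_2(\omega_1))}{c_2 \sqrt{T}} + \frac{c_1 \rho^2 (2 + L)}{c_2 \sqrt{T}} \\
&= O\left( \frac{1}{\sqrt{T}} \right),
\end{aligned}
$$

where the second inequality holds according to (22), and the third inequality holds because $\beta_t \leq \frac{1}{L}$ implies $2\beta_t - L\beta_t^2 \geq \beta_t$.

Theorem 2. Suppose $J_\pi$ is Lipschitz-smooth with constant $L$, and the gradients of $J_\pi$ and $\mathcal{L}_{att}$ are bounded by $\rho$. Let the number of training iterations be $T$ and the inner-level optimization learning rate be $\eta_t = \min\{1, \frac{c_1}{T}\}$ for some constant $c_1 > 0$ where $\frac{c_1}{T} < 1$. Let the outer-level optimization learning rate be $\beta_t = \min\{\frac{1}{L}, \frac{c_2}{\sqrt{T}}\}$ for some constant $c_2 > 0$ where $c_2 \leq \frac{\sqrt{T}}{L}$, and $\sum_{t=1}^{\infty} \beta_t \leq \infty$, $\sum_{t=1}^{\infty} \beta_t^2 \leq \infty$. Then $\mathcal{L}_{att}$ achieves

$$
\lim_{t \to \infty} \mathbb{E}\left[ \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\|^2 \right] = 0.
$$

Proof. First,

$$
\begin{aligned}
\mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) - \mathcal{L}_{att}(\alpha_t; \omega_t)
&= \left\{ \mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) - \mathcal{L}_{att}(\alpha_{t+1}; \omega_t) \right\} \\
&\quad + \left\{ \mathcal{L}_{att}(\alpha_{t+1}; \omega_t) - \mathcal{L}_{att}(\alpha_t; \omega_t) \right\}.
\end{aligned}
\tag{25}
$$

For the first term in (25),

$$
\begin{aligned}
& \mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) - \mathcal{L}_{att}(\alpha_{t+1}; \omega_t) \\
&\leq \nabla_\omega \mathcal{L}_{att}(\alpha_{t+1}; \omega_t)^{\top} (\omega_{t+1} - \omega_t) + \frac{L}{2} \left\| \omega_{t+1} - \omega_t \right\|^2 \\
&= -\beta_t \nabla_\omega \mathcal{L}_{att}(\alpha_{t+1}; \omega_t)^{\top} \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) + \frac{L \beta_t^2}{2} \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\|^2,
\end{aligned}
$$

where $\omega_{t+1} - \omega_t = -\beta_t \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t))$, and the first inequality holds according to Lemma 2. For the second term in (25),

$$
\begin{aligned}
& \mathcal{L}_{att}(\alpha_{t+1}; \omega_t) - \mathcal{L}_{att}(\alpha_t; \omega_t) \\
&\leq \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t)^{\top} (\alpha_{t+1} - \alpha_t) + \frac{L}{2} \left\| \alpha_{t+1} - \alpha_t \right\|^2 \\
&= -\eta_t \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t)^{\top} \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) + \frac{L \eta_t^2}{2} \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\|^2 \\
&= -\left( \eta_t - \frac{L \eta_t^2}{2} \right) \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\|^2,
\end{aligned}
$$

where $\alpha_{t+1} - \alpha_t = -\eta_t \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t)$, and the first inequality holds according to Lemma 2.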
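Therefore, (25) becomes

$$
\begin{aligned}
\mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) - \mathcal{L}_{att}(\alpha_t; \omega_t)
&\leq -\beta_t \nabla_\omega \mathcal{L}_{att}(\alpha_{t+1}; \omega_t)^{\top} \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) + \frac{L \beta_t^2}{2} \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\|^2 \\
&\quad - \left( \eta_t - \frac{L \eta_t^2}{2} \right) \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\|^2.
\end{aligned}
\tag{28}
$$

Taking the expectation of both sides of (28) and rearranging the terms, we obtain

$$
\begin{aligned}
& \eta_t \mathbb{E}\left[ \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\|^2 \right] + \beta_t \mathbb{E}\left[ \left\| \nabla_\omega \mathcal{L}_{att}(\alpha_{t+1}; \omega_t) \right\| \cdot \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\| \right] \\
&\leq \mathbb{E}\left[ \mathcal{L}_{att}(\alpha_t; \omega_t) \right] - \mathbb{E}\left[ \mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) \right] + \frac{L \beta_t^2}{2} \mathbb{E}\left[ \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\|^2 \right] + \frac{L \eta_t^2}{2} \mathbb{E}\left[ \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\|^2 \right].
\end{aligned}
\tag{29}
$$

Summing up both sides of (29) from $t = 1$ to $\infty$,

$$
\begin{aligned}
& \sum_{t=1}^{\infty} \eta_t \mathbb{E}\left[ \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\|^2 \right] + \sum_{t=1}^{\infty} \beta_t \mathbb{E}\left[ \left\| \nabla_\omega \mathcal{L}_{att}(\alpha_{t+1}; \omega_t) \right\| \cdot \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\| \right] \\
&\leq \mathbb{E}\left[ \mathcal{L}_{att}(\alpha_1; \omega_1) \right] - \lim_{t \to \infty} \mathbb{E}\left[ \mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) \right] + \sum_{t=1}^{\infty} \frac{L \beta_t^2}{2} \mathbb{E}\left[ \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\|^2 \right] + \sum_{t=1}^{\infty} \frac{L \eta_t^2}{2} \mathbb{E}\left[ \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\|^2 \right] \\
&\leq \sum_{t=1}^{\infty} \frac{L (\eta_t^2 + \beta_t^2) \rho^2}{2} + \mathbb{E}\left[ \mathcal{L}_{att}(\alpha_1; \omega_1) \right] \leq \infty,
\end{aligned}
$$

where the second inequality holds because $\sum_{t=1}^{\infty} \eta_t^2 \leq \infty$, $\sum_{t=1}^{\infty} \beta_t^2 \leq \infty$, $\|\nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t)\| \leq \rho$, and $\|\nabla_\omega J_\pi(\alpha_{t+1}(\omega_t))\| \leq \rho$. Since

$$
\sum_{t=1}^{\infty} \beta_t \mathbb{E}\left[ \left\| \nabla_\omega \mathcal{L}_{att}(\alpha_{t+1}; \omega_t) \right\| \cdot \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\| \right] \leq L \rho \sum_{t=1}^{\infty} \beta_t \leq \infty,
$$

we have

$$
\sum_{t=1}^{\infty} \eta_t \mathbb{E}\left[ \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\|^2 \right] < \infty.
$$

Since $\left| (\|a\| + \|b\|)(\|a\| - \|b\|) \right| \leq \|a + b\| \|a - b\|$, we can derive that

$$
\begin{aligned}
& \left| \mathbb{E}\left[ \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) \right\|^2 \right] - \mathbb{E}\left[ \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\|^2 \right] \right| \\
&= \left| \mathbb{E}\left[ \left( \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) \right\| + \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\| \right) \left( \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) \right\| - \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\| \right) \right] \right| \\
&\leq \mathbb{E}\left[ \left| \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) \right\| + \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\| \right| \cdot \left| \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) \right\| - \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\| \right| \right] \\
&\leq \mathbb{E}\left[ \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) + \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\| \cdot \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) - \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\| \right] \\
&\leq \mathbb{E}\left[ \left( \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) \right\| + \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\| \right) \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_{t+1}; \omega_{t+1}) - \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\| \right] \\
&\leq 2 L \rho\, \mathbb{E}\left[ \left\| (\alpha_{t+1}, \omega_{t+1}) - (\alpha_t, \omega_t) \right\| \right] \\
&\leq 2 L \rho\, \eta_t \beta_t\, \mathbb{E}\left[ \left\| \left( \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t),\ \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right) \right\| \right] \\
&\leq 2 L \rho\, \eta_t \beta_t \sqrt{ \mathbb{E}\left[ \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\|^2 \right] + \mathbb{E}\left[ \left\| \nabla_\omega J_\pi(\alpha_{t+1}(\omega_t)) \right\|^2 \right] } \\
&\leq 2 L \rho\, \eta_t \beta_t \sqrt{2 \rho^2} \\
&\leq 2 \sqrt{2} L \rho^2 \eta_t \beta_t.
\end{aligned}
\tag{33}
$$

Since $\sum_{t=1}^{\infty} \eta_t = \infty$, according to Lemma 3, we have $\lim_{t \to \infty} \mathbb{E}\left[ \left\| \nabla_\alpha \mathcal{L}_{att}(\alpha_t; \omega_t) \right\|^2 \right] = 0$.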
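As a numeric illustration of the step sizes used in Theorems 1 and 2, the sketch below evaluates $\eta_t$, $\beta_t$, and the quantity $\frac{2 J_\pi(\alpha_2(\omega_1))}{T \beta_t} + \frac{\eta_{t+1} \rho^2 (2 + L)}{\beta_t}$ that appears near the end of the proof of Theorem 1. All constants passed in are hypothetical values chosen for illustration, and `j0` is an assumed placeholder for $J_\pi(\alpha_2(\omega_1))$.

```python
import math
from typing import Tuple


def step_sizes(T: int, L: float, c1: float, c2: float) -> Tuple[float, float]:
    """Return (eta, beta) with eta = min{1, c1/T} and beta = min{1/L, c2/sqrt(T)}."""
    eta = min(1.0, c1 / T)
    beta = min(1.0 / L, c2 / math.sqrt(T))
    return eta, beta


def outer_bound(T: int, L: float, rho: float, c1: float, c2: float, j0: float) -> float:
    """Evaluate 2*j0/(T*beta) + eta*rho^2*(2+L)/beta for constant step sizes.

    j0 is a hypothetical placeholder for the initial outer loss value.
    """
    eta, beta = step_sizes(T, L, c1, c2)
    return 2.0 * j0 / (T * beta) + eta * rho ** 2 * (2.0 + L) / beta
```

When $\frac{c_1}{T} < 1$ and $c_2 \leq \frac{\sqrt{T}}{L}$, this quantity equals $\frac{2 J_\pi(\alpha_2(\omega_1)) + c_1 \rho^2 (2 + L)}{c_2 \sqrt{T}}$, so quadrupling $T$ halves it.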

E DETAILS OF PBRL

In this section, we present details of the scripted teacher and preference collection, which are a crucial part of PbRL. PALM follows the same settings as Lee et al. (2021a).

Scripted Teacher. To evaluate performance systematically, we use a scripted teacher that provides preferences between a pair of the agent's trajectory segments according to the oracle reward function. Preference labels from a human teacher would be ideal, but they make it hard to evaluate algorithms quantitatively and quickly. The scripted teacher, in contrast, can immediately provide ground-truth rewards based on the state s and action a. It is a function designed to approximate the human's intention.

Preference Collection. During training, we query preference labels at regular intervals. At each query, we sample a batch of segment pairs and calculate the cumulative reward of each segment using the rewards provided by the scripted teacher. For each segment pair, the teacher prefers the segment with the larger cumulative reward: that segment is labelled with 1, and the other is labelled with 0. As for the computational cost, if M preference labels are required in a run and the segment length is N, the time complexity is O(MN). This cost is negligible compared with adversary training, which involves complex gradient computation.
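The labelling rule of the scripted teacher can be sketched as follows. This is a minimal illustration under assumptions rather than the code used in our experiments: `reward_fn` is a hypothetical oracle reward, states and actions are scalars, and breaking ties in favor of the first segment is an arbitrary choice made only for this example.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[float, float]            # (state, action), scalars for illustration
Segment = Sequence[Step]
RewardFn = Callable[[float, float], float]


def cumulative_reward(segment: Segment, reward_fn: RewardFn) -> float:
    """Sum the oracle reward over one segment (N reward evaluations)."""
    return sum(reward_fn(s, a) for s, a in segment)


def label_pair(seg0: Segment, seg1: Segment, reward_fn: RewardFn) -> Tuple[int, int]:
    """Return (label of seg0, label of seg1): 1 for the preferred segment, 0 otherwise.

    Ties go to seg0, which is an assumption of this sketch.
    """
    r0 = cumulative_reward(seg0, reward_fn)
    r1 = cumulative_reward(seg1, reward_fn)
    return (1, 0) if r0 >= r1 else (0, 1)


def collect_preferences(
    pairs: Sequence[Tuple[Segment, Segment]], reward_fn: RewardFn
) -> List[Tuple[Segment, Segment, Tuple[int, int]]]:
    """Label M segment pairs; the cost is O(M * N) reward evaluations."""
    return [(s0, s1, label_pair(s0, s1, reward_fn)) for s0, s1 in pairs]
```

Each pair requires one pass over two segments of length N, which is where the O(MN) cost stated above comes from.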

F EXPERIMENTAL DETAILS

In this section, we provide a concrete description of our experiments and the detailed hyper-parameters of PALM. Each experimental run is trained on GeForce RTX 2080 Ti GPUs.

F.1 TASKS

We conduct experiments on eight robotic manipulation tasks from Meta-world (Yu et al., 2020) in phase one and two locomotion tasks from Mujoco (Todorov et al., 2012) in phase two. The tasks we used are:

Meta-world

• Door Lock: An agent controls a simulated Sawyer arm to lock the door. 

G EXTENSIVE EXPERIMENTS

Impact of Feedback Amount. We investigate the performance of PALM with different numbers of preference labels. Table 5 shows the results of all methods with various numbers of labels: {3000, 5000, 7000, 9000} on Drawer Open for the navigation scenario and {1000, 3000, 5000, 7000} on Faucet Close for the opposite behavior scenario. From the results in Table 5, we conclude that sufficient human feedback yields a strong adversary and stabilizes the attack success rate. The performance of PALM improves as the number of preference labels increases, indicating that the number of labels has an essential impact on adversary learning. However, the performance of SA-RL and PA-AD is poor even with enough human feedback, and PA-AD completely fails in the navigation scenario. The reason is that the exploration space of these two methods is limited by the fixed victim policy, while PALM achieves better exploration by introducing an intention policy.

Table 5: Success rate with various amounts of preference labels on Drawer Open for the navigation scenario and Faucet Close for the opposite behavior scenario. We report the mean and the standard deviation of the success rate over 30 episodes.

Impact of Different Attack Budgets. We also analyze the effect of the attack budget. We conduct additional experiments with various attack budgets: {0.05, 0.075, 0.1, 0.15} on Drawer Open and {0.02, 0.05, 0.075, 0.1} on Faucet Close for these two scenarios. In Figure 9, we report the performance of the baselines and PALM with various attack budgets. As the experimental results show, the performance of all methods improves as the attack budget increases.



1 https://github.com/rll-research/BPref
2 https://github.com/umd-huang-lab/paad_adv_rl
3 https://huggingface.co/edbeeching
4 https://huggingface.co/edbeeching/decision-transformer-gym-halfcheetah-expert
5 https://huggingface.co/edbeeching/decision-transformer-gym-walker2d-expert



Figure 1: Illustration of a targeted attack by PALM. The adversary first receives the true state s from the environment and perturbs it. The victim then observes the perturbed state and takes an action according to it.

Figure 2: Framework of PALM. PALM jointly learns an intention policy, an adversary, and a weighting function under a bi-level optimization framework. At the inner level, the adversary is optimized to approach the intention policy, which learns via preference-based RL. At the outer level, the weighting function is updated to maximize the performance of the adversary as evaluated by the outer loss.

Figure 3: Diagram of preference-based RL.

Figure 4: Training curves of different methods on various manipulation tasks in the navigation scenario. The solid line and shaded area denote the mean and the standard deviation of success rate, respectively, over ten runs.

Figure 6: Human-desired behaviors exhibited by the Decision Transformer under the attack of PALM.

Figure 7: A visualization of the exploration space of PALM (red) and PALM without intention policy (blue). The green point denotes the start and the yellow star denotes the target position.

Figure 8: (a) A visualization of the weights of trajectories of different qualities collected by five different policies. (b) Trajectory weights generated by the weighting function from different policies, extracted and visualized with t-SNE. (c) A heat map showing the weight distribution and the trajectory of the perturbed agent in 2D coordinates. The red point denotes the start position and the yellow star indicates the targeted position.

        random mini-batch transitions from $\mathcal{B}$
28:       Optimize $h_\omega$: minimize (7) with respect to $\omega$ ▷ Outer loss optimization
29:     end if

D.2 THEOREM 2: CONVERGENCE OF THE INNER LOSS

Lemma 3. (Lemma A.5 in Mairal (2013)) Let $(a_n)_{n \geq 1}$ and $(b_n)_{n \geq 1}$ be two non-negative real sequences such that the series $\sum_{n=1}^{\infty} a_n$ diverges, the series $\sum_{n=1}^{\infty} a_n b_n$ converges, and there exists $C > 0$ such that $|b_{n+1} - b_n| \leq C a_n$. Then, the sequence $(b_n)_{n \geq 1}$ converges to 0.

Effects of each component. The success rate on four simulated robotic manipulation tasks from Meta-world. The results are the average success rate across five runs.

Hyper-parameters of SAC for victim training.

Mujoco. We directly use well-trained models to demonstrate the vulnerability of the Decision Transformer. Specifically, we use the expert-level Cheetah agent (footnote 4) and the expert-level Walker agent (footnote 5).


• Door Unlock: An agent controls a simulated Sawyer arm to unlock the door.
• Drawer Open: An agent controls a simulated Sawyer arm to open the drawer to a target position.
• Drawer Close: An agent controls a simulated Sawyer arm to close the drawer to a target position.
• Faucet Open: An agent controls a simulated Sawyer arm to open the faucet to a target position.
• Faucet Close: An agent controls a simulated Sawyer arm to close the faucet to a target position.
• Window Open: An agent controls a simulated Sawyer arm to open the window to a target position.
• Window Close: An agent controls a simulated Sawyer arm to close the window to a target position.

Mujoco

• Half Cheetah: A 2-dimensional robot with nine links and eight joints aims to learn to run forward (right) as fast as possible.
• Walker: A 2-dimensional two-legged robot aims to move in the forward (right) direction.

F.2 HYPER-PARAMETERS SETTING

We choose PEBBLE as ... For SA-RL (Zhang et al., 2021), we keep the same parameter setting and use the same neural network structure as ours. The detailed hyper-parameters of SA-RL are shown in Table 3. For PA-AD (Sun et al., 2022), all hyper-parameters are the same as those of SA-RL.

F.3 VICTIM SETTING

Our experiment consists of two phases. In the first phase, we use various simulated robotic manipulation tasks from Meta-world. In the second phase, we use two OpenAI Gym MuJoCo continuous control environments.

Meta-world. We train the victims on Meta-world using SAC (Haarnoja et al., 2018) with the original fully connected neural network as the policy. Detailed hyper-parameters are shown in Table 4.

