PALM: PREFERENCE-BASED ADVERSARIAL MANIPULATION AGAINST DEEP REINFORCEMENT LEARNING

Abstract

Deep reinforcement learning (DRL) methods are vulnerable to adversarial attacks such as perturbing observations with imperceptible noise. To improve the robustness of DRL agents, it is important to study their vulnerability under adversarial attacks that would lead to extreme behaviors desired by adversaries. Preference-based RL (PbRL) aims to learn desired behaviors from human preferences. In this paper, we propose PALM, a preference-based adversarial manipulation method against DRL agents, which adopts human preferences to perform targeted attacks with the assistance of an intention policy and a weighting function. The intention policy is trained within the PbRL framework to guide the adversarial policy and mitigate the restrictions of the victim policy during exploration, and the weighting function learns a weight assignment to improve the performance of the adversarial policy. Theoretical analysis demonstrates that PALM converges to critical points under some mild conditions. Empirical results on several manipulation tasks of Meta-world show that PALM exceeds the performance of state-of-the-art adversarial attack methods under the targeted setting. Additionally, we show the vulnerability of offline RL agents by fooling them into behaving as humans desire on several MuJoCo tasks. Our code and videos are available at https://sites.google.com/view/palm-adversarial-attack.

1. INTRODUCTION

Adversarial examples in image classifiers have prompted a new field studying the vulnerability of deep neural networks (DNNs). Recent research demonstrates that reinforcement learning (RL) agents parameterized by DNNs also show vulnerability under adversarial attacks (Huang et al., 2017; Pattanaik et al., 2018; Zhang et al., 2020; 2021; Sun et al., 2022). Adversaries generate imperceptible perturbations to the observations of the victim agent, making the agent fail to complete its original task. While adversarial attack is a crucial approach to evaluating the vulnerability of agents, targeted attacks in RL have received little attention. Recently, embodied intelligence (Gupta et al., 2021; Liu et al., 2022; Ahn et al., 2022; Reed et al., 2022; Fan et al., 2022) has been considered a meaningful way to improve the cognitive ability of artificial intelligence. These embodied agents show powerful capabilities while possibly exposing vulnerabilities. Therefore, we ask: how can one manipulate an agent to perform desired behaviors, and are embodied agents robust to adversarial manipulations? To achieve targeted adversarial attacks, one straightforward way is to design corresponding rewards for the adversary. However, specifying a precise reward function can be challenging; for example, it is difficult in the game of Go to craft a reward function that identifies the quality of each move. In the preference-based RL framework, a human only needs to provide binary preference labels over two trajectories of the agent (Christiano et al., 2017). Compared to reward engineering, preference-based RL is an easier way to learn policies through human preferences.
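To make the preference-based RL machinery concrete, the standard objective of Christiano et al. (2017) fits a reward model so that segment returns explain the human's binary labels via the Bradley-Terry model. The sketch below is a minimal illustration; the network architecture and all names (`RewardModel`, `preference_loss`) are our own, not from the PALM paper.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Learned reward r(s, a), trained only from binary preference labels."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def preference_loss(reward_model, seg0, seg1, label):
    """Bradley-Terry cross-entropy over two trajectory segments.

    seg0, seg1: (obs, act) tensors of shape (T, dim);
    label: 1.0 if the human prefers seg1, 0.0 if seg0 is preferred.
    """
    r0 = reward_model(*seg0).sum()  # predicted return of segment 0
    r1 = reward_model(*seg1).sum()  # predicted return of segment 1
    # P[seg1 preferred] = exp(r1) / (exp(r0) + exp(r1)) = sigmoid(r1 - r0)
    logits = (r1 - r0).unsqueeze(0)
    return nn.functional.binary_cross_entropy_with_logits(
        logits, torch.tensor([label]))
```

Minimizing this loss over a buffer of labeled segment pairs recovers a reward function whose greedy policy reproduces the human-preferred behavior, which is exactly the signal a targeted attacker needs in place of a hand-engineered reward.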
Meanwhile, recent research on preference-based RL shows an excellent capacity to learn novel behaviors with few preference labels (Lee et al., 2021a; Park et al., 2022; Liang et al., 2022), significantly improving feedback efficiency. Motivated by this, we consider using preference-based RL to perform targeted adversarial attacks from the following perspectives. On the one hand, it is difficult to define the desired behaviors needed to achieve targeted attacks, but humans can implicitly inject intentions by providing binary preferences. On the other hand, preference-based RL is a data-efficient manipulation method because a few preference labels are enough to learn well-behaved policies. As shown in Figure 1, PALM can recover the reward function of desired behaviors via the preference-based RL framework and manipulate agents into performing human-desired behaviors. In this paper, we propose the Preference-based AdversariaL Manipulation (PALM) algorithm, which performs targeted attacks from human preferences. PALM includes an adversary that perturbs the observations of the victim. To achieve targeted attacks and better exploration, we introduce an intention policy, learned from human preferences, as the learning target of the adversary. Additionally, we utilize a weighting function to assist adversary learning by re-weighting state samples. In summary, our contributions are three-fold. Firstly, to the best of our knowledge, we propose the first targeted adversarial attack method against DRL agents via preference-based RL. Secondly, we theoretically analyze PALM and provide a convergence guarantee under some mild conditions. Lastly, we design two scenarios, and experiments on Meta-world demonstrate that PALM outperforms the baselines by a large margin. Empirical results demonstrate that both online and offline RL agents are vulnerable to our proposed adversarial attacks.
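The interaction between the three components (adversary, intention policy, weighting function) can be sketched schematically. The update below is only an illustration built from the high-level description in this section: the actual PALM objectives, the bounded-perturbation parameterization, and the weighted imitation loss shown here are assumptions, not the paper's exact method.

```python
import torch

def palm_training_step(adversary, intention_policy, weight_fn, victim,
                       obs, epsilon=0.05):
    """One schematic PALM update on a batch of observations (illustrative).

    adversary: maps obs -> unbounded perturbation logits (the attack policy).
    intention_policy: policy learned via preference-based RL; its actions
        serve as the behavior target for the attacked victim.
    weight_fn: maps obs -> per-state weight, re-weighting state samples.
    """
    # Keep the perturbation inside an L-infinity ball of radius epsilon.
    delta = epsilon * torch.tanh(adversary(obs))
    perturbed_obs = obs + delta
    victim_action = victim(perturbed_obs)           # victim acts on attacked input
    target_action = intention_policy(obs).detach()  # human-desired behavior
    # Weighted distance between the victim's action under attack and the
    # intention policy's action: a small loss means successful manipulation.
    per_state = ((victim_action - target_action) ** 2).sum(dim=-1)
    loss = (weight_fn(obs).squeeze(-1) * per_state).mean()
    return loss
```

The key design point this sketch conveys is that the victim's parameters are never touched: only the observation perturbation is optimized, so manipulation succeeds purely through the input channel.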

2. RELATED WORK

Many previous works on adversarial attacks study the vulnerability of DRL agents. Huang et al. (2017) computes adversarial perturbations via FGSM (Goodfellow et al., 2015) to mislead the victim policy into not choosing the optimal action. Pattanaik et al. (2018) presents an approach that leads the victim to select the worst action based on the victim's Q-function. Gleave et al. (2020) conducts adversarial attacks under a two-player Markov game instead of perturbing the agent's observation. Zhang et al. (2020) proposes the state-adversarial MDP (SA-MDP) and develops two adversarial attack methods named Robust Sarsa (RS) and Maximal Action Difference (MAD). SA-RL (Zhang et al., 2021) directly optimizes the adversary policy to perturb states in an end-to-end RL fashion. PA-AD (Sun et al., 2022) designs an RL-based "director" to find the optimal policy-perturbing direction and constructs an optimization-based "actor" to craft perturbed states according to the given direction. Untargeted adversarial attack methods focus on making the victim policy fail, while our approach emphasizes manipulating the victim policy; that is to say, the victim's behaviors are consistent with the preferences of the manipulator under attack. Another line of works (Pinto et al., 2017; Mandlekar et al., 2017; Pattanaik et al., 2018)



Figure 1: Illustration of a targeted attack by PALM. The adversary first receives the true state s from the environment and perturbs it into s̃. Then the victim observes s̃ and takes an action according to it.

consider using adversarial examples to improve the robustness of policies, although this is out of the scope of this paper. There are a few prior works that focus on targeted attacks on RL agents. Lin et al. (2017) first proposes a targeted adversarial attack method against DRL agents, which attacks the agent so that it reaches a targeted state. Buddareddygari et al. (2022) also present a strategy to mislead the agent toward a specific state by placing an object in the environment. The hijacking attack (Boloor et al., 2020) is proposed to attack agents to perform targeted actions on autonomous driving systems. Hussenot et al. (2019) provides a new perspective that attacks the agent to imitate a target policy; our method differs in that PALM manipulates the victim to behave as humans desire and focuses on preference-based RL. Xiao et al. (2019) proposes the first adversarial attack method against a real-world visual navigation robot. Lee et al. (2021b) investigates targeted adversarial attacks against the action space of the agent. In contrast, PALM leverages preference-based RL to avoid reward engineering.
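For reference, the FGSM-based observation attack of Huang et al. (2017) discussed above admits a compact sketch: take one signed gradient step on the observation so as to reduce the probability of the action the clean policy would have taken. This minimal version assumes a differentiable policy returning action logits; the function name and interface are illustrative.

```python
import torch

def fgsm_observation_attack(policy_logits_fn, obs, epsilon=0.01):
    """Untargeted FGSM attack on an agent's observation.

    policy_logits_fn: maps an observation batch to action logits.
    Returns an observation within an L-infinity ball of radius epsilon
    chosen to decrease the probability of the originally selected action.
    """
    obs = obs.clone().detach().requires_grad_(True)
    logits = policy_logits_fn(obs)
    action = logits.argmax(dim=-1)  # action the clean policy would take
    # Ascend the loss of the originally optimal action.
    loss = torch.nn.functional.cross_entropy(logits, action)
    loss.backward()
    return (obs + epsilon * obs.grad.sign()).detach()
```

Note the contrast with PALM's setting: this attack only degrades the victim's behavior, whereas a targeted manipulation must steer the victim toward a specific, human-preferred behavior.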

