ADVERSARIAL CHEAP TALK

Abstract

Adversarial attacks in reinforcement learning (RL) often assume highly privileged access to the victim's parameters, environment, or data. Instead, this paper proposes a novel adversarial setting called a Cheap Talk MDP in which an Adversary can merely append deterministic messages to the Victim's observation, resulting in a minimal range of influence. The Adversary cannot occlude ground truth, influence underlying environment dynamics or reward signals, introduce non-stationarity, add stochasticity, see the Victim's actions, or access their parameters. Additionally, we present a simple meta-learning algorithm called Adversarial Cheap Talk (ACT) to train Adversaries in this setting. We demonstrate that an Adversary trained with ACT can still significantly influence the Victim's training and testing performance, despite the highly constrained setting. Affecting train-time performance reveals a new attack vector and provides insight into the success and failure modes of existing RL algorithms. More specifically, we show that an ACT Adversary is capable of harming performance by interfering with the learner's function approximation, or instead helping the Victim's performance by outputting useful features. Finally, we show that an ACT Adversary can manipulate messages during train-time to directly and arbitrarily control the Victim at test-time.

1. INTRODUCTION

Learning agents are often trained in settings where adversaries are not able to influence underlying environment dynamics or reward signals, but may influence part of the agent's observations. For instance, adversaries may append arbitrary tags to content that will be used to train recommender systems. Similarly, they may rent space on interactive billboards near busy traffic intersections to influence data sets used for training self-driving cars. In financial markets, adversaries can alter the state of the order-book by submitting orders far out of the money (essentially for free). While these features are 'useless' from an information-theoretic point of view, it is common practice in end-to-end deep learning to include them as part of the input and let the model learn which features matter. For instance, self-driving cars typically do not omit useless parts of the visual input but instead learn to ignore them through training. Surprisingly, this paper demonstrates that an actor can still heavily influence the behaviour and performance of learning agents by controlling information only in these 'useless' channels, without knowing anything about the agent's parameters or training state. Most past work in adversarial RL assumes that adversaries can influence environment dynamics (Huang et al., 2017; Gleave et al., 2020). For example, perturbing images and observations could obscure or alter relevant information, such as the ball's location in a Pong game (Kos and Song, 2017). Furthermore, many attacks require access to the trained agent's weights and parameters to generate the adversarial inputs (Wang et al., 2021). Finally, most of these attacks only cause the victim's policy to fail arbitrarily instead of giving the adversary full control over the victim's policy at test time (Gu et al., 2017; Kiourti et al., 2020; Salem et al., 2020; Ashcraft and Karra, 2021; Zhang et al., 2021). Instead, we propose a novel minimum-viable setting called Cheap Talk MDPs.
Adversaries are only allowed to modify 'useless' features that are appended to the Victim's observation as a deterministic function of the current state. These features represent parts of an agent's observations that are unrelated to rewards or transition dynamics. In particular, our model applies to Adversaries adding tags to content in recommender systems, renting space on interactive billboards, or submitting orders far out of the money in financial markets. The setting is minimal in that Adversaries cannot use these features to occlude ground truth, influence environment dynamics or reward signals, inject stochasticity, introduce non-stationarity, see the Victim's actions, or access their parameters. Cheap Talk MDPs are formalised in Section 4, and we further justify minimality by proving in Proposition 1 that Adversaries cannot influence tabular Victims whatsoever in Cheap Talk MDPs. It follows that Adversaries can only influence Victims by interfering with their function approximator. We also prove more generally in Proposition 2 that Adversaries cannot prevent Victims with optimal convergence guarantees from converging to optimal rewards, even in non-tabular settings. Despite these restrictions, we show that Adversaries can still heavily influence agents parameterised by neural networks. In Section 5, we introduce a new meta-learning algorithm called Adversarial Cheap Talk (ACT) to train the Adversary. With an extensive set of experiments, we demonstrate in Section 6 that an ACT Adversary can manipulate a Victim to achieve a number of outcomes:

1. An ACT Adversary can prevent the Victim from solving a task, resulting in low rewards during training. We provide empirical evidence that the Adversary successfully sends messages which induce catastrophic interference in the Victim's neural network.
2. Conversely, an ACT Adversary can learn to send useful messages that improve the Victim's training process, resulting in higher rewards during training.
3. Finally, we introduce a training scheme that allows the ACT Adversary to directly and arbitrarily control the Victim at test-time.
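The constraints above can be made concrete as a small environment wrapper. The sketch below is illustrative and not the paper's implementation; the names (`CheapTalkWrapper`, `adversary_fn`) are our own. The adversary's message is a deterministic function of the current observation, is merely concatenated onto it, and has no effect whatsoever on transitions or rewards.

```python
import numpy as np

class CheapTalkWrapper:
    """Appends a deterministic Adversary message to each observation.

    Hypothetical sketch of a Cheap Talk MDP: the message cannot occlude the
    ground-truth observation, alter dynamics or rewards, or add stochasticity.
    """

    def __init__(self, env, adversary_fn, msg_dim):
        self.env = env
        self.adversary_fn = adversary_fn  # deterministic map: observation -> message
        self.msg_dim = msg_dim

    def _augment(self, obs):
        msg = np.asarray(self.adversary_fn(obs))
        assert msg.shape == (self.msg_dim,)
        # Original observation is left fully intact; message is only appended.
        return np.concatenate([obs, msg])

    def reset(self):
        return self._augment(self.env.reset())

    def step(self, action):
        # Underlying dynamics and rewards are computed by the base env alone.
        obs, reward, done, info = self.env.step(action)
        return self._augment(obs), reward, done, info
```

Because `adversary_fn` is a fixed deterministic function during the Victim's training run, the augmented observations are stationary and on-distribution by construction, which is what distinguishes this setting from perturbation-based attacks.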

2. RELATED WORK

2.1. TEST-TIME ADVERSARIAL ATTACKS

Gleave et al. (2020) investigate adversarial attacks that influence the test-time performance of reinforcement learning agents that were trained in self-play.

In contrast to our method, the adversarial agent can directly interact with the environment and the victim agent. The aforementioned test-time attacks largely work by generating perturbations that push the observations out of the Victim's training distribution. In contrast, in Cheap Talk MDPs, the Victim trains directly with the static adversarial features; thus, by definition, the Adversary cannot generate out-of-distribution or non-stationary inputs. Furthermore, these test-time attacks also assume the ability to directly train against a pre-trained static agent. In contrast, in Cheap Talk MDPs, the adversary influences a learning agent with random initial parameters.

2.2. TRAIN-TIME ADVERSARIAL ATTACKS

In contrast to train-time adversarial attacks in RL, in test-time adversarial attacks the dversary interacts with a learning victim. Pinto et al. ( 2017) simultaneously trains an adversary alongside a reinforcement learning agent to robustify the victim's policy. Unlike in this work, the adversary is able to directly apply perturbation forces to the environment. We make further comparisons in Section 6.1. Backdoor attacks in reinforcement learning aim to introduce a vulnerability during train-time, which can be triggered at test-time. Kiourti et al. (2020) and Ashcraft and Karra (2021) assume the adversary can directly and fully modify the victim's observations and rewards in order to discretely insert a backdoor that triggers on certain inputs. This is unlike Cheap Talk MDPs, in which only 'useless' parts of the observations can be modified. Wang et al. ( 2021) considers the multi-agent setting where the adversary inserts a backdoor using its behaviour in the environment. Unlike in this work, the adversary can influence the underlying environment dynamics. Furthermore, each of these backdoor attacks simply cause the victim to fail when triggered. In contrast, we use the backdoor to fully control the victim.

2.3. FAILURE MODES IN DEEP REINFORCEMENT LEARNING

Previous work has shown that using neural networks as function approximators in reinforcement learning often results in multiple failure modes due to the non-stationarity of value function bootstrapping (van Hasselt et al., 2018). In particular, several works have shown that catastrophic interference

