ILLUSORY ADVERSARIAL ATTACKS ON SEQUENTIAL DECISION-MAKERS AND COUNTERMEASURES

Abstract

Autonomous decision-making agents deployed in the real world need to be robust against possible adversarial attacks on sensory inputs. Existing work on adversarial attacks focuses on the notion of perceptual invariance popular in computer vision. We observe that such attacks can often be detected by victim agents, since they result in action-observation sequences that are not consistent with the dynamics of the environment. Furthermore, real-world agents, such as physical robots, commonly operate under human supervisors who are not susceptible to such attacks. We propose to instead focus on attacks that are statistically undetectable. Specifically, we propose illusory attacks, a novel class of adversarial attack that is consistent with the environment dynamics. We introduce a novel algorithm that can learn illusory attacks end-to-end. We empirically verify that our algorithm generates attacks that, in contrast to current methods, are undetectable to both AI agents with an environment dynamics model, as well as to humans. Furthermore, we show that existing robustification approaches are relatively ineffective against illusory attacks. Our findings highlight the need to ensure that real-world AI, and human-AI, systems are designed to make it difficult to corrupt sensory observations in ways that are consistent with the environment dynamics.

1. INTRODUCTION

Deep reinforcement learning algorithms such as DQN (Mnih et al., 2015), PPO (Schulman et al., 2017), SAC (Haarnoja et al., 2018), and ES (Salimans et al., 2017) have found applications across a number of sequential decision-making problems, ranging from simulated and real-world robotics (Todorov et al., 2012; Andrychowicz et al., 2020) to arcade games (Mnih et al., 2015). It has recently been found, however, that deep neural network control policies conditioning on high-dimensional sensory input are prone to adversarial attacks on the input observations, which poses a threat to security- and safety-critical applications (Kos & Song, 2017; Huang et al., 2017) and thus motivates research into robust learning algorithms (Zhang et al., 2020). Existing frameworks for attacks on sequential decision-makers are largely inspired by pioneering work on perceptually invariant attacks in supervised computer vision settings (Ilahi et al., 2021). Unlike supervised settings, however, sequential decision-making involves temporally extended environment interaction, which gives rise to temporally correlated sequences of observations. In this paper, we argue that the failure of existing observation-space adversarial attacks to account for temporal consistency renders them ineffective in many settings of practical interest. AI agents often have access to an approximate or exact world model (Sutton, 2022; Ha & Schmidhuber, 2018). In addition, humans have the ability to perform "intuitive physics" (Hamrick et al., 2016), using robust but qualitative internal models of the world (Battaglia et al., 2013). This makes it possible to use one's understanding of the world to detect a large range of existing adversarial attacks by spotting inconsistencies in observation sequences. Existing observation-space adversarial attacks (Ilahi et al., 2021; Chen et al., 2019; Qiaoben et al., 2021; Sun et al., 2020) ignore these facts.
As a result, these attacks produce observation trajectories that are inconsistent with the dynamics of the unattacked environment. The consequences of this are twofold. First, state-of-the-art adversarial attacks can be trivially detected by victim agents with access to even low-accuracy world models, i.e., models of the environment dynamics. Second, in cases where AI agents are supervised by humans, the humans may also be able to detect these adversarial attacks. In other security contexts, such as computer networks, real-world adversarial attacks typically attempt to evade detection in order to avoid triggering security escalations, and this must be taken into account by the defender. While this insight has been exploited in the cybersecurity community (Provos, 2001; Claffy & Dainotti, 2012; Cazorla et al., 2018), undetectable adversarial attacks on sequential decision-makers, and defences against them, have not yet been systematically explored in the AI community. In this paper, we introduce illusory attacks, a novel class of adversarial attacks on sequential decision-makers that result in observation-space attacks that are consistent with the environment dynamics. We show that illusory attacks can succeed where existing attacks do not and, in particular, can successfully fool humans.
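The detectability argument above can be sketched as a simple consistency test: a victim with even an approximate world model can flag an attack whenever consecutive observations violate the model's transition function. The 1-D corridor environment, deterministic model, and example trajectories below are hypothetical illustrations of this idea, not the paper's implementation.

```python
# Hypothetical 1-D gridworld: the observation is the agent's cell index,
# the action is +1 (forward) or -1 (backward); cells are clipped to [0, 5].

def dynamics_model(pos: int, action: int) -> int:
    """Approximate world model: deterministic corridor dynamics."""
    return max(0, min(5, pos + action))

def is_consistent(observations, actions) -> bool:
    """Flag an attack if any observed transition violates the world model."""
    for t in range(len(actions)):
        if observations[t + 1] != dynamics_model(observations[t], actions[t]):
            return False
    return True

actions      = [1, 1, 1]     # the victim always moves forward
clean_obs    = [0, 1, 2, 3]  # unattacked trajectory: consistent
jump_attack  = [0, 2, 1, 3]  # observation jumps between cells: detected
stuck_attack = [0, 1, 1, 1]  # observation incorrectly stays put: detected

assert is_consistent(clean_obs, actions)
assert not is_consistent(jump_attack, actions)
assert not is_consistent(stuck_attack, actions)
```

An illusory attack, by construction, only produces observation sequences that pass such a test, so detection must rely on information outside the observation channel.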
Illusory attacks therefore pose a specific safety and security threat to human-AI interaction and human-in-the-loop settings (Schirner et al., 2013; Zerilli et al., 2019), as they may be undetectable even to attentive human supervisors. Illusory attacks seek to remain undetected and must hence attack the victim by replacing its perceived reality with an internally coherent alternative. We show that perfect (statistically undetectable) illusory attacks exist in a variety of environments. We then present the W-illusory attack framework (see Figure 1 for an illustration), which introduces a world-model consistency optimisation objective that encourages the victim's resultant action-observation histories to be consistent with the environment dynamics. We show that W-illusory attacks can be learnt efficiently and, unlike MNP (Kumar et al., 2021) or SA-MDP attacks (Zhang et al., 2021), are undetectable by humans. We empirically demonstrate that existing victim robustification methods are largely ineffective against illusory attacks. This leads us to suggest that the existence of reality feedback, i.e., observation channels that are hardened against adversarial interference, may play a decisive role in certifying the safety of real-world AI, and human-AI, systems. Our work makes the following contributions:

• We formalise perfect illusory attacks, a novel framework for undetectable adversarial attacks (see Section 4.2), and give examples in common benchmark environments (see Section 5.3).

• We introduce W-illusory attacks, a computationally feasible learning algorithm for adversarial attacks that generate victim action-observation sequences consistent with the unperturbed environment dynamics (see Section 4.5).

• We show that, compared to state-of-the-art adversarial attacks, W-illusory attacks are significantly less likely to be detected by AI agents (Section 5.2), as well as by humans (Section 5.4).
• We demonstrate that victim robustification against W-illusory attacks is challenging unless the environment admits reality feedback (see Section 5.5).
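The world-model consistency objective behind W-illusory attacks can be illustrated numerically: the adversary trades off attack value against a penalty on the divergence between the observation-transition distribution it induces and the true environment dynamics. The scalarised objective, the KL penalty, the weight `lam`, and the toy distributions below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def kl_divergence(p, q):
    """KL divergence between two discrete distributions (nats)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def adversary_objective(attack_value, induced, true_dynamics, lam=1.0):
    """Attack value minus a consistency penalty (hypothetical scalarisation):
    dynamics-inconsistent perturbations are discouraged in proportion to how
    far they drift from the true transition distribution."""
    return attack_value - lam * kl_divergence(induced, true_dynamics)

true_dyn      = [0.8, 0.1, 0.1]  # true next-observation distribution
detectable    = [0.1, 0.1, 0.8]  # strong but dynamics-inconsistent perturbation
illusory_like = [0.8, 0.1, 0.1]  # matches the dynamics: zero penalty

# An illusory-style perturbation keeps its full attack value...
assert adversary_objective(1.0, illusory_like, true_dyn) == 1.0
# ...while a dynamics-inconsistent one is penalised.
assert adversary_objective(1.0, detectable, true_dyn) < 1.0
```

In the limit of a large penalty weight, only attacks whose induced observation process matches the environment dynamics remain profitable, which is the intuition behind statistical undetectability.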



Figure 1: Illustration of different classes of adversarial attacks in a 6-cell Gridworld environment. Victim agents need to reach the green target as quickly as possible without traversing the orange lava. The adversary replaces the original victim observation (blue triangle) with an adversarial observation (red triangle). Note that in all scenarios, the victim ends up in the lava, upon which the episode terminates. However, the observations under the MNP and SA-MDP attacks (see Section 3) are not consistent with the forward actions taken by the agent: the red triangle jumps between cells (top row) or incorrectly stays in the same position (middle row). In contrast, the observations under the proposed illusory attack (bottom row) are consistent with the environment dynamics.

