ILLUSORY ADVERSARIAL ATTACKS ON SEQUENTIAL DECISION-MAKERS AND COUNTERMEASURES

Abstract

Autonomous decision-making agents deployed in the real world need to be robust against possible adversarial attacks on their sensory inputs. Existing work on adversarial attacks focuses on the notion of perceptual invariance popular in computer vision. We observe that such attacks can often be detected by victim agents, since they result in action-observation sequences that are not consistent with the dynamics of the environment. Furthermore, real-world agents, such as physical robots, commonly operate under human supervisors, who are not susceptible to such attacks. We propose to instead focus on attacks that are statistically undetectable. Specifically, we propose illusory attacks, a novel class of adversarial attack that is consistent with the environment dynamics. We introduce a novel algorithm that can learn illusory attacks end-to-end. We empirically verify that our algorithm generates attacks that, in contrast to current methods, are undetectable both to AI agents equipped with an environment dynamics model and to humans. Furthermore, we show that existing robustification approaches are relatively ineffective against illusory attacks. Our findings highlight the need to ensure that real-world AI and human-AI systems are designed to make it difficult to corrupt sensory observations in ways that are consistent with the environment dynamics.

1. INTRODUCTION

Deep reinforcement learning algorithms such as DQN, PPO, SAC, and ES (Mnih et al., 2015; Schulman et al., 2017; Haarnoja et al., 2018; Salimans et al., 2017) have found applications across a number of sequential decision-making problems, ranging from simulated and real-world robotics (Todorov et al., 2012; Andrychowicz et al., 2020) to arcade games (Mnih et al., 2015). It has recently been found, however, that deep neural network control policies conditioned on high-dimensional sensory input are prone to adversarial attacks on their input observations, which poses a threat to security- and safety-critical applications (Kos & Song, 2017; Huang et al., 2017) and thus motivates research into robust learning algorithms (Zhang et al., 2020). Existing frameworks for attacks on sequential decision-makers are largely inspired by pioneering work on perceptually invariant attacks in supervised computer vision settings (Ilahi et al., 2021). Unlike supervised settings, however, sequential decision-making involves temporally extended environment interactions, which give rise to temporally correlated sequences of observations. In this paper, we argue that the failure to take temporal consistency into account renders existing observation-space adversarial attacks ineffective in many settings of practical interest.

AI agents often have access to an approximate or exact world model (Sutton, 2022; Ha & Schmidhuber, 2018). In addition, humans can perform "intuitive physics" (Hamrick et al., 2016), relying on robust but qualitative internal models of the world (Battaglia et al., 2013). An agent can therefore use its understanding of the world to detect a large range of existing adversarial attacks by spotting inconsistencies in observation sequences. Existing observation-space adversarial attacks (Ilahi et al., 2021; Chen et al., 2019; Qiaoben et al., 2021; Sun et al., 2020) ignore these facts.
As a result, these attacks produce observation trajectories that are inconsistent with the dynamics of the unattacked environment. The consequences of this are twofold. First, state-of-the-art adversarial attacks can be trivially detected by victim agents with access to even low-accuracy world models, i.e., models of the environment dynamics. Second, in cases where AI agents are supervised by humans, the human supervisors may also be able to detect such attacks.
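This detection idea can be made concrete with a minimal sketch. The snippet below is an illustrative toy, not the paper's method: it assumes the victim holds an approximate one-step dynamics model and flags a trajectory as attacked when the average one-step prediction error exceeds a threshold; the environment, the `detection_score` helper, and the threshold value are all hypothetical choices for this example.

```python
import numpy as np

def detection_score(observations, actions, dynamics_model, threshold=0.3):
    """Return (mean one-step prediction error, attack flag).

    A trajectory is flagged as attacked when the observations are, on
    average, inconsistent with the victim's dynamics model.
    """
    errors = [
        np.linalg.norm(dynamics_model(observations[t], actions[t]) - observations[t + 1])
        for t in range(len(actions))
    ]
    mean_error = float(np.mean(errors))
    return mean_error, mean_error > threshold

# Toy deterministic environment: next state = state + action.
true_dynamics = lambda s, a: s + a

rng = np.random.default_rng(0)
states = [np.zeros(2)]
actions = [rng.normal(size=2) for _ in range(10)]
for a in actions:
    states.append(true_dynamics(states[-1], a))

# An unattacked trajectory is perfectly consistent with the model...
score_clean, flagged_clean = detection_score(states, actions, true_dynamics)

# ...whereas a small dynamics-inconsistent perturbation of each
# observation (mimicking a naive observation-space attack) is flagged.
attacked = [s + rng.normal(scale=0.5, size=2) for s in states]
score_attacked, flagged_attacked = detection_score(attacked, actions, true_dynamics)
```

An illusory attack, by contrast, would perturb observations only along trajectories that the dynamics model itself assigns high likelihood, leaving a detector of this form blind.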

