

Abstract

Recently demonstrated physical-world adversarial attacks have exposed vulnerabilities in perception systems that pose severe risks for safety-critical applications such as autonomous driving. These attacks place adversarial artifacts in the physical world that indirectly add universal perturbations to a model's inputs and can fool it in a variety of contexts. Adversarial training is the most effective defense against image-dependent adversarial attacks. However, tailoring adversarial training to universal perturbations is computationally expensive, since the optimal universal perturbations depend on the model weights, which change during training. We propose meta adversarial training (MAT), a novel combination of adversarial training with meta-learning, which overcomes this challenge by meta-learning universal perturbations along with model training. MAT requires little extra computation while continuously adapting a large set of perturbations to the current model. We present results for universal patch and universal perturbation attacks on image classification and traffic-light detection. MAT considerably increases robustness against universal patch attacks compared to prior work.

1. INTRODUCTION

Deep learning is currently the most promising method for open-world perception tasks such as automated driving and robotics. However, its use in safety-critical domains is questionable, since a lack of robustness of deep learning-based perception has been demonstrated (Szegedy et al., 2014; Goodfellow et al., 2015; Metzen et al., 2017; Hendrycks & Dietterich, 2019).



Figure 1: Illustration of a digital universal patch attack against an undefended model (left) and a model defended with meta adversarial training (MAT, right) on Bosch Small Traffic Lights (Behrendt & Novak, 2017). A patch can lead the undefended model to detect non-existent traffic lights and miss real ones that would be detected without the patch (bottom left). In contrast, the same patch is ineffective against a MAT model (bottom right). Moreover, a patch optimized for the MAT model (top right), which bears a resemblance to traffic lights, does not cause the model to remove correct detections.

Physical-world adversarial attacks (Kurakin et al., 2017; Athalye et al., 2018; Braunegg et al., 2020) are one of the most problematic failures in robustness of deep learning. Examples of such attacks are fooling models for traffic sign recognition (Chen et al., 2018; Eykholt et al., 2018a;b; Huang et al., 2019), face recognition (Sharif et al., 2016; 2017), optical flow estimation (Ranjan et al., 2019), person detection (Thys et al., 2019; Wu et al., 2020b; Xu et al., 2020), and LiDAR perception (Cao et al., 2019). In this work, we focus on two subsets of these physical-world attacks: local ones, which place a printed pattern in a scene that does not overlap with the target object (Lee & Kolter, 2019; Huang et al., 2019), and global ones, which attach a mainly translucent sticker to the lens of a camera (Li et al., 2019). Note that these physical-world attacks have corresponding digital-domain attacks, in which the attacker directly modifies the signal after it has been received by the sensor and before it is processed by the model. The corresponding digital-domain

