

Abstract

Recently demonstrated physical-world adversarial attacks have exposed vulnerabilities in perception systems that pose severe risks for safety-critical applications such as autonomous driving. These attacks place adversarial artifacts in the physical world that indirectly cause universal perturbations to be added to a model's inputs, fooling it in a variety of contexts. Adversarial training is the most effective defense against image-dependent adversarial attacks. However, tailoring adversarial training to universal perturbations is computationally expensive since the optimal universal perturbations depend on the model weights, which change during training. We propose meta adversarial training (MAT), a novel combination of adversarial training with meta-learning, which overcomes this challenge by meta-learning universal perturbations along with model training. MAT requires little extra computation while continuously adapting a large set of perturbations to the current model. We present results for universal patch and universal perturbation attacks on image classification and traffic-light detection. MAT considerably increases robustness against universal patch attacks compared to prior work.

1. INTRODUCTION

Deep learning is currently the most promising method for open-world perception tasks such as those in automated driving and robotics. However, its use in safety-critical domains is questionable, since a lack of robustness of deep learning-based perception has been demonstrated (Szegedy et al., 2014; Goodfellow et al., 2015; Metzen et al., 2017; Hendrycks & Dietterich, 2019). The digital-domain attack corresponding to the adversarial camera sticker is a type of universal adversarial perturbation (Moosavi-Dezfooli et al., 2017), while the digital adversarial patch attack (Brown et al., 2017; Anonymous, 2020) corresponds to physical patch attacks (Lee & Kolter, 2019; Huang et al., 2019). We focus on increasing robustness against digital-domain attacks. Digital-domain attacks are strictly stronger than the corresponding physical-world attacks since they give the attacker complete control over the change of the signal. In contrast, physical-world attacks need to be invariant under effects such as scale, rotation, object position, and lighting conditions, which cannot be controlled by the attacker. Therefore, a system robust against digital-domain attacks is also robust against the corresponding physical-world attacks. Currently, the most promising method for increasing robustness against adversarial attacks is adversarial training (Goodfellow et al., 2015; Madry et al., 2018). Approaches that tailor adversarial training to universal perturbations face the challenge of balancing the implicit trade-off between simulating universal perturbation attacks accurately and keeping the computational cost of the proxy attacks small. We propose meta adversarial training (MAT)¹, which falls into the category of proxy attacks. MAT combines adversarial training with meta-learning.
We summarize the key novel contributions of MAT and refer to Section 3 for details:

• MAT amortizes the cost of computing universal perturbations by sharing information about optimal perturbations across consecutive steps of model training, which considerably reduces the cost of generating strong approximations of universal perturbations. In contrast to UAT (Shafahi et al., 2018), MAT shares this information via meta-learning rather than joint training, which empirically yields stronger perturbations and a more robust model.

• MAT meta-learns a large set of perturbations concurrently. While a model easily overfits a single perturbation, even one that changes over time as in UAT, overfitting is much less likely for a larger set of perturbations such as the one generated by MAT.

• MAT encourages diversity among the generated perturbations by assigning a random but fixed target class and step size to each perturbation during meta-learning. This prevents many perturbations from concentrating on the same vulnerability of the model.

We perform an extensive empirical evaluation and ablation study of MAT on image classification and traffic-light detection tasks against a variety of attacks to show the robustness of MAT against universal patches and perturbations (see Section 4). We refer to Figure 1 for an illustration of MAT for universal patch attacks against traffic-light detection.
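To make the interplay of the two learning loops concrete, the following toy sketch is our own illustration of the idea, not the paper's implementation: a tiny NumPy logistic-regression "model" is trained while a pool of perturbations, each with a fixed random target class and step size, is refined by a few inner attack steps per mini-batch; the stored perturbation is then moved toward its refined version with a Reptile-style meta update. All names and constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: linearly separable binary classification.
X = rng.normal(size=(256, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)

w = np.zeros(10)                          # model weights (trained below)
K, eps = 8, 0.5                           # pool size, L-inf perturbation bound
perts = rng.uniform(-eps, eps, (K, 10))   # meta-learned perturbation pool
targets = rng.integers(0, 2, K)           # fixed random target class per perturbation
steps = rng.uniform(0.05, 0.2, K)         # fixed random step size per perturbation

for _ in range(300):
    idx = rng.integers(0, 256, 32)
    xb, yb = X[idx], y[idx]
    k = rng.integers(0, K)                # sample one perturbation from the pool
    delta = perts[k].copy()
    for _ in range(3):                    # inner loop: FGSM-like steps toward the target
        p = sigmoid((xb + delta) @ w)
        g = np.mean(p - targets[k]) * w   # d(targeted loss)/d(delta) for this linear model
        delta = np.clip(delta - steps[k] * np.sign(g), -eps, eps)
    perts[k] += 0.5 * (delta - perts[k])  # Reptile-style meta update of the stored perturbation
    p = sigmoid((xb + delta) @ w)         # outer loop: train the model on the perturbed batch
    w -= 0.1 * ((p - yb) @ (xb + delta)) / len(yb)

clean_acc = float(np.mean((sigmoid(X @ w) > 0.5) == y))
```

The key design choice this sketch captures is that the pool entries persist across training steps, so each inner loop starts from an already-adapted perturbation instead of from scratch.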

2. RELATED WORK

We review work on generating universal perturbations, defending against them, and meta-learning.

Generating Universal Perturbations. Adversarial perturbations are changes to the input that are crafted with the intention of fooling a model's prediction on that input. Universal perturbations are a special case in which a single perturbation must be effective on the majority of samples from the input distribution. Most work focuses on small additive perturbations bounded by an ℓp-norm constraint. For example, Moosavi-Dezfooli et al. (2017) proposed the first approach by extending the DeepFool algorithm (Moosavi-Dezfooli et al., 2016). Similarly, Metzen et al. (2017) extended the iterative fast gradient sign method (Kurakin et al., 2017) to generate universal perturbations for semantic image segmentation. Mopuri et al. (2017; 2018) presented data-independent attacks and
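As a minimal illustration of this family of iterative attacks (our own sketch with assumed names, not code from any of the cited papers), the following NumPy snippet crafts a single targeted universal perturbation against a fixed linear classifier: sign-gradient steps on the targeted loss, averaged over each batch, projected onto an ℓ∞-ball after every step.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A fixed "victim": a linear classifier that labels its own data perfectly.
w = rng.normal(size=10)
X = rng.normal(size=(512, 10))
y = (X @ w > 0).astype(float)

eps, alpha, target = 0.3, 0.05, 0.0   # L-inf budget, step size, target class
delta = np.zeros(10)                  # the single universal perturbation
for _ in range(20):                   # epochs over the data set
    for s in range(0, 512, 64):
        xb = X[s:s + 64]
        p = sigmoid((xb + delta) @ w)
        g = np.mean(p - target) * w   # gradient of the targeted loss w.r.t. the input
        # One sign-gradient step toward the target, then project onto the eps-ball.
        delta = np.clip(delta - alpha * np.sign(g), -eps, eps)

fool_rate = float(np.mean((sigmoid((X + delta) @ w) > 0.5) != y))
```

For this linear toy model all input gradients are parallel to w, so the perturbation quickly saturates at a corner of the ℓ∞-ball; for deep networks the update is the same, but per-sample gradients differ and the loop genuinely aggregates them across the data set.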



Physical-world adversarial attacks (Kurakin et al., 2017; Athalye et al., 2018; Braunegg et al., 2020) are among the most problematic failures in robustness of deep learning. Examples of such attacks are fooling models for traffic-sign recognition (Chen et al., 2018; Eykholt et al., 2018a;b; Huang et al., 2019), face recognition (Sharif et al., 2016; 2017), optical flow estimation (Ranjan et al., 2019), person detection (Thys et al., 2019; Wu et al., 2020b; Xu et al., 2020), and LiDAR perception (Cao et al., 2019). In this work, we focus on two subsets of these physical-world attacks: local attacks, which place a printed pattern in a scene without overlapping the target object (Lee & Kolter, 2019; Huang et al., 2019), and global attacks, which attach a mainly translucent sticker to the lens of a camera (Li et al., 2019). Note that these physical-world attacks have corresponding digital-domain attacks, in which the attacker directly modifies the signal after it is received by the sensor and before it is processed by the model.

¹ Code will be publicly released upon acceptance and can be found in the supplementary material.



Figure 1: Illustration of a digital universal patch attack against an undefended model (left) and a model defended with meta adversarial training (MAT, right) on the Bosch Small Traffic Lights dataset (Behrendt & Novak, 2017). A patch can lead the undefended model to detect non-existent traffic lights and miss real ones that would be detected without the patch (bottom left). In contrast, the same patch is ineffective against a MAT model (bottom right). Moreover, a patch optimized for the MAT model (top right), which bears a resemblance to traffic lights, does not cause the model to remove correct detections.

Adversarial training simulates an adversarial attack for every mini-batch and trains the model to become robust against such attacks. Adversarial training against digital-domain universal perturbations or patches is complicated by the fact that these attacks are computationally much more expensive than image-dependent adversarial attacks, and existing approaches for speeding up adversarial training (Shafahi et al., 2019; Zhang et al., 2019; Zheng et al., 2020) are not directly applicable. Existing approaches for tailoring adversarial training to universal perturbations or patches either refrain from simulating attacks in every mini-batch (Moosavi-Dezfooli et al., 2017; Hayes & Danezis, 2018; Perolat et al., 2018), which bears the risk that the model easily overfits these fixed or rarely updated universal perturbations, or they use computationally cheaper proxy attacks such as "universal adversarial training" (UAT; Shafahi et al., 2018).
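For contrast with the perturbation pool used by MAT, the following toy sketch (our own illustration of the alternating joint-training scheme of UAT, with assumed names, not the authors' code) trains a single shared perturbation jointly with the model: one ascent step on the perturbation and one descent step on the weights per mini-batch.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: linearly separable binary classification.
X = rng.normal(size=(256, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)

w = np.zeros(10)                          # model weights
eps = 0.5                                 # L-inf perturbation bound
delta = np.zeros(10)                      # ONE shared perturbation, as in UAT
for _ in range(300):
    idx = rng.integers(0, 256, 32)
    xb, yb = X[idx], y[idx]
    # One ascent step on the perturbation per mini-batch ...
    p = sigmoid((xb + delta) @ w)
    g = np.mean(p - yb) * w               # d(mean loss)/d(delta) for this linear model
    delta = np.clip(delta + 0.1 * np.sign(g), -eps, eps)
    # ... followed by one descent step on the model weights.
    p = sigmoid((xb + delta) @ w)
    w -= 0.1 * ((p - yb) @ (xb + delta)) / len(yb)

clean_acc = float(np.mean((sigmoid(X @ w) > 0.5) == y))
```

Because the model only ever sees this one slowly changing perturbation, it can overfit to it; this is the weakness that maintaining a diverse perturbation pool, as in MAT, is meant to avoid.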

