EFFICIENT CERTIFIED DEFENSES AGAINST PATCH ATTACKS ON IMAGE CLASSIFIERS

Abstract

Adversarial patches pose a realistic threat model for physical-world attacks on autonomous systems via their perception component. Autonomous systems in safety-critical domains such as automated driving should thus contain a fail-safe fallback component that combines certifiable robustness against patches with efficient inference while maintaining high performance on clean inputs. We propose BAGCERT, a novel combination of model architecture and certification procedure that allows efficient certification. We derive a loss that enables end-to-end optimization of certified robustness against patches of different sizes and locations. On CIFAR10, BAGCERT certifies 10,000 examples in 43 seconds on a single GPU and obtains 86% clean and 60% certified accuracy against 5×5 patches.

1. INTRODUCTION

Adversarial patches (Brown et al., 2017) are one of the most relevant threat models for attacks on autonomous systems such as highly automated cars or robots. In this threat model, an attacker can freely control a small subregion of the input (the "patch") but must leave the rest of the input unchanged. This threat model is relevant because it corresponds to a physically realizable attack (Lee & Kolter, 2019): an attacker can print the adversarial patch pattern, place it in the physical world, and it will become part of the input of any system whose field of view overlaps with the physical patch. Moreover, once an attacker has generated a successful patch pattern, this pattern can easily be shared, will be effective against all systems using the same perception component, and the attack can be conducted without requiring access to the individual system. This makes it feasible, for instance, to attack an entire fleet of cars from the same vendor.

While several empirical defenses have been proposed (Hayes, 2018; Naseer et al., 2019; Selvaraju et al., 2019; Wu et al., 2020), these only offer robustness against known attacks but not necessarily against more effective attacks that may be developed in the future (Chiang et al., 2020). In contrast, certified defenses for the patch threat model (Chiang et al., 2020; Levine & Feizi, 2020; Zhang et al., 2020; Xiang et al., 2020) guarantee robustness against all possible attacks within the given threat model. Ideally, a certified defense should combine high certified robustness with efficient inference while maintaining strong performance on clean inputs. Moreover, the training objective should be based on the certification problem to avoid post-hoc calibration of the model for certification. Existing defenses do not satisfy all of these conditions: Chiang et al. (2020) proposed an approach that extends interval-bound propagation (Gowal et al., 2019) to the patch threat model.
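The patch threat model described above can be made concrete with a short sketch: the attacker freely overwrites one small rectangular sub-region of the input and leaves the remainder untouched. This is an illustrative formalization only (the function name `apply_patch` and the concrete sizes are our own, not from the paper):

```python
import numpy as np

def apply_patch(image: np.ndarray, patch: np.ndarray, row: int, col: int) -> np.ndarray:
    """Patch threat model: the attacker controls a small sub-region of the
    input (the 'patch') but leaves the rest of the input unchanged."""
    h, w = patch.shape[:2]
    attacked = image.copy()
    attacked[row:row + h, col:col + w] = patch  # arbitrary attacker-chosen content
    return attacked

# Example: a 5x5 patch placed on a 32x32 CIFAR10-sized image.
image = np.zeros((32, 32, 3), dtype=np.float32)
patch = np.random.rand(5, 5, 3).astype(np.float32)
attacked = apply_patch(image, patch, row=10, col=12)

# Only the 5x5x3 patched region may differ from the clean input.
assert np.count_nonzero(attacked != image) <= 5 * 5 * 3
```

A certified defense must guarantee the prediction is unchanged for *every* possible `patch` content at *every* admissible location, not merely for patches found by a particular attack.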
In this approach, there is a clear connection between the training objective and the certification problem. However, certified accuracy is relatively low and clean performance is severely affected (below 50% on CIFAR10). Moreover, inference requires separate forward passes for all possible patch positions and is thus computationally very expensive. Derandomized smoothing (Levine & Feizi, 2020) achieves much higher certified and clean performance on CIFAR10 and even scales to ImageNet. However, inference is computationally expensive since it is based on separately propagating many differently ablated versions of a single input. Moreover, training and certification are disconnected, and the parameters of the post-hoc certification procedure must be tuned separately on hold-out data, a drawback shared by Clipped BagNet (Zhang et al., 2020) and PatchGuard (Xiang et al., 2020).

In this work, we propose BAGCERT, which combines high certified accuracy (60% on CIFAR10 for 5×5 patches) and clean performance (86% on CIFAR10), efficient inference (43 seconds on a single GPU for the 10,000 CIFAR10 test samples), and end-to-end training for robustness against patches of varying size, aspect ratio, and location. BAGCERT is based on the following contributions:

• We propose three different conditions that can be checked to certify robustness. One of these corresponds to the condition proposed by Levine & Feizi (2020). However, we show that an alternative condition improves the certified accuracy of the same model typically by roughly 3 percentage points while remaining broadly applicable.

• We derive a loss function that directly optimizes for certified accuracy against a uniform distribution of patch sizes at arbitrary positions. This loss is a specific instance of the well-known class of margin losses.

• Similarly to Levine & Feizi (2020), we classify images via majority voting over a large number of predictions, each based on a small local region of a single input. However, the proposed model achieves this via a single forward pass on the unmodified input by utilizing a neural network architecture with very small receptive fields, similar to BagNets (Brendel & Bethge, 2019). This enables efficient inference with surprisingly high clean accuracy and was concurrently proposed by Zhang et al. (2020) and Xiang et al. (2020).

2. RELATED WORK

Certified Defenses. Evaluating defense methods by their performance against empirical attacks can create a false sense of security, since stronger adversaries that break the defenses might be developed in the future (Athalye et al., 2018a; Uesato et al., 2018). It is therefore important to have guarantees of robustness. Numerous works in the field of certified robustness range from complete verifiers that find the worst-case adversarial examples exactly (Huang et al., 2017; Tjeng & Tedrake, 2017) to faster but less accurate incomplete methods that provide an upper bound on the robust error (Gehr et al., 2018; Wong & Kolter, 2018; Wong et al., 2018; Gowal et al., 2019). Another line of work is based on randomized smoothing (Lecuyer et al., 2019; Li et al., 2019; Cohen et al., 2019), which exhibits strong empirical results and scales to ImageNet, however at the cost of increasing inference time by orders of magnitude. Certified defenses crafted for patch attacks were first proposed by Chiang et al. (2020), who adapt the IBP method (Gowal et al., 2019) to the patch threat model. Although their approach obtains robustness guarantees, it only scales to small patches and causes a significant drop in clean accuracy. Levine & Feizi (2020) propose derandomized smoothing, which achieves considerably higher clean and certified accuracy and scales to ImageNet, but requires separately propagating many ablated versions of each input at inference time.

Heuristic Defenses Against Patch Attacks. Several heuristic defenses against adversarial patches have been proposed, such as digital watermarking (Hayes, 2018) or local gradient smoothing (Naseer et al., 2019). However, similarly to the results obtained for norm-bounded adversarial attacks (Athalye et al., 2018a), it was demonstrated that these defenses can be easily broken by white-box attacks that account for the pre-processing steps in the optimization procedure (Chiang et al., 2020). Saha et al. (2019) investigated the role of spatial context in object detection algorithms, which makes them vulnerable to patch attacks, and proposed an empirical defense based on Grad-CAM (Selvaraju et al., 2019). Existing augmentation techniques based on adding a Gaussian noise patch (Lopes et al., 2019) or a patch from a different image (Yun et al., 2019) increase robustness against occlusions caused by adversarial patches. Wu et al. (2020) propose a defense that uses adversarial training to increase robustness against occlusion attacks.
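The majority-vote certification idea shared with Levine & Feizi (2020) can be sketched as follows. With a network whose spatial outputs each depend on an rf × rf local region (stride 1), a p × p patch can influence at most (p + rf − 1)² of the per-location votes; robustness is certified when even moving all of those votes to the runner-up class cannot overturn the majority. This is a simplified sketch of that counting condition, not the paper's exact procedure, and the function name `certify_majority_vote` is our own:

```python
import numpy as np

def certify_majority_vote(votes: np.ndarray, patch: int, rf: int) -> bool:
    """votes: (H, W) array of per-location predicted class labels.

    A patch of size patch x patch overlaps the receptive field (rf x rf,
    stride 1) of at most (patch + rf - 1)**2 output locations, so at most
    that many votes can be flipped. Certify if, in the worst case where all
    affected votes leave the top class and join the runner-up, the top
    class still wins the majority vote."""
    _, counts = np.unique(votes, return_counts=True)
    order = np.argsort(counts)[::-1]
    top = counts[order[0]]
    runner_up = counts[order[1]] if len(counts) > 1 else 0
    affected = (patch + rf - 1) ** 2  # upper bound on flippable votes
    return bool(top - affected > runner_up + affected)

# Example: 28x28 vote map, 774 votes for class 0 and 10 for class 1.
# A 5x5 patch with rf=5 can flip at most 81 votes: 774-81 > 10+81, certified.
votes = np.zeros((28, 28), dtype=int)
votes[:2, :5] = 1
print(certify_majority_vote(votes, patch=5, rf=5))  # True
```

The bound ignores image boundaries (patches near the border overlap fewer receptive fields), so it is conservative but always sound; the paper's alternative condition tightens this style of check for additional certified accuracy.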

