CERTIFIED ROBUSTNESS AGAINST PHYSICALLY-REALIZABLE PATCH ATTACKS

Abstract

This paper proposes a certifiable defense against adversarial patch attacks on image classification. Our approach classifies random crops from the original image independently and classifies the original image as the majority vote over the predicted classes of the crops. This requires minimal changes to the training process, since only the crop classification model needs to be trained, and it can be trained in a standard manner without explicit adversarial training. Leveraging the fact that a patch attack can only influence a certain number of pixels in the image, we derive certified robustness bounds for the classifier. Our method is particularly effective when realistic transformations, such as affine transformations, are applied to the adversarial patch; such transformations occur naturally when an adversarial patch is physically introduced into a scene. Our method improves upon the current state of the art in defending against patch attacks on CIFAR10 and ImageNet, both in terms of certified accuracy and inference time.

1. INTRODUCTION

Despite their incredible success in many computer vision tasks, deep neural networks are known to be sensitive to adversarial attacks: small perturbations to an input image can lead to large changes in the output. A wide range of defenses against adversarial attacks have been proposed in image classification, where the goal of the attacker is simply to change the predicted label(s) of an image (Kurakin et al., 2016a; Szegedy et al., 2013; Madry et al., 2017). But these defenses have typically considered a relatively unrealistic threat model that does not easily extend to physical settings. In particular, these works have mainly considered the so-called ℓp-norm threat model, where an attacker is allowed to perturb the intensity at all pixels of the input image by a small amount. In contrast, adversarial patch attacks are considered physically-realizable alternatives, modeling scenarios where a small object is placed in the scene so as to alter or suppress classification results (Brown et al., 2017). Here, the attack is spatially compact, but can change pixel values to any value within an allowable range.

This paper develops a practical and provably robust defense against patch attacks. Inspired by the randomized smoothing defense (Cohen et al., 2019; Levine & Feizi, 2019) for the ℓp-norm threat model, our approach classifies randomly sampled sub-regions, or crops, of an image independently and outputs the majority vote across these crops as the class prediction of the input image. This approach has numerous benefits. First, given the size of adversarial patches, we can compute the probability of a sampled crop overlapping with the attacked region (patch), and use this probability to determine whether the classification outcome of an image can be guaranteed (certified) not to be changed by any adversarial patch.
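As an illustration, the crop-voting inference described above can be sketched as follows. The function names, the uniform sampling of crop positions, and the `crop_classifier` interface (a callable mapping a crop array to an integer label) are assumptions made for exposition, not the paper's implementation:

```python
import numpy as np

def classify_by_crop_votes(image, crop_classifier, crop_size, n_crops, rng=None):
    """Classify an image as the majority vote over randomly sampled crops.

    Sketch only: `crop_classifier` is assumed to map an
    (crop_size x crop_size x C) array to an integer class label.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    votes = {}
    for _ in range(n_crops):
        # Sample the top-left corner of a square crop uniformly at random.
        y = rng.integers(0, h - crop_size + 1)
        x = rng.integers(0, w - crop_size + 1)
        crop = image[y:y + crop_size, x:x + crop_size]
        label = crop_classifier(crop)
        votes[label] = votes.get(label, 0) + 1
    # The image-level prediction is the plurality class over the crops.
    return max(votes, key=votes.get), votes
```

Because each crop is classified independently, a patch can only corrupt the votes of the crops that overlap it, which is what makes the certificate possible.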
Second, this approach is highly practical, as the crop classifier can be trained using standard architectures such as VGG (Simonyan & Zisserman, 2014) or ResNet (He et al., 2016) without the need for adversarial training. Indeed, random cropping is already a common data augmentation strategy for training machine learning models, so the method can be trained via standard techniques. This is different from most existing work on certifiable defenses against patch attacks (Levine & Feizi, 2020; Xiang et al., 2020; Chiang et al., 2020), which needs extra computation for certification during training. Third, the proposed approach separates the training procedure from the patch threat model, making the method more robust in realistic settings of patch attacks, for example, patch transformations including rotation in the x-y plane and aspect ratio changes.

We report certified accuracy, which is the percentage of test images for which the classification outcome equals the ground-truth label and is guaranteed not to change under patch attack. Our method is better in both speed and certified accuracy compared to De-randomized smoothing (Levine & Feizi, 2020) and PatchGuard (Xiang et al., 2020) under patch attacks with possible affine transformations of the patch. In addition, our method outperforms these past approaches on ImageNet (though not on CIFAR10) in the setting where the patch aligns with the image axes and does not undergo affine transformations, as in Table 2; this was the setting considered in that past work.

We make several contributions in this paper: first, we propose a defense against patch attacks for image classification with certified robustness; second, the proposed method is fast in computing image certification and robust against patch transformations; third, the proposed method can be applied to any image classification model with only minimal changes to the training process.
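To make the certification idea concrete, the following sketch shows one way a vote-margin certificate of this kind can be checked. It assumes every valid crop position is classified once (exhaustive voting) and that, in the worst case, an axis-aligned square patch flips every crop it overlaps to the runner-up class. This is an illustrative bound in the spirit of the approach, not the paper's exact analysis, and all names are hypothetical:

```python
def max_overlapping_positions(h, w, crop, patch):
    """Worst-case count of crop top-left positions whose crop x crop
    window can intersect an axis-aligned patch x patch region in an
    h x w image: at most crop + patch - 1 positions per axis, clipped
    to the number of valid positions on that axis."""
    positions_h = min(h - crop + 1, crop + patch - 1)
    positions_w = min(w - crop + 1, crop + patch - 1)
    return positions_h * positions_w

def is_certified(vote_counts, h, w, crop, patch):
    """Certify the majority vote under exhaustive crop voting: the
    prediction cannot change if moving every patch-overlapping vote
    from the top class to the runner-up still leaves the top class
    strictly ahead."""
    m = max_overlapping_positions(h, w, crop, patch)
    counts = sorted(vote_counts.values(), reverse=True)
    top = counts[0]
    runner_up = counts[1] if len(counts) > 1 else 0
    # Worst case: m votes leave the top class and join the runner-up.
    return top - m > runner_up + m
```

Because `max_overlapping_positions` depends only on the image, crop, and patch sizes, the certificate check itself is a constant-time margin comparison once the votes are tallied.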

2. BACKGROUND AND RELATED WORK

Adversarial attacks Adversarial attacks on image classification have been known for some time, with original work coming out of the field of robust optimization (Ben-Tal et al., 2009). Test-time attacks on ML models in general were studied in (Dalvi et al., 2004; Biggio et al., 2013), though the area gained considerable momentum when these methods were applied to deep learning systems to demonstrate that deep classifiers could be easily fooled by imperceptible changes to images (Szegedy et al., 2013; Goodfellow et al., 2014). In the following years many defenses against such attacks were proposed (Tramèr et al., 2017; Papernot et al., 2016), although most heuristic approaches were later found to be ineffective (Athalye et al., 2018). Amongst the defense strategies that have stood the test of time are: 1) adversarial training (Goodfellow et al., 2014; Kurakin et al., 2016b; Madry et al., 2017), now commonly carried out by using projected gradient descent to synthesize adversarial attacks and then incorporating them into training; and 2) provably robust training (Wong & Kolter, 2017; Raghunathan et al., 2018). Our approach is more related to the randomized smoothing-based methods (Cohen et al., 2019; Levine & Feizi, 2019) in the latter direction, in which random points around the original input are sampled and classified, and the predicted class of the original input is declared as the aggregation of these outputs.

The majority of the attacks mentioned above focus on attacks with bounded ℓ∞-norm, where attacks are permitted to modify any pixel in the image by (at most) some fixed amount (and are usually permitted to design a new adversarial perturbation for each new input image, though so-called "universal" ℓ∞ attacks have been studied as well (Moosavi-Dezfooli et al., 2017)). If we consider real-life attackers, however, this level of freedom on the attacker's side seems uncommon: attackers
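The randomized-smoothing aggregation just described can be sketched as follows; this is an illustrative Gaussian-noise variant in the spirit of Cohen et al. (2019), with a hypothetical `base_classifier` interface (an array-to-label callable), not the exact procedure of any of the cited works:

```python
import numpy as np

def smoothed_predict(image, base_classifier, sigma, n_samples, rng=None):
    """Randomized-smoothing sketch: classify n_samples noisy copies of
    the input and return the most frequent predicted class."""
    rng = np.random.default_rng() if rng is None else rng
    counts = {}
    for _ in range(n_samples):
        # Sample a point around the original input with Gaussian noise.
        noisy = image + rng.normal(0.0, sigma, size=image.shape)
        label = base_classifier(noisy)
        counts[label] = counts.get(label, 0) + 1
    # Aggregate the per-sample outputs by majority vote.
    return max(counts, key=counts.get)
```

The crop-based defense in this paper replaces the Gaussian noise with random crop sampling, so that the randomness aligns with the spatially compact patch threat model rather than the ℓp-norm one.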



Table 1: Worst-case certified accuracy (%), clean accuracy (%), and certification time of the proposed method, De-randomized smoothing (Levine & Feizi, 2020), and PatchGuard (Xiang et al., 2020) with De-randomized smoothing and BagNets as base structures. For each method, we list the worst certified accuracy under affine transformation of the patch. Note that this differs from the results in the original papers, where patch transformations are not considered.

This is in sharp contrast to prior work that either has fixed strategies to exclude parts of the image or extracts features from fixed parts of the image (Levine & Feizi, 2020; Xiang et al., 2020). We summarize our main results on CIFAR10 and ImageNet in Table 1, in comparison with the current state-of-the-art certifiable defense against patch attacks (Xiang et al., 2020) under patch transformation.
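The certified accuracy reported here is the fraction of test images that are both correctly classified and certified; a minimal sketch of the metric (with hypothetical argument names, assuming per-image predictions, certification flags, and ground-truth labels have already been computed):

```python
def certified_accuracy(predictions, certified_flags, labels):
    """Fraction of test images whose prediction equals the ground-truth
    label AND is certified to be unchangeable by any admissible patch."""
    correct_and_certified = sum(
        1 for pred, cert, y in zip(predictions, certified_flags, labels)
        if cert and pred == y
    )
    return correct_and_certified / len(labels)
```

Clean accuracy drops the `cert` condition, which is why certified accuracy is always a lower bound on it.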

