CERTIFIED ROBUSTNESS AGAINST PHYSICALLY-REALIZABLE PATCH ATTACKS

Abstract

This paper proposes a certifiable defense against adversarial patch attacks on image classification. Our approach independently classifies randomly sampled crops of the input image and takes the majority vote over the crops' predicted classes as the final prediction for the image. The defense requires minimal changes to training: only the crop classification model needs to be trained, and it can be trained in a standard manner without explicit adversarial training. Leveraging the fact that a patch attack can influence only a bounded number of pixels in the image, we derive certified robustness bounds for the classifier. Our method is particularly effective when realistic transformations, such as affine transformations, are applied to the adversarial patch; such transformations occur naturally when an adversarial patch is physically introduced into a scene. Our method improves upon the current state of the art in defending against patch attacks on CIFAR-10 and ImageNet, in terms of both certified accuracy and inference time.

1. INTRODUCTION

Despite their incredible success in many computer vision tasks, deep neural networks are known to be sensitive to adversarial attacks: small perturbations to an input image can lead to large changes in the output. A wide range of defenses against adversarial attacks have been proposed in image classification, where the goal of the attacker is simply to change the predicted label(s) of an image (Kurakin et al., 2016a; Szegedy et al., 2013; Madry et al., 2017). But these defenses have typically considered a relatively unrealistic threat model that does not easily extend to physical settings. In particular, these works have mainly considered the so-called ℓp-norm threat model, in which an attacker is allowed to perturb the intensity of all pixels of the input image by a small amount. In contrast, adversarial patch attacks are considered physically realizable alternatives, modeling scenarios where a small object is placed in the scene so as to alter or suppress classification results (Brown et al., 2017). Here, the attack is spatially compact but can change pixel values arbitrarily within the allowable range.

This paper develops a practical and provably robust defense against patch attacks. Inspired by the randomized smoothing defense (Cohen et al., 2019; Levine & Feizi, 2019) for the ℓp-norm threat model, our approach classifies randomly sampled sub-regions, or crops, of an image independently and outputs the majority vote across these crops as the class prediction for the input image. This approach has numerous benefits. First, given the size of adversarial patches, we can compute the probability of a sampled crop overlapping with the attacked region (patch), and use this probability to determine whether the classification outcome of an image can be guaranteed (certified) not to be changed by any adversarial patch.
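The voting and overlap-probability computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names and the stand-in `crop_classifier` callable are hypothetical, and the overlap bound assumes a square, axis-aligned patch placed at a worst-case (interior) location. Intuitively, if the vote margin between the top two classes is large relative to the fraction of crops that can touch the patch, no patch at that size can flip the majority.

```python
import numpy as np

def classify_by_crops(image, crop_size, n_crops, crop_classifier, n_classes, rng=None):
    """Majority-vote classification over randomly sampled square crops.

    `crop_classifier` is a stand-in for the trained crop model described in
    the text: any callable mapping a (crop_size, crop_size, C) array to a
    class index.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W = image.shape[:2]
    votes = np.zeros(n_classes, dtype=int)
    for _ in range(n_crops):
        # Sample the crop's top-left corner uniformly over valid positions.
        y = int(rng.integers(0, H - crop_size + 1))
        x = int(rng.integers(0, W - crop_size + 1))
        votes[crop_classifier(image[y:y + crop_size, x:x + crop_size])] += 1
    return int(votes.argmax()), votes

def worst_case_overlap_probability(H, W, crop_size, patch_size):
    """Probability that a uniformly placed square crop intersects an
    axis-aligned patch_size x patch_size patch at its worst-case location.

    Along one axis, a crop with top coordinate y intersects a patch starting
    at row a iff y is in [a - crop_size + 1, a + patch_size - 1], i.e. at most
    crop_size + patch_size - 1 of the H - crop_size + 1 valid positions.
    """
    hit_y = min(H - crop_size + 1, crop_size + patch_size - 1)
    hit_x = min(W - crop_size + 1, crop_size + patch_size - 1)
    return (hit_y * hit_x) / ((H - crop_size + 1) * (W - crop_size + 1))
```

For example, on a 32x32 image with 8x8 crops and a 5x5 patch, at most 12 of the 25 crop positions along each axis can touch the patch, so a uniformly sampled crop overlaps it with probability at most (12/25)^2 ≈ 0.23; the certificate then compares the observed vote margin against the number of crops this fraction could corrupt.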
Second, this approach is highly practical, as the crop classifier can be trained using standard architectures such as VGG (Simonyan & Zisserman, 2014) or ResNet (He et al., 2016) without the need for adversarial training. Indeed, random cropping is already a common data-augmentation strategy for training machine learning models, so the method can be trained via standard techniques. This differs from most existing work on certifiable defenses against patch attacks (Levine & Feizi, 2020; Xiang et al., 2020; Chiang et al., 2020), which requires extra computation for certification during training. Third, the proposed approach decouples the training procedure from the patch threat model, making the method more robust in realistic patch-attack settings, for example, under patch transformations such as rotation in the x-y plane and changes in aspect ratio.
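Because training reduces to standard random-crop augmentation, assembling a training batch for the crop classifier needs no certification-specific machinery: each crop simply inherits the label of its source image. The sketch below illustrates this under assumed conventions (NumPy HWC images, a hypothetical `crop_training_batch` helper not taken from the paper); in practice the same effect is achieved by a framework's built-in random-crop transform.

```python
import numpy as np

def crop_training_batch(images, labels, crop_size, rng):
    """Build a (crops, labels) training batch via standard random-crop
    augmentation; each crop is labeled with its source image's class."""
    crops, crop_labels = [], []
    for img, lab in zip(images, labels):
        H, W = img.shape[:2]
        y = int(rng.integers(0, H - crop_size + 1))
        x = int(rng.integers(0, W - crop_size + 1))
        crops.append(img[y:y + crop_size, x:x + crop_size])
        crop_labels.append(lab)
    return np.stack(crops), np.asarray(crop_labels)
```

The batch feeds any off-the-shelf classifier (e.g., a ResNet sized for the crop resolution) trained with an ordinary cross-entropy loss, which is what makes the defense drop into existing pipelines.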

