PERCEPTUAL ADVERSARIAL ROBUSTNESS: DEFENSE AGAINST UNSEEN THREAT MODELS

Abstract

A key challenge in adversarial robustness is the lack of a precise mathematical characterization of human perception, which is needed to define adversarial attacks that are imperceptible to human eyes. Most current attacks and defenses sidestep this issue by considering restrictive adversarial threat models, such as those bounded by L2 or L∞ distance, spatial perturbations, etc. However, models that are robust against any one of these restrictive threat models remain fragile against others; that is, they generalize poorly to unforeseen attacks. Moreover, even a model that is robust against the union of several restrictive threat models is still susceptible to imperceptible adversarial examples that are not contained in any of the constituent threat models. To resolve these issues, we propose adversarial training against the set of all imperceptible adversarial examples. Since this set is intractable to compute without a human in the loop, we approximate it using deep neural networks. We call this threat model the neural perceptual threat model (NPTM); it includes adversarial examples with a bounded neural perceptual distance (a neural network-based approximation of the true perceptual distance) to natural images. Through an extensive perceptual study, we show that the neural perceptual distance correlates well with human judgments of the perceptibility of adversarial examples, validating our threat model. Under the NPTM, we develop novel perceptual adversarial attacks and defenses. Because the NPTM is very broad, we find that Perceptual Adversarial Training (PAT) against a perceptual attack gives robustness against many other types of adversarial attacks. We test PAT on CIFAR-10 and ImageNet-100 against five diverse adversarial attacks: L2, L∞, spatial, recoloring, and JPEG.
We find that PAT achieves state-of-the-art robustness against the union of these five attacks, more than doubling the accuracy of the next best model, without training against any of them. That is, PAT generalizes well to unforeseen perturbation types. This is vital in sensitive applications where a particular threat model cannot be assumed, and to the best of our knowledge, PAT is the first adversarial training defense with this property.

1. INTRODUCTION

Many modern machine learning algorithms are susceptible to adversarial examples: carefully crafted inputs designed to fool models into giving incorrect outputs (Biggio et al., 2013; Szegedy et al., 2014; Kurakin et al., 2016a; Xie et al., 2017). Much research has focused on increasing classifiers' robustness against adversarial attacks (Goodfellow et al., 2015; Madry et al., 2018; Zhang et al., 2019a). However, existing adversarial defenses for image classifiers generally consider simple threat models. An adversarial threat model defines a set of perturbations that may be made to an image in order to produce an adversarial example. Common examples are the L2 and L∞ threat models, which constrain adversarial examples to be close to the original image in L2 or L∞ distance. Other work has proposed threat models that allow spatial perturbations (Engstrom et al., 2017; Wong et al., 2019; Xiao et al., 2018), recoloring (Hosseini and Poovendran, 2018; Laidlaw and Feizi, 2019; Bhattad et al., 2019), and other modifications (Song et al., 2018; Zeng et al., 2019) of an image.

There are multiple issues with these unrealistically constrained adversarial threat models. First, hardening against one threat model assumes that an adversary will only attempt attacks within that threat model. Although a classifier may be trained to be robust against L∞ attacks, for instance, an attacker could easily fool it with a spatial attack. One possible solution is to train against multiple threat models simultaneously (Jordan et al., 2019; Laidlaw and Feizi, 2019; Maini et al., 2019; Tramer and Boneh, 2019). However, this generally yields lower robustness against any one of the threat models than hardening against that threat model alone. Furthermore, not all possible threat models may be known at training time, and adversarial defenses do not usually generalize well to unforeseen threat models (Kang et al., 2019).
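To make the restrictive threat models above concrete: the core of an L∞-bounded attack such as PGD is a gradient ascent step on the loss followed by projection back onto the ε-ball around the clean image. The following is a minimal, framework-free sketch of that projection step; the random "gradient" and the step sizes are hypothetical placeholders for illustration, not the attack implementation used in this paper.

```python
import numpy as np

def linf_pgd_step(x_adv, x_orig, grad, step_size, eps):
    """One projected-gradient step under an L-infinity threat model:
    move in the sign of the loss gradient, then clip the total
    perturbation so that ||x_adv - x_orig||_inf <= eps."""
    x_adv = x_adv + step_size * np.sign(grad)
    # Project back onto the eps-ball around the clean image.
    x_adv = np.clip(x_adv, x_orig - eps, x_orig + eps)
    # Keep pixel values in the valid [0, 1] range.
    return np.clip(x_adv, 0.0, 1.0)

# Toy usage with a random array standing in for d(loss)/d(x).
rng = np.random.default_rng(0)
x = rng.random((8, 8, 3))          # a "clean image"
g = rng.standard_normal(x.shape)   # hypothetical loss gradient
x_adv = x.copy()
for _ in range(10):
    x_adv = linf_pgd_step(x_adv, x, g, step_size=0.01, eps=0.03)

print(np.abs(x_adv - x).max() <= 0.03 + 1e-8)  # → True: stays in the ball
```

However many steps are taken, the projection guarantees the result never leaves the ε-ball, which is exactly why such attacks cannot express the broader imperceptible perturbations discussed next.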
The ideal solution to these drawbacks would be a defense that is robust against a wide, unconstrained threat model. We differentiate between two such threat models. The unrestricted adversarial threat model (Brown et al., 2018) encompasses any adversarial example that is labeled as one class by a classifier but a different class by humans. On the other hand, we define the perceptual adversarial threat model as including all perturbations of natural images that are imperceptible to a human. Most existing narrow threat models, such as L2 and L∞, are near subsets of the perceptual threat model (Figure 1). Some other threat models, such as adversarial patch attacks (Brown et al., 2018), may perceptibly alter an image without changing its true class, and as such are contained only in the unrestricted adversarial threat model. In this work, we focus on the perceptual threat model.

The perceptual threat model can be formalized given the true perceptual distance d*(x1, x2) between images x1 and x2, defined as how different the two images appear to humans. For some threshold ε*, which we call the perceptibility threshold, images x and x' are indistinguishable from one another as long as d*(x, x') ≤ ε*. Note that in general ε* may depend on the specific input. Then, the perceptual threat model for a natural input x includes all adversarial examples x' that cause misclassification but are imperceptibly different from x, i.e. d*(x, x') ≤ ε*.

The true perceptual distance d*(·, ·), however, cannot be easily computed or optimized against. To solve this issue, we propose to use a neural perceptual distance, an approximation of the true perceptual distance between images using neural networks. Fortunately, many surrogate perceptual distances have been proposed in the computer vision literature, such as SSIM (Wang et al., 2004). Recently, Zhang et al. (2018) discovered that comparing the internal activations of a convolutional neural network when two different images are passed through it provides a measure, Learned Perceptual Image Patch Similarity (LPIPS), that correlates well with human perception. We propose to use the LPIPS distance d(·, ·) in place of the true perceptual distance d*(·, ·) to formalize the neural perceptual threat model (NPTM).

We present adversarial attacks and defenses for the proposed NPTM. Generating adversarial examples bounded by the neural perceptual distance is harder than generating Lp adversarial examples because of the complexity and non-convexity of the constraint. However, we develop two attacks for the NPTM: Perceptual Projected Gradient Descent (PPGD) and Lagrangian Perceptual Attack (LPA) (see Section 4 for details). We find that LPA is by far the strongest adversarial attack at a given level of perceptibility (see Figure 4), reducing the most robust classifier studied to only 2.4%
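To illustrate the idea behind an LPIPS-style distance (a simplified sketch, not the exact metric of Zhang et al. (2018), which uses learned per-channel weights on a pretrained network's activations), one can unit-normalize activations across channels at each layer and aggregate the per-layer squared differences. The two-layer "activations" below are hypothetical random arrays standing in for real network features.

```python
import numpy as np

def normalize_channels(act, eps=1e-10):
    """Scale each spatial activation vector to unit norm across the
    channel axis, as LPIPS does before comparing activations."""
    norm = np.sqrt((act ** 2).sum(axis=0, keepdims=True)) + eps
    return act / norm

def perceptual_distance(acts1, acts2):
    """Sum over layers of the spatially averaged squared difference
    between channel-normalized activations (unweighted LPIPS-style)."""
    total = 0.0
    for a1, a2 in zip(acts1, acts2):
        diff = normalize_channels(a1) - normalize_channels(a2)
        total += (diff ** 2).sum(axis=0).mean()
    return total

# Hypothetical activations for two images at two layers,
# each array shaped (channels, height, width).
rng = np.random.default_rng(0)
img1_acts = [rng.standard_normal((8, 4, 4)), rng.standard_normal((16, 2, 2))]
img2_acts = [a + 0.1 * rng.standard_normal(a.shape) for a in img1_acts]

print(perceptual_distance(img1_acts, img1_acts))  # → 0.0 for identical inputs
print(perceptual_distance(img1_acts, img2_acts) > 0)  # → True
```

Unlike an Lp ball, the constraint set {x' : d(x, x') ≤ ε} defined by such a distance is non-convex in pixel space, which is what makes attacking and defending under the NPTM harder than under Lp threat models.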



Figure 1: Relationships between various adversarial threat models. Lp and spatial adversarial attacks are nearly contained within the perceptual threat model, while patch attacks may be perceptible and thus are not contained. In this paper, we propose a neural perceptual threat model (NPTM) that is based on an approximation of the true perceptual distance using neural networks.

