PERCEPTUAL ADVERSARIAL ROBUSTNESS: DEFENSE AGAINST UNSEEN THREAT MODELS

Abstract

A key challenge in adversarial robustness is the lack of a precise mathematical characterization of human perception, used in the definition of adversarial attacks that are imperceptible to human eyes. Most current attacks and defenses try to avoid this issue by considering restrictive adversarial threat models such as those bounded by L 2 or L ∞ distance, spatial perturbations, etc. However, models that are robust against any of these restrictive threat models are still fragile against other threat models, i.e. they have poor generalization to unforeseen attacks. Moreover, even if a model is robust against the union of several restrictive threat models, it is still susceptible to other imperceptible adversarial examples that are not contained in any of the constituent threat models. To resolve these issues, we propose adversarial training against the set of all imperceptible adversarial examples. Since this set is intractable to compute without a human in the loop, we approximate it using deep neural networks. We call this threat model the neural perceptual threat model (NPTM); it includes adversarial examples with a bounded neural perceptual distance (a neural network-based approximation of the true perceptual distance) to natural images. Through an extensive perceptual study, we show that the neural perceptual distance correlates well with human judgements of perceptibility of adversarial examples, validating our threat model. Under the NPTM, we develop novel perceptual adversarial attacks and defenses. Because the NPTM is very broad, we find that Perceptual Adversarial Training (PAT) against a perceptual attack gives robustness against many other types of adversarial attacks. We test PAT on CIFAR-10 and ImageNet-100 against five diverse adversarial attacks: L 2 , L ∞ , spatial, recoloring, and JPEG. 
We find that PAT achieves state-of-the-art robustness against the union of these five attacks, more than doubling the accuracy of the next best model, without training against any of them. That is, PAT generalizes well to unforeseen perturbation types. This is vital in sensitive applications where a particular threat model cannot be assumed, and to the best of our knowledge, PAT is the first adversarial training defense with this property.

1. INTRODUCTION

Many modern machine learning algorithms are susceptible to adversarial examples: carefully crafted inputs designed to fool models into giving incorrect outputs (Biggio et al., 2013; Szegedy et al., 2014; Kurakin et al., 2016a; Xie et al., 2017). Much research has focused on increasing classifiers' robustness against adversarial attacks (Goodfellow et al., 2015; Madry et al., 2018; Zhang et al., 2019a). However, existing adversarial defenses for image classifiers generally consider simple threat models. An adversarial threat model defines a set of perturbations that may be made to an image in order to produce an adversarial example. Common threat models include $L_2$ and $L_\infty$ threat models, which constrain adversarial examples to be close to the original image in $L_2$ or $L_\infty$ distance. Some work has proposed additional threat models which allow spatial perturbations (Engstrom et al., 2017; Wong et al., 2019; Xiao et al., 2018), recoloring (Hosseini and Poovendran, 2018; Laidlaw and Feizi, 2019; Bhattad et al., 2019), and other modifications (Song et al., 2018; Zeng et al., 2019) of an image.

There are multiple issues with these unrealistically constrained adversarial threat models. First, hardening against one threat model assumes that an adversary will only attempt attacks within that threat model. Although a classifier may be trained to be robust against $L_\infty$ attacks, for instance, an attacker could easily generate a spatial attack to fool the classifier.

Figure 2: Area-proportional Venn diagram validating our threat model from Figure 1. Each ellipse indicates the set of ImageNet-100 examples vulnerable to one of three attacks: $L_2$, StAdv spatial (Xiao et al., 2018), and our neural perceptual attack (LPA, Section 4). Percentages indicate the proportion of test examples successfully attacked. Remarkably, the NPTM encompasses both other types of attacks and includes additional examples not vulnerable to either.

One possible solution is to train against multiple threat models simultaneously (Jordan et al., 2019; Laidlaw and Feizi, 2019; Maini et al., 2019; Tramer and Boneh, 2019). However, this generally results in lower robustness against any one of the threat models when compared to hardening against that threat model alone. Furthermore, not all possible threat models may be known at training time, and adversarial defenses do not usually generalize well to unforeseen threat models (Kang et al., 2019).

The ideal solution to these drawbacks would be a defense that is robust against a wide, unconstrained threat model. We differentiate between two such threat models. The unrestricted adversarial threat model (Brown et al., 2018) encompasses any adversarial example that is labeled as one class by a classifier but a different class by humans. On the other hand, we define the perceptual adversarial threat model as including all perturbations of natural images that are imperceptible to a human. Most existing narrow threat models such as $L_2$, $L_\infty$, etc. are near subsets of the perceptual threat model (Figure 1). Some other threat models, such as adversarial patch attacks (Brown et al., 2018), may perceptibly alter an image without changing its true class and as such are contained in the unrestricted adversarial threat model. In this work, we focus on the perceptual threat model.

The perceptual threat model can be formalized given the true perceptual distance $d^*(x_1, x_2)$ between images $x_1$ and $x_2$, defined as how different two images appear to humans. For some threshold $\epsilon^*$, which we call the perceptibility threshold, images $x$ and $x'$ are indistinguishable from one another as long as $d^*(x, x') \le \epsilon^*$. Note that in general $\epsilon^*$ may depend on the specific input. Then, the perceptual threat model for a natural input $x$ includes all adversarial examples $\tilde{x}$ which cause misclassification but are imperceptibly different from $x$, i.e. $d^*(x, \tilde{x}) \le \epsilon^*$.

The true perceptual distance $d^*(\cdot, \cdot)$, however, cannot be easily computed or optimized against. To solve this issue, we propose to use a neural perceptual distance, an approximation of the true perceptual distance between images using neural networks. Fortunately, many surrogate perceptual distances have been proposed in the computer vision literature, such as SSIM (Wang et al., 2004). Recently, Zhang et al. (2018) discovered that comparing the internal activations of a convolutional neural network when two different images are passed through provides a measure, Learned Perceptual Image Patch Similarity (LPIPS), that correlates well with human perception. We propose to use the LPIPS distance $d(\cdot, \cdot)$ in place of the true perceptual distance $d^*(\cdot, \cdot)$ to formalize the neural perceptual threat model (NPTM).

We present adversarial attacks and defenses for the proposed NPTM. Generating adversarial examples bounded by the neural perceptual distance is difficult compared to generating $L_p$ adversarial examples because of the complexity and non-convexity of the constraint. However, we develop two attacks for the NPTM, Perceptual Projected Gradient Descent (PPGD) and Lagrangian Perceptual Attack (LPA) (see Section 4 for details). We find that LPA is by far the strongest adversarial attack at a given level of perceptibility (see Figure 4), reducing the most robust classifier studied to only 2.4% accuracy on ImageNet-100 (a subset of ImageNet) while remaining imperceptible. LPA also finds adversarial examples outside of any of the other threat models studied (see Figure 2). Thus, even if a model is robust to many narrow threat models ($L_p$, spatial, etc.), LPA can still cause serious errors. In addition to these attacks, which are suitable for evaluation of a classifier against the NPTM, we also develop Fast-LPA, a more efficient version of LPA that we use in Perceptual Adversarial Training (PAT).

Remarkably, using PAT to train a neural network classifier produces a single model with high robustness against a variety of imperceptible perturbations, including L ∞ , L 2 , spatial, recoloring, and JPEG attacks, on CIFAR-10 and ImageNet-100 (Tables 2 and 3 ). For example, PAT on ImageNet-100 gives 32.5% accuracy against the union of these five attacks, whereas L ∞ and L 2 adversarial training give 0.5% and 12.3% accuracy, respectively (Table 1 ). PAT achieves more than double the accuracy against this union of five threat models despite not explicitly training against any of them. Thus, it generalizes well to unseen threat models. 

2. RELATED WORK

Adversarial robustness. Adversarial robustness has been studied extensively for $L_2$ or $L_\infty$ threat models (Goodfellow et al., 2015; Carlini and Wagner, 2017; Madry et al., 2018) and non-$L_p$ threat models such as spatial perturbations (Engstrom et al., 2017; Xiao et al., 2018; Wong et al., 2019), recoloring of an image (Hosseini and Poovendran, 2018; Laidlaw and Feizi, 2019; Bhattad et al., 2019), and perturbations in the frequency domain (Kang et al., 2019). The most popular known adversarial defense for these threat models is adversarial training (Kurakin et al., 2016b; Madry et al., 2018; Zhang et al., 2019a), where a neural network is trained to minimize the worst-case loss in a region around the input. Recent evaluation methodologies such as Unforeseen Attack Robustness (UAR) (Kang et al., 2019) and the Unrestricted Adversarial Examples challenge (Brown et al., 2018) have raised the problem of finding an adversarial defense which gives good robustness under more general threat models. Sharif et al. (2018) conduct a perceptual study showing that $L_p$ threat models are a poor approximation of the perceptual threat model. Dunn et al. (2020) and Xu et al. (2020) have developed adversarial attacks that manipulate higher-level, semantic features. Jin and Rinard (2020) train with a manifold regularization term, which gives some robustness to unseen perturbation types. Stutz et al. (2020) also propose a method which gives robustness against unseen perturbation types, but it requires rejecting (abstaining on) some inputs.

Perceptual similarity. Two basic similarity measures for images are the $L_2$ distance and the Peak Signal-to-Noise Ratio (PSNR). However, these similarity measures disagree with human vision on perturbations such as blurring and spatial transformations, which has motivated other measures, including SSIM (Wang et al., 2004), MS-SSIM (Wang et al., 2003), CW-SSIM (Sampat et al., 2009), HDR-VDP-2 (Mantiuk et al., 2011), and LPIPS (Zhang et al., 2018). MAD competition (Wang and Simoncelli, 2008) uses a constrained optimization technique related to our attacks to evaluate perceptual measures.

Perceptual adversarial robustness. Although LPIPS was previously proposed, it has mostly been used for development and evaluation of generative models (Huang et al., 2018; Karras et al., 2019). Jordan et al. (2019) first explored quantifying adversarial distortions with LPIPS distance. However, to the best of our knowledge, we are the first to apply a more accurate perceptual distance to the problem of improving adversarial robustness. As we show, adversarial defenses based on $L_2$ or $L_\infty$ attacks are unable to generalize to a more diverse threat model. Our method, PAT, is the first adversarial training method we know of that can generalize to unforeseen threat models without rejecting inputs.

[Figure panel titles: Original, Self-bd. LPA, External-bd. LPA.]

3. NEURAL PERCEPTUAL THREAT MODEL (NPTM)

Since the true perceptual distance between images cannot be efficiently computed, we use approximations of it based on neural networks, i.e. neural perceptual distances. In this paper, we focus on the LPIPS distance (Zhang et al., 2018), although other neural perceptual distances can also be used in our attacks and defenses.

Let $g : \mathcal{X} \to \mathcal{Y}$ be a convolutional image classifier network defined on images $x \in \mathcal{X}$. Let $g$ have $L$ layers, and let the internal activations (outputs) of the $l$-th layer of $g(x)$ for an input $x$ be denoted as $g_l(x)$. Zhang et al. (2018) have found that normalizing and then comparing the internal activations of convolutional neural networks correlates well with human similarity judgements. Thus, the first step in calculating the LPIPS distance using the network $g(\cdot)$ is to normalize the internal activations across the channel dimension such that the $L_2$ norm over channels at each pixel is one. Let $\hat{g}_l(x)$ denote these channel-normalized activations at the $l$-th layer of the network. Next, the activations are normalized again by layer size and flattened into a single vector

$$\phi(x) \triangleq \left( \frac{\hat{g}_1(x)}{\sqrt{w_1 h_1}}, \ldots, \frac{\hat{g}_L(x)}{\sqrt{w_L h_L}} \right),$$

where $w_l$ and $h_l$ are the width and height of the activations in layer $l$, respectively. The function $\phi : \mathcal{X} \to \mathcal{A}$ thus maps the inputs $x \in \mathcal{X}$ of the classifier $g(\cdot)$ to the resulting normalized, flattened internal activations $\phi(x) \in \mathcal{A}$, where $\mathcal{A} \subseteq \mathbb{R}^m$ refers to the space of all possible resulting activations. The LPIPS distance $d(x_1, x_2)$ between images $x_1$ and $x_2$ is then defined as

$$d(x_1, x_2) \triangleq \| \phi(x_1) - \phi(x_2) \|_2. \tag{1}$$

In the original LPIPS implementation, Zhang et al. (2018) learn weights to apply to the normalized activations based on a dataset of human perceptual judgements. However, they find that LPIPS is a good surrogate for human vision even without the additional learned weights; this is the version we use since it avoids the need to collect such a dataset.

Now let $f : \mathcal{X} \to \mathcal{Y}$ be a classifier which maps inputs $x \in \mathcal{X}$ to labels $f(x) \in \mathcal{Y}$. $f(\cdot)$ could be the same as $g(\cdot)$, or it could be a different network; we experiment with both. For a given natural input $x$ with the true label $y$, a neural perceptual adversarial example with a perceptibility bound $\epsilon$ is an input $\tilde{x} \in \mathcal{X}$ that is perceptually similar to $x$ but causes $f$ to misclassify:

$$f(\tilde{x}) \ne y \quad \text{and} \quad d(x, \tilde{x}) = \| \phi(x) - \phi(\tilde{x}) \|_2 \le \epsilon. \tag{2}$$
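The definition of the unweighted LPIPS distance above can be illustrated with a minimal sketch in plain Python. It assumes the per-layer activations have already been extracted from some network and are passed in as nested lists indexed `[channel][row][col]`; the function names (`channel_normalize`, `phi`, `lpips_distance`) and the small constant `eps` guarding against division by zero are illustrative choices, not part of the original method description.

```python
import math

def channel_normalize(act, eps=1e-10):
    """Scale each pixel's channel vector to unit L2 norm.

    `act` is a nested list indexed as act[channel][row][col].
    `eps` is an assumed small constant that avoids division by zero.
    """
    channels, height, width = len(act), len(act[0]), len(act[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for i in range(height):
        for j in range(width):
            norm = math.sqrt(sum(act[c][i][j] ** 2 for c in range(channels)))
            for c in range(channels):
                out[c][i][j] = act[c][i][j] / (norm + eps)
    return out

def phi(layer_acts):
    """Flatten channel-normalized activations, scaling layer l by 1/sqrt(w_l * h_l)."""
    flat = []
    for act in layer_acts:
        height, width = len(act[0]), len(act[0][0])
        scale = 1.0 / math.sqrt(width * height)
        for channel in channel_normalize(act):
            for row in channel:
                flat.extend(v * scale for v in row)
    return flat

def lpips_distance(acts_1, acts_2):
    """L2 distance between the two flattened activation vectors, as in (1)."""
    diffs = zip(phi(acts_1), phi(acts_2))
    return math.sqrt(sum((a - b) ** 2 for a, b in diffs))

# Toy usage: one layer, two channels, a single 1x1 pixel.
acts_a = [[[[1.0]], [[0.0]]]]
acts_b = [[[[0.0]], [[1.0]]]]
print(lpips_distance(acts_a, acts_b))
```

In this toy case the two activation vectors are orthogonal unit vectors after channel normalization, so the distance is approximately $\sqrt{2}$. Rescaling all channels at a pixel by a common positive factor leaves the distance essentially unchanged, which is the purpose of the channel normalization step.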

4. PERCEPTUAL ADVERSARIAL ATTACKS

We propose attack methods which attempt to find an adversarial example with small perceptual distortion. Developing adversarial attacks under the proposed neural perceptual threat model is more difficult than under standard $L_p$ threat models, because the LPIPS distance constraint is more complex than $L_p$ constraints. In general, we find an adversarial example that satisfies (2) by maximizing a loss function $\mathcal{L}$ within the LPIPS bound. The loss function we use is similar to the margin loss from Carlini and Wagner (2017), defined as

$$\mathcal{L}(f(x), y) \triangleq \max_{i \ne y} \big( z_i(x) - z_y(x) \big),$$

where $z_i(x)$ is the $i$-th logit output of the classifier $f(\cdot)$. This gives the constrained optimization

$$\max_{\tilde{x}} \; \mathcal{L}(f(\tilde{x}), y) \quad \text{subject to} \quad d(x, \tilde{x}) = \| \phi(x) - \phi(\tilde{x}) \|_2 \le \epsilon. \tag{3}$$

Note that in this attack problem, the classifier network $f(\cdot)$ and the LPIPS network $g(\cdot)$ are fixed. These two networks could be identical, in which case the same network that is being attacked is used to calculate the LPIPS distance that bounds the attack; we call this a self-bounded attack. If a different network is used to calculate the LPIPS bound, we call it an externally-bounded attack. Based on this formulation, we propose two perceptual attack methods, Perceptual Projected Gradient Descent (PPGD) and Lagrangian Perceptual Attack (LPA). See Figures 3 and 7 for sample results.

Perceptual Projected Gradient Descent (PPGD). The first of our two attacks is analogous to the PGD (Madry et al., 2018) attacks used for $L_p$ threat models. In general, these attacks consist of iteratively performing two steps on the current adversarial example candidate: (a) taking a step of a certain size under the given distance that maximizes a first-order approximation of the misclassification loss, and (b) projecting back onto the feasible set of the threat model. Identifying the ideal first-order step is easy in $L_2$ and $L_\infty$ threat models; it is the gradient of the loss function and the sign of the gradient, respectively.
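As a point of reference for step (a) in the $L_p$ setting, the following is a minimal sketch of the two first-order steps. It assumes the common convention that the $L_2$ step is the gradient rescaled to have norm equal to the step size; the function names `l2_step` and `linf_step` are illustrative.

```python
import math

def l2_step(grad, step_size):
    """Step of L2 norm `step_size` along the gradient direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)  # no ascent direction available
    return [step_size * g / norm for g in grad]

def linf_step(grad, step_size):
    """Step of L-infinity norm `step_size`: step_size times the sign of the gradient."""
    return [step_size * ((g > 0) - (g < 0)) for g in grad]

print(l2_step([3.0, 4.0], 0.5))
print(linf_step([3.0, -4.0, 0.0], 0.5))
```

Both steps maximize the linearized loss over a ball of the given radius in the corresponding norm, which is the property the LPIPS-based step is designed to reproduce for a distance defined by a neural network.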
However, computing this step is not straightforward with the LPIPS distance, because the distance metric itself is defined by a neural network. Following (3), we desire to find a step $\delta$ to maximize $\mathcal{L}(f(x + \delta), y)$ such that $d(x + \delta, x) = \| \phi(x + \delta) - \phi(x) \|_2 \le \eta$, where $\eta$ is the step size. Let $\hat{f}(x) := \mathcal{L}(f(x), y)$ for an input $x \in \mathcal{X}$. Let $J$ be the Jacobian of $\phi(\cdot)$ at $x$ and $\nabla \hat{f}$ be the gradient of $\hat{f}(\cdot)$ at $x$. Then, we can approximate (3) using first-order Taylor approximations of $\phi$ and $\hat{f}$ as follows:

$$\max_{\delta} \; \hat{f}(x) + (\nabla \hat{f})^\top \delta \quad \text{subject to} \quad \| J \delta \|_2 \le \eta. \tag{4}$$

We show that this constrained optimization can be solved in closed form:

Lemma 1. Let $J^+$ denote the pseudoinverse of $J$. Then the solution to (4) is given by

$$\delta^* = \eta \, \frac{(J^\top J)^{-1} (\nabla \hat{f})}{\| (J^+)^\top (\nabla \hat{f}) \|_2}.$$

See Appendix A.1 for the proof. This solution is still difficult to compute efficiently, since calculating $J^+$ and inverting $J^\top J$ are computationally expensive. Thus, we approximately solve for $\delta^*$ using the conjugate gradient method; see Appendix A.1 for details.

Perceptual PGD consists of repeatedly finding the first-order optimal $\delta^*$ to add to the current adversarial example $\tilde{x}$ for a number of steps. Following each step, if the current adversarial example $\tilde{x}$ is outside the LPIPS bound, we project $\tilde{x}$ back onto the threat model such that $d(\tilde{x}, x) \le \epsilon$. The exact projection is again difficult due to the non-convexity of the feasible set. Thus, we solve it approximately with a technique based on Newton's method; see Algorithm 4 in the appendix.

Lagrangian Perceptual Attack (LPA). The second of our two attacks uses a Lagrangian relaxation of the attack problem (3) similar to that used by Carlini and Wagner (2017) for constructing $L_2$ and $L_\infty$ adversarial examples. We call this attack the Lagrangian Perceptual Attack (LPA). To derive the attack, we use the following Lagrangian relaxation of (3):

$$\max_{\tilde{x}} \; \mathcal{L}(f(\tilde{x}), y) - \lambda \max\big( 0, \| \phi(\tilde{x}) - \phi(x) \|_2 - \epsilon \big). \tag{5}$$

The perceptual constraint cost, multiplied by $\lambda$ in (5), is designed to be 0 as long as the adversarial example is within the allowed perceptual distance, i.e. $d(\tilde{x}, x) \le \epsilon$; once $d(\tilde{x}, x) > \epsilon$, however, it increases linearly with the LPIPS distance from the original input $x$. Similar to the $L_p$ attacks of Carlini and Wagner (2017), we adaptively change $\lambda$ to find an adversarial example within the allowed perceptual distance; see Appendix A.2 for details.
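The behavior of the relaxed objective can be seen in a small sketch that evaluates the margin loss and the hinge-style perceptual penalty for given logits and a given LPIPS distance. The function names and the numeric values are illustrative assumptions; the adaptive schedule for $\lambda$ is not shown here.

```python
def margin_loss(logits, label):
    """Largest non-true logit minus the true-class logit (positive means misclassified)."""
    best_other = max(z for i, z in enumerate(logits) if i != label)
    return best_other - logits[label]

def lpa_objective(logits, label, distance, bound, lam):
    """Margin loss minus lam times the amount by which the distance exceeds the bound."""
    penalty = max(0.0, distance - bound)
    return margin_loss(logits, label) - lam * penalty

logits = [2.0, 5.0, 1.0]  # hypothetical classifier outputs for an adversarial candidate
print(lpa_objective(logits, 0, distance=0.3, bound=0.5, lam=10.0))  # inside the bound
print(lpa_objective(logits, 0, distance=0.7, bound=0.5, lam=10.0))  # outside the bound
```

Inside the bound the objective equals the margin loss, so the attack is free to increase the loss. Outside the bound, a larger $\lambda$ makes leaving the feasible set more costly, which is why $\lambda$ controls the trade-off between attack strength and perceptibility.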

5. PERCEPTUAL ADVERSARIAL TRAINING (PAT)

The developed perceptual attacks can be used to harden a classifier against a variety of adversarial attacks. The intuition, which we verify in Section 7, is that if a model is robust against neural perceptual attacks, it can demonstrate enhanced robustness against other types of unforeseen adversarial attacks. Inspired by adversarial training used to robustify models against $L_p$ attacks, we propose a method called Perceptual Adversarial Training (PAT).

Suppose we wish to train a classifier $f(\cdot)$ over a distribution of inputs and labels $(x, y) \sim \mathcal{D}$ such that it is robust to the perceptual threat model with bound $\epsilon$. Let $\mathcal{L}_{ce}$ denote the cross entropy (negative log likelihood) loss and suppose the classifier $f(\cdot)$ is parameterized by $\theta_f$. Then, PAT consists of optimizing $f(\cdot)$ in a manner analogous to $L_p$ adversarial training (Madry et al., 2018):

$$\min_{\theta_f} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{d(\tilde{x}, x) \le \epsilon} \mathcal{L}_{ce}(f(\tilde{x}), y) \right]. \tag{6}$$

The training formulation attempts to minimize the worst-case loss within a neighborhood of each training point $x$. In PAT, the neighborhood is bounded by the LPIPS distance. Recall that the LPIPS distance is itself defined based on a particular neural network classifier. We denote the normalized, flattened activations of the network used to define LPIPS by $\phi(\cdot)$ and its parameters by $\theta_\phi$. We explore two variants of PAT, differentiated by the choice of the network used to define $\phi(\cdot)$. In externally-bounded PAT, a separate, pretrained network is used to calculate $\phi(\cdot)$ and hence the LPIPS distance $d(\cdot, \cdot)$. In self-bounded PAT, the same network which is being trained for classification is used to calculate the LPIPS distance, i.e. $\theta_\phi \subseteq \theta_f$. Note that in self-bounded PAT the definition of the LPIPS distance changes during training as the classifier is optimized.

The inner maximization in (6) is intractable to compute exactly.
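For intuition only, the min-max structure of (6) can be sketched on a one-dimensional toy problem. Everything here is an assumption made for illustration: the feature map `phi` is a stand-in (`tanh`) rather than network activations, the classifier is a single-parameter logistic model, and the inner maximization is approximated by a grid search over the feasible set rather than by a perceptual attack.

```python
import math

def phi(x):
    """Stand-in feature map (assumption); LPIPS would use network activations."""
    return math.tanh(x)

def ce_loss(theta, x, y):
    """Binary cross-entropy for labels y in {-1, +1} with logit theta * x."""
    return math.log1p(math.exp(-y * theta * x))

def worst_case_input(theta, x, y, bound, grid=41, radius=2.0):
    """Approximate the inner maximization by grid search over feasible inputs."""
    candidates = [x + radius * (2 * k / (grid - 1) - 1) for k in range(grid)]
    feasible = [c for c in candidates if abs(phi(c) - phi(x)) <= bound]
    return max(feasible, key=lambda c: ce_loss(theta, c, y))

def pat_epoch(theta, data, bound, lr=0.1):
    """One pass of adversarial training: attack each point, then take a gradient step."""
    for x, y in data:
        x_adv = worst_case_input(theta, x, y, bound)
        # Gradient of log(1 + exp(-y * theta * x_adv)) with respect to theta.
        grad = -y * x_adv / (1.0 + math.exp(y * theta * x_adv))
        theta -= lr * grad
    return theta

data = [(1.0, 1), (-1.0, -1)]
theta = 0.0
for _ in range(20):
    theta = pat_epoch(theta, data, bound=0.2)
print(theta)
```

The outer loop lowers the loss at the worst-case feasible point rather than at the clean input, which is the defining property of adversarial training. The grid always contains the clean input itself, so the feasible set is never empty.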

6. PERCEPTUAL EVALUATION

We conduct a thorough perceptual evaluation of our NPTM and attacks to ensure that the resulting adversarial examples are imperceptible. We also compare the perceptibility of perceptual adversarial attacks to five narrow threat models: $L_\infty$ and $L_2$ attacks, JPEG attacks (Kang et al., 2019), spatially transformed adversarial examples (StAdv) (Xiao et al., 2018), and functional adversarial attacks (ReColorAdv) (Laidlaw and Feizi, 2019). The comparison allows us to determine whether the LPIPS distance is a good surrogate for human comparisons of similarity. It also allows us to set bounds across threat models with approximately the same level of perceptibility.

To determine how perceptible a particular threat model is at a particular bound (e.g. $L_\infty$ attacks at $\epsilon = 8/255$), we perform an experiment based on just noticeable differences (JND). We show pairs of images to participants on Amazon Mechanical Turk (AMT), an online crowdsourcing platform. In each pair, one image is a natural image from ImageNet-100 and one image is an adversarial perturbation of the natural image, generated using the particular attack against a classifier hardened to that attack. One of the images, chosen randomly, is shown for one second, followed by a blank screen for 250ms, followed by the second image for one second. Then, participants must choose whether they believe the images are the same or different. This procedure is identical to that used by Zhang et al. (2018) to originally validate the LPIPS distance. We report the proportion of pairs for which participants report the images are "different" as the perceptibility of the attack. In addition to adversarial example pairs, we also include sentinel image pairs which are exactly the same; only 4.1% of these were annotated as "different."

We collect about 1,000 annotations of image pairs for each of 3 bounds for all five threat models, plus our PPGD and LPA attacks (14k annotations total for 2.8k image pairs).
The three bounds for each attack are labeled as small, medium, and large; bounds with the same label have similar perceptibility across threat models (see Appendix D Table 4 ). The dataset of image pairs and associated annotations is available for use by the community. To determine if the LPIPS threat model is a good surrogate for the perceptual threat model, we use various classifiers to calculate the LPIPS distance d(•, •) between the pairs of images used in the perceptual study. For each classifier, we determine the correlation between the mean LPIPS distance it assigns to image pairs from each attack and the perceptibility of that attack (Figure 4c ). We find that AlexNet (Krizhevsky et al., 2012) , trained normally on ImageNet (Russakovsky et al., 2015) , correlates best with human perception of these adversarial examples (r = 0.94); this agrees with Zhang et al. (2018) who also find that AlexNet-based LPIPS correlates best with human perception (Figure 4 ). A normally trained ResNet-50 correlates similarly, but not quite as well. Because AlexNet is the best proxy for human judgements of perceptual distance, we use it for all externally-bounded evaluation attacks. Note that even with an untrained network at initialization, the LPIPS distance correlates with human perception better than the L 2 distance. This means that even during the first few epochs of self-bounded PAT, the training adversarial examples are perceptually-aligned. We use the results of the perceptual study to investigate which attacks are strongest at a particular level of perceptibility. We evaluate each attack on a classifier hardened against that attack via adversarial training, and plot the resulting success rate against the proportion of correct annotations from the perceptual study. Out of the narrow threat models, we find that L 2 attacks are the strongest for their perceptibility. 
However, our proposed PPGD and LPA attacks reduce a PAT-trained classifier to even lower accuracies (8.2% for PPGD and 0% for LPA), making LPA the strongest attack studied.

7. EXPERIMENTS

We compare Perceptual Adversarial Training (PAT) to adversarial training against narrow threat models (L p , spatial, etc.) on CIFAR-10 ( Krizhevsky and Hinton, 2009) and ImageNet-100 (the subset of ImageNet (Russakovsky et al., 2015) containing every tenth class by WordNet ID order). We find that PAT results in classifiers with robustness against a broad range of narrow threat models. We also show that our perceptual attacks, PPGD and LPA, are strong against adversarial training with narrow threat models. We evaluate with externally-bounded PPGD and LPA (Section 4), using AlexNet to determine the LPIPS bound because it correlates best with human judgements (Figure 4c ). For L 2 and L ∞ robustness evaluation we use AutoAttack (Croce and Hein, 2020) , which combines four strong attacks, including two PGD variants and a black box attack, to give reliable evaluation. Evaluation metrics For both datasets, we evaluate classifiers' robustness to a range of threat models using two summary metrics. First, we compute the union accuracy against all narrow threat models (L ∞ , L 2 , StAdv, ReColorAdv, and JPEG for ImageNet-100); this is the proportion of inputs for which a classifier is robust against all these attacks. Second, we compute the unseen mean accuracy, which is the mean of the accuracies against all the threat models not trained against; this measures how well robustness generalizes to other threat models.
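The two summary metrics can be made concrete with a short sketch. It assumes per-example robustness outcomes are available as lists of booleans (`True` meaning the classifier's prediction stays correct under that attack); the function names, attack keys, and toy outcomes below are illustrative and are not results from the experiments.

```python
def union_accuracy(robust):
    """Fraction of examples that remain correctly classified under every attack.

    `robust` maps an attack name to a list of booleans, one per test example.
    """
    survives_all = [all(outcomes) for outcomes in zip(*robust.values())]
    return sum(survives_all) / len(survives_all)

def unseen_mean_accuracy(robust, trained_against):
    """Mean of per-attack accuracies over the attacks not used during training."""
    unseen = [name for name in robust if name not in trained_against]
    accuracies = [sum(robust[name]) / len(robust[name]) for name in unseen]
    return sum(accuracies) / len(accuracies)

# Toy outcomes for four test examples (made-up values for illustration).
robust = {
    "linf": [True, True, False, True],
    "l2": [True, False, False, True],
    "stadv": [True, True, True, False],
}
print(union_accuracy(robust))
print(unseen_mean_accuracy(robust, trained_against={"linf"}))
```

In this toy case only the first example survives all three attacks, so the union accuracy is 0.25, while the unseen mean accuracy for a model trained against the `linf` attack averages the `l2` and `stadv` accuracies (0.5 and 0.75) to give 0.625. The union accuracy can never exceed the lowest per-attack accuracy, which is why it is the stricter of the two metrics.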

CIFAR-10

We test ResNet-50s trained on the CIFAR-10 dataset with PAT and adversarial training (AT) against six attacks (see Table 2): $L_\infty$ and $L_2$ AutoAttack, StAdv (Xiao et al., 2018), ReColorAdv (Laidlaw and Feizi, 2019), and PPGD and LPA. This allows us to determine if PAT gives robustness against a range of adversarial attacks. We experiment with using various models to calculate the LPIPS distance during PAT. We try using the same model both for classification and to calculate the LPIPS distance (self-bounded PAT). We also use AlexNet trained on CIFAR-10 prior to PAT (externally-bounded PAT). We find that PAT outperforms $L_p$ adversarial training and TRADES (Zhang et al., 2019a), improving the union accuracy from <5% to >20%, and nearly doubling mean accuracy against unseen threat models from 26% to 49%. Surprisingly, we find that PAT even outperforms threat-specific AT against StAdv and ReColorAdv; see Appendix F.5 for more details. Ablation studies of PAT are presented in Appendix F.1.

ImageNet-100

We compare ResNet-50s trained on the ImageNet-100 dataset with PAT and adversarial training (Table 3 ). Classifiers are tested against seven attacks at the medium bound from the perceptual study (see Section 6 and Appendix Table 4 ). Self-and externally-bounded PAT give similar results. Both produce more than double the next highest union accuracy and also significantly increase the mean robustness against unseen threat models by around 15%. Perceptual attacks On both CIFAR-10 and ImageNet-100, we find that Perceptual PGD (PPGD) and Lagrangian Perceptual Attack (LPA) are the strongest attacks studied. LPA is the strongest, reducing the most robust classifier to 9.8% accuracy on CIFAR-10 and 2.4% accuracy on ImageNet-100. Also, models most robust to LPA in both cases are those that have the best union and unseen mean accuracies. This demonstrates the utility of evaluating against LPA as a proxy for adversarial robustness against a range of threat models. See Appendix E for further attack experiments. Comparison to other defenses against multiple attacks Besides the baseline of adversarially training against a single attack, we also compare PAT to adversarially training against multiple attacks (Tramer and Boneh, 2019; Maini et al., 2019) . We compare three methods for multiple-attack training: choosing a random attack at each training iteration, optimizing the average loss across all attacks, and optimizing the maximum loss across all attacks. The latter two methods are very expensive, increasing training time by a factor equal to the number of attacks trained against, so we only evaluate these methods on CIFAR-10. As in Tramer and Boneh (2019) , we find that the maximum loss strategy leads to the greatest union accuracy among the multiple-attack training methods. However, PAT performs even better on CIFAR-10, despite training against none of the attacks and taking one fourth of the time to train. 
The random strategy, which is the only feasible one on ImageNet-100, performs much worse than PAT. Even the best multiple-attack training strategies still fail to generalize to the unseen neural perceptual attacks, PPGD and LPA, achieving much lower accuracy than PAT. On CIFAR-10, we also compare PAT to manifold regularization (MR) (Jin and Rinard, 2020) , a non-adversarial training defense. MR gives union accuracy close to PAT-self, but much lower clean accuracy; for PAT-AlexNet, which gives similar clean accuracy to MR, the union accuracy is much higher. Threat model overlap In Figure 2 , we investigate how the sets of images vulnerable to L 2 , spatial, and perceptual attacks overlap. Nearly all adversarial examples vulnerable to L 2 or spatial attacks are also vulnerable to LPA. However, there is only partial overlap between the examples vulnerable to L 2 and spatial attacks. This helps explain why PAT results in improved robustness against spatial attacks (and other diverse threat models) compared to L 2 adversarial training. Why does PAT work better than L p adversarial training? In Figure 5 , we give further explanation of why PAT results in improved robustness against diverse threat models. We generate many adversarial examples for the L ∞ , L 2 , JPEG, StAdv, and ReColorAdv threat models and measure their distance from the corresponding natural inputs using L p distances and the neural perceptual distance, LPIPS. While L p distances vary widely, LPIPS gives remarkably comparable distances to different types of adversarial examples. Covering all threat models during L ∞ or L 2 adversarial training would require using a huge training bound, resulting in poor performance. In contrast, PAT can obtain robustness against all the narrow threat models at a reasonable training bound.

Robustness against common corruptions

In addition to evaluating PAT against adversarial examples, we also evaluate its robustness to random perturbations in the CIFAR-10-C and ImageNet-C datasets (Hendrycks and Dietterich, 2019). We find that PAT gives increased robustness (lower relative mCE) against these corruptions compared to adversarial training; see Appendix G for details.

8. CONCLUSION

We have presented attacks and defenses for the neural perceptual threat model (realized by the LPIPS distance) and shown that it closely approximates the true perceptual threat model, the set of all perturbations to natural inputs which fool a model but are imperceptible to humans. Our work provides a novel method for developing defenses against adversarial attacks that generalize to unforeseen threat models. Our proposed perceptual adversarial attacks and PAT could be extended to other vision algorithms, or even other domains such as audio and text.

APPENDIX A PERCEPTUAL ATTACK ALGORITHMS

A.1 PERCEPTUAL PGD Recall from Section 4 that Perceptual PGD (PPGD) consists of repeatedly applying two steps: a first-order step in LPIPS distance to maximize the loss, followed by a projection into the allowed set of inputs. Here, we focus on the first-order step; see Appendix A.4 for how we perform projection onto the LPIPS ball. We wish to solve the following constrained optimization for the step δ given the step size η and current input x: max δ L(f (x + δ), y) subject to Jδ 2 ≤ η (7) Let f (x) := L(f (x), y) for an input x ∈ X . Let J be the Jacobian of φ(•) at x and ∇ f be the gradient of f (•) at x. Lemma 1. The first-order approximation of ( 7) is max δ f (x) + (∇ f ) δ subject to Jδ 2 ≤ η, ( ) and can be solved in closed-form by δ * = η (J J) -1 (∇ f ) (J + ) (∇ f ) 2 . where J + is the pseudoinverse of J. Proof. We solve (8) using Lagrange multipliers. First, we take the gradient of the objective: ∇ δ f (x) + (∇ f ) δ = ∇ f We can rewrite the constraint by squaring both sides to obtain δ J Jδ -2 ≤ 0 Taking the gradient of the constraint gives ∇ δ δ J Jδ -2 = 2J Jδ Now, we set one gradient as a multiple of the other and solve for δ: J Jδ = λ(∇ f ) (9) δ = λ(J J) -1 (∇ f ) (10) Substituting into the constraint from (8) gives Jδ 2 = η Jλ(J J) -1 (∇ f ) 2 = η λ J(J J) -1 (∇ f ) 2 = η λ ((J J) -1 J ) (∇ f ) 2 = η λ (J + ) (∇ f ) 2 = η λ = η (J + ) (∇ f ) 2 We substitute this value of λ into (10) to obtain Solution with conjugate gradient method Calculating (11) directly is computationally intractable for most neural networks, since inverting J J and calculating the pseudoinverse of J are expensive. Instead, we approximate δ * by using the conjugate gradient method to solve the following linear system, based on (9): J Jδ = ∇ f (12) ∇ f is easy to calculate using backpropagation. The conjugate gradient method does not require calculating fully J J; instead, it only requires the ability to perform matrix-vector products J Jv for various vectors v. 
[Figure: the classifier network $f$ maps the original input $x \in \mathcal{X}$ and the adversarial example $\tilde{x} \in \mathcal{X}$ to predicted labels in $\mathcal{Y}$ (the example labels shown are "shopping basket" and "fiddler crab"); the LPIPS network $\phi$ maps them to normalized activations $\phi(x), \phi(\tilde{x}) \in \mathcal{A}$, with LPIPS distance $d(x, \tilde{x}) = \|\phi(x) - \phi(\tilde{x})\|_2$.]

We can approximate $Jv$ using finite differences given a small, positive value $h$:

$$Jv \approx \frac{\phi(x + hv) - \phi(x)}{h}$$

Then, we can calculate $J^\top J v$ by introducing an additional variable $u$ and using autograd, treating $Jv$ as a constant with respect to $u$:

$$\nabla_u \left[ \phi(x + u)^\top Jv \right]_{u=0} = \left[ \left( \frac{d\phi}{du}(x + u) \right)^\top Jv + \phi(x + u)^\top \frac{d}{du} Jv \right]_{u=0} = \left[ \left( \frac{d\phi}{du}(x + u) \right)^\top Jv + \phi(x + u)^\top 0 \right]_{u=0} = \left( \frac{d\phi}{du}(x) \right)^\top Jv = J^\top J v$$

This allows us to efficiently approximate the solution of (12) to obtain $(J^\top J)^{-1} \nabla \hat{f}$. We use 5 iterations of the conjugate gradient algorithm in practice. From there, it is easy to solve for $\lambda$, given that $(J^+)^\top \nabla \hat{f} = J (J^\top J)^{-1} \nabla \hat{f}$. Then, $\delta^*$ can be calculated via (10). See Algorithm 1 for the full attack.

Computational complexity. PPGD's running time scales with the number of steps $T$ and the number of conjugate gradient iterations $K$. It also depends on whether the attack is self-bounded (the same network is used for classification and the LPIPS distance) or externally-bounded (different networks are used). For each of the $T$ steps, $\phi(\tilde{x})$, $\nabla_{\tilde{x}} \mathcal{L}(f(\tilde{x}), y)$, and $\phi(\tilde{x} + h\delta_K)$ must be calculated once (lines 4 and 15 in Algorithm 1). This takes 2 forward passes and 1 backward pass for the self-bounded case, and 3 forward passes and 1 backward pass for the externally-bounded case. In addition, $J^\top J v$ needs to be calculated (in the MULTIPLYJACOBIAN routine) $K + 1$ times. Each calculation of $J^\top J v$ requires 1 forward and 1 backward pass, assuming $\phi(\tilde{x})$ is already calculated. Finally, the projection step takes $n + 1$ forward passes for $n$ iterations of the bisection method (see Section A.4). In all, the algorithm requires $T(K + n + 4)$ forward passes and $T(K + n + 3)$ backward passes in the self-bounded case.
In the externally-bounded case, it requires $T(K + n + 5)$ forward passes and the same number of backward passes.

Algorithm 1 Perceptual PGD (PPGD)
 1: procedure PPGD(classifier $f(\cdot)$, LPIPS network $\phi(\cdot)$, input $x$, label $y$, bound $\epsilon$, step $\eta$)
 2:   $\tilde{x} \leftarrow x + 0.01 * \mathcal{N}(0, 1)$   ▷ initialize perturbations with random Gaussian noise
 3:   for $t$ in $1, \ldots, T$ do   ▷ $T$ is the number of steps
 4:     $\nabla \hat{f} \leftarrow \nabla_{\tilde{x}} \mathcal{L}(f(\tilde{x}), y)$
 5:     $\delta_0 \leftarrow 0$
 6:     $r_0 \leftarrow \nabla \hat{f} - \text{MULTIPLYJACOBIAN}(\phi, \tilde{x}, \delta_0)$
 7:     $p_0 \leftarrow r_0$
 8:     for $k$ in $0, \ldots, K - 1$ do   ▷ conjugate gradient algorithm; we use $K = 5$ iterations
 9:       $\alpha_k \leftarrow \dfrac{r_k^\top r_k}{p_k^\top \, \text{MULTIPLYJACOBIAN}(\phi, \tilde{x}, p_k)}$
10:       $\delta_{k+1} \leftarrow \delta_k + \alpha_k p_k$
11:       $r_{k+1} \leftarrow r_k - \alpha_k \, \text{MULTIPLYJACOBIAN}(\phi, \tilde{x}, p_k)$
12:       $\beta_k \leftarrow \dfrac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}$
13:       $p_{k+1} \leftarrow r_{k+1} + \beta_k p_k$
14:     end for
15:     $m \leftarrow \|\phi(\tilde{x} + h\delta_K) - \phi(\tilde{x})\|_2 / h$   ▷ $m \approx \|J\delta_K\|_2$ for small $h$; we use $h = 10^{-3}$
16:     $\tilde{x} \leftarrow \tilde{x} + (\eta/m)\,\delta_K$   ▷ take a step of size $\eta$ in LPIPS distance
17:     $\tilde{x} \leftarrow \text{PROJECT}(d, \tilde{x}, x, \epsilon)$
18:   end for
19:   return $\tilde{x}$
20: end procedure
21:
22: procedure MULTIPLYJACOBIAN($\phi(\cdot)$, $\tilde{x}$, $v$)   ▷ calculates $J^\top J v$; $J$ is the Jacobian of $\phi$ at $\tilde{x}$
23:   $Jv \leftarrow (\phi(\tilde{x} + hv) - \phi(\tilde{x}))/h$   ▷ $h$ is a small positive value; we use $h = 10^{-3}$
24:   $J^\top J v \leftarrow \nabla_u \left[ \phi(\tilde{x} + u)^\top Jv \right]_{u=0}$
25:   return $J^\top J v$
26: end procedure

A.2 LAGRANGIAN PERCEPTUAL ATTACK (LPA)

Our second attack, Lagrangian Perceptual Attack (LPA), optimizes a Lagrangian relaxation of the perceptual attack problem (3):

$$\max_{\tilde{x}} \; \mathcal{L}(f(\tilde{x}), y) - \lambda \max\left(0, \|\phi(\tilde{x}) - \phi(x)\|_2 - \epsilon\right). \tag{13}$$

To optimize (13), we use a variation of gradient descent over $\tilde{x}$, starting at $x$ with a small amount of noise added. We perform our modified version of gradient descent for $T$ steps. We use a step size $\eta$, which begins at $\epsilon$ and decays exponentially to $\epsilon/10$.

At each step, we begin by taking the gradient of (13) with respect to $\tilde{x}$; let $\Delta$ refer to this gradient. Then, we normalize $\Delta$ to have $L_2$ norm 1, i.e. $\hat{\Delta} = \Delta / \|\Delta\|_2$. We wish to take a step in the direction of $\hat{\Delta}$ of size $\eta$ in LPIPS distance. If we wanted to take a step of size $\eta$ in $L_2$ distance, we could just take the step $\eta \hat{\Delta}$. However, taking a step of a particular size in LPIPS distance is harder. We assume that the LPIPS distance is approximately linear in the direction $\hat{\Delta}$. We can approximate the directional derivative of the LPIPS distance in the direction $\hat{\Delta}$ using finite differences:

$$\frac{d}{d\alpha} d(\tilde{x}, \tilde{x} + \alpha \hat{\Delta}) \approx \frac{d(\tilde{x}, \tilde{x} + h\hat{\Delta})}{h} = m.$$

Here, $h$ is a small positive value, and we assign the approximation of the directional derivative to $m$. Now, we can write the first-order Taylor expansion of the perceptual distance in the direction $\hat{\Delta}$ as follows:

$$d(\tilde{x}, \tilde{x} + \alpha \hat{\Delta}) \approx d(\tilde{x}, \tilde{x}) + m\alpha = m\alpha.$$

We want to take a step of size $\eta$. Plugging in and solving, we obtain

$$\eta = d(\tilde{x}, \tilde{x} + \alpha \hat{\Delta}) \approx m\alpha \quad \Longrightarrow \quad \alpha \approx \eta/m.$$

So the approximate step we should take is $(\eta/m)\hat{\Delta}$. We take this step at each of the $T$ iterations of our modified gradient descent method.

We begin with $\lambda = 10^{-2}$. After performing gradient descent, if $d(x, \tilde{x}) > \epsilon$ (i.e. the adversarial example is outside the constraint) we increase $\lambda$ by a factor of 10 and repeat the optimization.
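We repeat this entire process five times, meaning we search over $\lambda \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$. Finally, if the resulting adversarial example is still outside the constraint, we project it into the threat model; see Appendix A.4.

Computational complexity. LPA's running time scales with the number of iterations $S$ used to search for $\lambda$ as well as the number of gradient descent steps $T$. $\phi(x)$ may be calculated once during the entire attack, which speeds it up. Then, each step of gradient descent requires 2 forward and 1 backward passes in the self-bounded case, and 3 forward and 2 backward passes in the externally-bounded case. The projection at the end of the attack requires $n + 1$ forward passes for $n$ iterations of the bisection method (see Section A.4). In total, the attack requires $2ST + n + 2$ forward passes and $ST + n + 2$ backward passes in the self-bounded case, and $3ST + n + 2$ forward passes and $2ST + n + 2$ backward passes in the externally-bounded case.

Algorithm 2 Lagrangian Perceptual Attack (LPA)
procedure LPA(classifier network $f(\cdot)$, LPIPS distance $d(\cdot, \cdot)$, input $x$, label $y$, bound $\epsilon$)
  $\lambda \leftarrow 0.01$
  $\tilde{x} \leftarrow x + 0.01 * \mathcal{N}(0, 1)$   ▷ initialize perturbations with random Gaussian noise
  for $i$ in $1, \ldots, S$ do   ▷ we use $S = 5$ iterations to search for the best value of $\lambda$
    for $t$ in $1, \ldots, T$ do   ▷ $T$ is the number of steps
      $\Delta \leftarrow \nabla_{\tilde{x}} \left[ \mathcal{L}(f(\tilde{x}), y) - \lambda \max(0, d(\tilde{x}, x) - \epsilon) \right]$   ▷ take the gradient of (5)
      $\hat{\Delta} \leftarrow \Delta / \|\Delta\|_2$   ▷ normalize the gradient
      $\eta \leftarrow \epsilon \, (0.1)^{t/T}$   ▷ the step size $\eta$ decays exponentially
      $m \leftarrow d(\tilde{x}, \tilde{x} + h\hat{\Delta})/h$   ▷ $m \approx$ derivative of $d(\tilde{x}, \cdot)$ in the direction of $\hat{\Delta}$; $h = 0.1$
      $\tilde{x} \leftarrow \tilde{x} + (\eta/m)\hat{\Delta}$   ▷ take a step of size $\eta$ in LPIPS distance
    end for
    if $d(\tilde{x}, x) > \epsilon$ then
      $\lambda \leftarrow 10\lambda$   ▷ increase $\lambda$ if the attack goes outside the bound
    end if
  end for
  $\tilde{x} \leftarrow \text{PROJECT}(d, \tilde{x}, x, \epsilon)$
  return $\tilde{x}$
end procedure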
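The step-size calibration used inside LPA (estimate the directional derivative $m$ by finite differences, then move by $(\eta/m)\hat{\Delta}$) can be illustrated with a minimal sketch. This is not the authors' code: it assumes a hypothetical linear feature map `phi` in place of the LPIPS network, so that the distance is exactly linear along any direction and the resulting step has size $\eta$ up to rounding. With a real LPIPS network the linearity holds only approximately.

```python
import math

# Hypothetical linear "feature map" standing in for the LPIPS network phi.
A = [[3.0, 0.0], [0.0, 0.5]]


def phi(x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def lpips_like_distance(x1, x2):
    """Euclidean distance between features, mirroring d(x1, x2) = ||phi(x1) - phi(x2)||_2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(phi(x1), phi(x2))))


def calibrated_step(x_adv, grad, eta, h=0.1):
    """Take a step of approximate size eta (in feature distance) along grad."""
    norm = math.sqrt(sum(g * g for g in grad))
    direction = [g / norm for g in grad]  # unit L2 norm
    probe = [xi + h * di for xi, di in zip(x_adv, direction)]
    m = lpips_like_distance(x_adv, probe) / h  # finite-difference directional derivative
    return [xi + (eta / m) * di for xi, di in zip(x_adv, direction)]


x_adv = [0.2, 0.7]
x_new = calibrated_step(x_adv, grad=[1.0, 2.0], eta=0.05)
step_in_feature_distance = lpips_like_distance(x_adv, x_new)
step_in_l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_adv, x_new)))
```

Here `step_in_feature_distance` equals the requested `eta`, while `step_in_l2` is different, which is the reason the step cannot simply be taken as $\eta\hat{\Delta}$.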

A.3 FAST LAGRANGIAN PERCEPTUAL ATTACK

We use the Fast Lagrangian Perceptual Attack (Fast-LPA) for Perceptual Adversarial Training (PAT, see Section 5). Fast-LPA is similar to LPA (Appendix A.2), with two major differences. First, Fast-LPA does not search over $\lambda$ values; instead, during the $T$ gradient descent steps, $\lambda$ is increased exponentially from 1 to 10. Second, we remove the projection step at the end of the attack. As a result, Fast-LPA may produce adversarial examples outside the threat model, so it cannot be used for evaluation, but it is fine for training.

Computational complexity. Fast-LPA's running time can be calculated similarly to LPA's (see Section A.2), except that $S = 1$ and there is no projection step. Let $T$ be the number of steps taken during the attack. Then Fast-LPA requires $2T + 1$ forward passes and $T + 1$ backward passes in the self-bounded case, and $3T + 1$ forward passes and $2T + 1$ backward passes in the externally-bounded case. In comparison, PGD with $T$ iterations requires $T$ forward passes and $T$ backward passes. Thus, self-bounded Fast-LPA is slightly slower, requiring $T + 1$ more forward passes and one more backward pass.

Algorithm 3 Fast Lagrangian Perceptual Attack (Fast-LPA)
procedure FASTLPA(classifier network $f(\cdot)$, LPIPS distance $d(\cdot, \cdot)$, input $x$, label $y$, bound $\epsilon$)
  $\tilde{x} \leftarrow x + 0.01 * \mathcal{N}(0, 1)$   ▷ initialize perturbations with random Gaussian noise
  for $t$ in $1, \ldots, T$ do   ▷ $T$ is the number of steps
    $\lambda \leftarrow 10^{t/T}$   ▷ $\lambda$ increases exponentially
    $\Delta \leftarrow \nabla_{\tilde{x}} \left[ \mathcal{L}(f(\tilde{x}), y) - \lambda \max(0, d(\tilde{x}, x) - \epsilon) \right]$   ▷ take the gradient of (5)
    $\hat{\Delta} \leftarrow \Delta / \|\Delta\|_2$   ▷ normalize the gradient
    $\eta \leftarrow \epsilon \, (0.1)^{t/T}$   ▷ the step size $\eta$ decays exponentially
    $m \leftarrow d(\tilde{x}, \tilde{x} + h\hat{\Delta})/h$   ▷ $m \approx$ derivative of $d(\tilde{x}, \cdot)$ in the direction of $\hat{\Delta}$; $h = 0.1$
    $\tilde{x} \leftarrow \tilde{x} + (\eta/m)\hat{\Delta}$   ▷ take a step of size $\eta$ in LPIPS distance
  end for
  return $\tilde{x}$
end procedure
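The two schedules that drive Fast-LPA can be tabulated with a short sketch. The step size follows the exponential decay $\eta = \epsilon\,(0.1)^{t/T}$; for $\lambda$, the text only states that it rises exponentially from 1 to 10, so the form $\lambda = 10^{t/T}$ used here is an assumption, as is the function name `fast_lpa_schedules`.

```python
def fast_lpa_schedules(epsilon, num_steps):
    """Return a list of (lambda, eta) pairs for steps t = 1, ..., num_steps."""
    schedule = []
    for t in range(1, num_steps + 1):
        frac = t / num_steps
        lam = 10.0 ** frac           # assumed form: rises exponentially, reaching 10 at t = T
        eta = epsilon * 0.1 ** frac  # decays exponentially, reaching epsilon / 10 at t = T
        schedule.append((lam, eta))
    return schedule


schedule = fast_lpa_schedules(epsilon=0.5, num_steps=10)
final_lambda, final_eta = schedule[-1]
```

Early steps therefore combine a weak penalty with large moves, and late steps combine a strong penalty with small moves, which pulls the example back toward the bound without a separate search over $\lambda$.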

A.4 PERCEPTUAL PROJECTION

We explored two methods of projecting adversarial examples into the LPIPS threat model. The method we use throughout the paper is based on Newton's method and is shown in Algorithm 4. However, we also experimented with the bisection root finding method, shown in Algorithm 5 (also see Appendix E).

In general, given an adversarial example $\tilde{x}$, original input $x$, and LPIPS bound $\epsilon$, we wish to find a projection $\tilde{x}'$ of $\tilde{x}$ such that $d(\tilde{x}', x) \le \epsilon$. Assume for this section that $d(\tilde{x}, x) > \epsilon$, i.e. the current adversarial example $\tilde{x}$ is outside the bound. If $d(\tilde{x}, x) \le \epsilon$, then we can just let $\tilde{x}' = \tilde{x}$ and be done.

Newton's method

The second projection method we explored uses the generalized Newton-Raphson method to attempt to find the closest projection $\tilde{x}'$ to the current adversarial example $\tilde{x}$ such that the projection is within the threat model, i.e. $d(\tilde{x}', x) \le \epsilon$. To find such a projection, we again define a function $r(\cdot)$ and look for its roots:

$$r(\tilde{x}') = d(\tilde{x}', x) - \epsilon.$$

If we can find a projection $\tilde{x}'$ close to $\tilde{x}$ such that $r(\tilde{x}') \le 0$, then this projection will be contained within the threat model, since

$$r(\tilde{x}') \le 0 \;\Rightarrow\; d(\tilde{x}', x) \le \epsilon.$$

To find such a root, we use the generalized Newton-Raphson method, an iterative algorithm. Beginning with $\tilde{x}'_0 = \tilde{x}$, we update $\tilde{x}'$ iteratively using the step

$$\tilde{x}'_{i+1} = \tilde{x}'_i - \left[ \nabla r(\tilde{x}'_i) \right]^+ \left( r(\tilde{x}'_i) + s \right),$$

where $A^+$ denotes the pseudoinverse of $A$, and $s$ is a small positive constant (the "overshoot"), which helps the algorithm converge. We continue this process until $r(\tilde{x}'_t) \le 0$, at which point the projection is complete.

This algorithm usually takes 2-3 steps to converge with $s = 10^{-2}$. Each step requires 1 forward and 1 backward pass to calculate $r(\tilde{x}'_t)$ and its gradient. The method also requires 1 forward pass at the beginning to calculate $\phi(x)$.

Adversarial attacks. Much of the initial work on adversarial robustness focused on perturbations to natural images which were bounded by the $L_2$ or $L_\infty$ distance (Carlini and Wagner, 2017; Goodfellow et al., 2015; Madry et al., 2018). However, recently the community has discovered many other types of perturbations that are imperceptible and can be optimized to fool a classifier, but are outside $L_p$ threat models. These include spatial perturbations using flow fields (Xiao et al., 2018), translation and rotation (Engstrom et al., 2017), and Wasserstein distance bounds (Wong et al., 2019). Attacks that manipulate the colors in images uniformly have also been proposed (Hosseini and Poovendran, 2018; Hosseini et al., 2017; Zhang et al., 2019b) and have been generalized into "functional adversarial attacks" by Laidlaw and Feizi (2019).

E PERCEPTUAL ATTACK EXPERIMENTS

We experiment with variations of the two validation attacks, PPGD and LPA, described in Section 4. As described in Appendix A.4, we developed two methods for projecting candidate adversarial examples into the LPIPS ball surrounding a natural input. We attack a single model using PPGD and LPA with both projection methods. We also compare self-bounded to externally-bounded attacks. We find that LPA tends to be more powerful than PPGD. Finally, we note that externally-bounded LPA is extremely powerful, reducing the accuracy of a PAT-trained classifier on ImageNet-100 to just 2.4%.

Aside from these experiments, we always use externally-bounded attacks with AlexNet for evaluation. AlexNet correlates with human perception of adversarial examples (Figure 6) and provides a standard measure of LPIPS distance; in contrast, self-bounded attacks by definition have varying bounds across evaluated models.

[...] for all the test threat models on CIFAR-10 ($L_2$, $L_\infty$, StAdv, and ReColorAdv). We find that the average LPIPS distance for all adversarial examples using AlexNet is 1.13; for a PAT-trained ResNet-50, it is 0.88. Because of this disparity, we use a lower training bound for self-bounded PAT ($\epsilon = 0.5$) than for AlexNet-bounded PAT ($\epsilon = 1$). However, this means that the average test attack has 76% greater LPIPS distance than the training attacks for self-bounded PAT, whereas the average test attack has only 13% greater LPIPS distance for AlexNet-bounded PAT. This explains why AlexNet-bounded PAT gives better robustness; it only has to generalize to slightly larger attacks on average.

We tried performing AlexNet-bounded PAT with a more comparable bound ($\epsilon = 0.7$) to self-bounded PAT. This gives the average test attack about 80% greater LPIPS distance than the training attacks, similar to self-bounded PAT. Table 8 shows that the results are more similar for self-bounded and AlexNet-bounded PAT with $\epsilon = 0.7$.

G COMMON CORRUPTIONS EVALUATION

We evaluate the robustness of PAT-trained models to common corruptions in addition to adversarial examples on CIFAR-10 and ImageNet-100. In particular, we test PAT-trained classifiers on CIFAR-10-C and ImageNet-100-C, where ImageNet-100-C is the 100-class subset of ImageNet-C formed by taking every tenth class (Hendrycks and Dietterich, 2019). These datasets are based on random corruptions of CIFAR-10 and ImageNet, respectively, using 15 perturbation types with 5 levels of severity. The perturbation types are split into four general categories: "noise," "blur," "weather," and "digital."

The metric we use to evaluate PAT against common corruptions is mean relative corruption error (relative mCE). The relative corruption error is defined by Hendrycks and Dietterich (2019); results are shown in Tables 10 and 11. PAT gives better robustness (lower relative mCE) against common corruptions on both CIFAR-10-C and ImageNet-100-C. The only category of perturbations where $L_2$ adversarial training outperforms PAT is "noise" on CIFAR-10-C, which makes sense because Gaussian and other types of noise are symmetrically distributed in an $L_2$ ball. For the other perturbation types and on ImageNet-100-C, PAT outperforms $L_2$ and $L_\infty$ adversarial training, indicating that robustness against a wider range of worst-case perturbations also gives robustness against a wider range of random perturbations.

Table 12: Hyperparameters for the adversarial training experiments on CIFAR-10 and ImageNet-100. For CIFAR-10, hyperparameters are similar to those used by Zhang et al. (2019a). For ImageNet-100, hyperparameters are similar to those used by Kang et al. (2019).

Calculating the LPIPS distance using a neural network classifier $g(\cdot)$ requires choosing layers whose normalized, flattened activations $\phi(\cdot)$ should be compared between images. For AlexNet and VGG-16, we use the same layers to calculate LPIPS distance as do Zhang et al. (2018). For AlexNet (Krizhevsky et al., 2012), we use the activations after each of the first five ReLU functions. For VGG-16 (Simonyan and Zisserman, 2014), we use the activations directly before the five max pooling layers. In ResNet-50, we use the outputs of the conv2_x, conv3_x, conv4_x, and conv5_x layers, as listed in Table 1 of He et al. (2016).



Code and data can be downloaded at https://github.com/cassidylaidlaw/perceptual-advex.



Figure 3: Adversarial examples generated using self-bounded and externally-bounded LPA perceptual adversarial attack (Section 4) with a large bound. Original images are shown in the left column and magnified differences from the original are shown to the right of the examples. See also Figure 7.

Figure 4: Results of the perceptual study described in Section 6 across five narrow threat models and our two perceptual attacks, each with three bounds. (a) The perceptibility of adversarial examples correlates well with the LPIPS distance (based on AlexNet) from the natural example. (b) The Lagrangian Perceptual Attack (LPA) and Perceptual PGD (PPGD) are strongest at a given perceptibility. Strength is the attack success rate against an adversarially trained classifier. (c) Correlation between the perceptibility of attacks and various distance measures: $L_2$, SSIM (Wang et al., 2004), and LPIPS (Zhang et al., 2018) calculated using various architectures, trained and at initialization.

We generate samples via different adversarial attacks using narrow threat models in ImageNet-100 and measure their distances from natural inputs using L p and LPIPS metrics. The distribution of distances for each metric and threat model is shown as a violin plot. (a-b) L p metrics assign vastly different distances across perturbation types, making it impossible to train against all of them using L p adversarial training. (c-d) LPIPS assigns similar distances to similarly perceptible attacks, so a single training method, PAT, can give good robustness across different threat models.

Figure 6: Creating an adversarial example in the LPIPS threat model.

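Algorithm 2 (LPA), inner-loop steps: $\eta \leftarrow \epsilon \, (0.1)^{t/T}$ (the step size $\eta$ decays exponentially); $m \leftarrow d(\tilde{x}, \tilde{x} + h\hat{\Delta})/h$ ($m \approx$ derivative of $d(\tilde{x}, \cdot)$ in the direction of $\hat{\Delta}$; $h = 0.1$); $\tilde{x} \leftarrow \tilde{x} + (\eta/m)\hat{\Delta}$ (take a step of size $\eta$ in LPIPS distance).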

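Algorithm 3 (Fast-LPA), inner-loop steps: $\Delta \leftarrow \nabla_{\tilde{x}} \left[ \mathcal{L}(f(\tilde{x}), y) - \lambda \max(0, d(\tilde{x}, x) - \epsilon) \right]$ (take the gradient of (5)); $\eta \leftarrow \epsilon \, (0.1)^{t/T}$ (the step size $\eta$ decays exponentially); $m \leftarrow d(\tilde{x}, \tilde{x} + h\hat{\Delta})/h$ ($m \approx$ derivative of $d(\tilde{x}, \cdot)$ in the direction of $\hat{\Delta}$; $h = 0.1$).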

Figure 7: Adversarial examples generated using self-bounded and externally-bounded PPGD and LPA perceptual adversarial attacks (Section 4) with a large bound. Original images are shown in the left column and magnified differences from the original are shown to the right of the examples.

Figure 8: Several classifiers trained with PAT, adversarial training, and normal training on CIFAR-10 and ImageNet-100 are plotted with their clean accuracy and accuracy against the union of narrow threat models (see Section 7 for robustness evaluation methodology). PAT models on both datasets outperform adversarially trained models in both clean and robust accuracy.

[...] where $E^f_{s,c}$ is the error of classifier $f$ against corruption type $c$ at severity level $s$, and $E^f_{\text{clean}}$ is the error of classifier $f$ on unperturbed inputs. The relative mCE is defined as the mean relative CE over all perturbation types. The relative mCE for classifiers trained with normal training, adversarial training, and PAT is shown in Tables 10 and 11.



Accuracies against various attacks for models trained with adversarial training and Perceptual Adversarial Training (PAT) variants on CIFAR-10. Attack bounds are 8/255 for $L_\infty$, 1 for $L_2$, 0.5 for PPGD/LPA (bounded with AlexNet), and the original bounds for StAdv/ReColorAdv. Manifold regularization is from Jin and Rinard (2020). See text for explanation of all terms.

Comparison of adversarial training against narrow threat models and Perceptual Adversarial Training (PAT) on ImageNet-100. Accuracies are shown against seven attacks with the medium bounds from Table 4. PAT greatly improves accuracy (33% vs 12%) against the union of the narrow threat models despite not training against any of them. See text for explanation of all terms.

Bounds and results from the perceptual study. Each threat model was evaluated with a small, medium, and large bound. Bounds for L 2 , L ∞ , and JPEG attacks (first three rows) are given assuming input image is in the range [0, 255]. Perceptibility (perc.) is the proportion of natural input-adversarial example pairs annotated as "different" by participants. Strength (str.) is the success rate when attacking a classifier adversarially trained against that threat model (higher is stronger). Perceptual attacks (PPGD and LPA, see Section 4) are externally bounded with AlexNet. All experiments on ImageNet-100.

Accuracy of a PAT-trained ResNet-50 on ImageNet-100 against various perceptual adversarial attacks. PPGD and LPA attacks are shown self-bounded and externally-bounded with AlexNet. We also experimented with two different perceptual projection methods (see Appendix A.4). Bounds are $\epsilon = 0.25$ for self-bounded attacks and $\epsilon = 0.5$ for externally-bounded attacks, since the LPIPS distance from AlexNet tends to be about twice as high as that from ResNet-50.

Accuracies against various attacks for classifiers on CIFAR-10 trained with self- and AlexNet-bounded PAT using various bounds.

Tsipras et al. (2019) have noted that there is often a tradeoff between the adversarial robustness of a classifier and its accuracy. That is, models which have higher accuracy under adversarial attack may have lower accuracy against clean images. We observe this phenomenon with adversarial training and PAT. Since PAT gives greater robustness against several narrow threat models, models trained with it tend to have lower accuracy on clean images than models trained with narrow adversarial training. In Figure 8, we show the robust and clean accuracies of several models trained on CIFAR-10 and ImageNet-100 with PAT and adversarial training. While some PAT models have lower clean accuracy than adversarially trained models, at least one PAT model on each dataset surpasses the Pareto frontier of the accuracy-robustness tradeoff for adversarial training. That is, there are PAT-trained models on both datasets with both higher robust accuracy and higher clean accuracy than adversarial training.

F.5 PERFORMANCE AGAINST STADV AND RECOLORADV

It was surprising to find that PAT outperformed threat-specific adversarial training (AT) against the StAdv and ReColorAdv attacks on CIFAR-10 (it does not do so on ImageNet-100). In Table 2 (partially reproduced in Table 9 below), PAT-AlexNet improves robustness over AT against StAdv from 54% to 65%; PAT-self improves robustness over AT against ReColorAdv from 65% to 71%.

We conjecture that, for these threat models, this is because training against a wider set of perturbations at training time helps generalize robustness to new inputs at test time, even within the same threat model. To test this, we additionally train classifiers using adversarial training against the StAdv and ReColorAdv attacks with double the default bound. The results are shown in Table 9 below. We find that, because these classifiers are exposed to a wider range of spatial and recoloring perturbations during training, they perform better than PAT against those attacks at test time (76% vs 65% for StAdv and 81% vs 71% for ReColorAdv). This suggests that PAT not only improves robustness against a wide range of adversarial threat models, it can actually improve robustness over threat-specific adversarial training by incorporating a wider range of attacks during training.

Results of our experiments training against StAdv and ReColorAdv on CIFAR-10 with double the default bounds. Columns are identical to Table 2.


ACKNOWLEDGMENTS

This project was supported in part by NSF CAREER AWARD 1942230, HR 00111990077, HR00112090132, HR001119S0026, NIST 60NANB20D134, AWS Machine Learning Research Award and Simons Fellowship on "Foundations of Deep Learning."

annex

Algorithm 4 Perceptual Projection (Newton's Method)
procedure PROJECT(LPIPS distance $d(\cdot, \cdot)$, adversarial example $\tilde{x}$, original input $x$, bound $\epsilon$)
  $\tilde{x}'_0 \leftarrow \tilde{x}$
  for $i$ in $0, \ldots$ do
    $r(\tilde{x}'_i) \leftarrow d(\tilde{x}'_i, x) - \epsilon$
    if $r(\tilde{x}'_i) \le 0$ then
      return $\tilde{x}'_i$
    end if
    $\tilde{x}'_{i+1} \leftarrow \tilde{x}'_i - \left[ \nabla r(\tilde{x}'_i) \right]^+ \left( r(\tilde{x}'_i) + s \right)$   ▷ $s$ is the "overshoot"; we use $s = 10^{-2}$
  end for
end procedure

Bisection method. The first projection method we explored (and the one we use throughout the paper) attempts to find a projection $\tilde{x}'$ along the line connecting the current adversarial example $\tilde{x}$ and original input $x$. Let $\delta = \tilde{x} - x$. Then we can represent our final projection $\tilde{x}'$ as a point between $x$ and $\tilde{x}$ as $\tilde{x}' = x + \alpha\delta$ for $\alpha \in [0, 1]$. [...] This function has the following properties: [...]

We use the bisection root finding method to find a root $\alpha^*$ of $r(\cdot)$ on the interval $[0, 1]$, which exists since $r(\cdot)$ is continuous and because of items 1 and 2 above. By item 3, at this root, the projected adversarial example is within the threat model: [...]

We use $n = 10$ iterations of the bisection method to calculate $\alpha^*$. This requires $n + 1$ forward passes through the LPIPS network, since $\phi(x)$ must be calculated once, and $\phi(x + \alpha\delta)$ must be calculated $n$ times. See Algorithm 5 for the full projection algorithm.

Algorithm 5 Perceptual Projection (Bisection Method)
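Both projection ideas can be illustrated with a minimal sketch under stated assumptions; it is not a reproduction of Algorithm 4 or Algorithm 5. The sketch replaces the LPIPS distance with a hypothetical weighted Euclidean distance (`toy_distance`) whose gradient is known in closed form, takes the bisection objective to be $r(\alpha) = d(x + \alpha\delta, x) - \epsilon$ (an assumption consistent with the Newton variant's $r$), and uses the fact that the pseudoinverse of a nonzero row vector $g^\top$ is $g / \|g\|_2^2$.

```python
import math

WEIGHTS = [4.0, 1.0]  # hypothetical per-coordinate weights for the toy distance


def toy_distance(a, b):
    """Stand-in for the LPIPS distance d(a, b): a weighted Euclidean distance."""
    return math.sqrt(sum(w * (ai - bi) ** 2 for w, ai, bi in zip(WEIGHTS, a, b)))


def toy_distance_grad(a, b):
    """Gradient of toy_distance(a, b) with respect to a (requires a != b)."""
    d = toy_distance(a, b)
    return [w * (ai - bi) / d for w, ai, bi in zip(WEIGHTS, a, b)]


def project_bisection(x, x_adv, epsilon, num_iters=10):
    """Shrink x_adv toward x along a straight line until it is inside the bound."""
    if toy_distance(x_adv, x) <= epsilon:
        return list(x_adv)
    delta = [a - b for a, b in zip(x_adv, x)]
    lo, hi = 0.0, 1.0  # invariant: r(lo) <= 0 < r(hi)
    for _ in range(num_iters):
        mid = (lo + hi) / 2.0
        point = [xi + mid * di for xi, di in zip(x, delta)]
        if toy_distance(point, x) > epsilon:
            hi = mid
        else:
            lo = mid
    return [xi + lo * di for xi, di in zip(x, delta)]  # lo is always inside the bound


def project_newton(x, x_adv, epsilon, overshoot=1e-2, max_iters=50):
    """Generalized Newton-Raphson steps on r(p) = d(p, x) - epsilon with an overshoot."""
    point = list(x_adv)
    for _ in range(max_iters):
        r = toy_distance(point, x) - epsilon
        if r <= 0:
            break
        grad = toy_distance_grad(point, x)
        scale = (r + overshoot) / sum(g * g for g in grad)  # pseudoinverse of a row vector
        point = [p - scale * g for p, g in zip(point, grad)]
    return point  # may still be outside the bound if max_iters is exhausted


x_clean = [0.0, 0.0]
x_far = [1.0, 1.0]
by_bisection = project_bisection(x_clean, x_far, epsilon=0.5)
by_newton = project_newton(x_clean, x_far, epsilon=0.5)
```

In this toy setting the bisection result stays on the segment between `x_clean` and `x_far`, while the Newton result generally leaves that segment because each step follows the gradient of the distance rather than the line back to the original input.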

B ADDITIONAL RELATED WORK

Here, we expand on the related work discussed in Section 2 and discuss some additional existing work on adversarial robustness.

F PAT EXPERIMENTS

F.1 ABLATION STUDY

We perform an ablation study of Perceptual Adversarial Training (PAT). First, we examine Fast-LPA, the training attack. We attempt training without step size ($\eta$) decay and/or without increasing $\lambda$ during Fast-LPA, and find that PAT performs best with both $\eta$ decay and $\lambda$ increase.

Training a classifier with PAT gives robustness against a wide range of adversarial threat models (see Section 7). However, it tends to give low accuracy against natural, unperturbed inputs. Thus, we use a technique from Balaji et al. (2019) to improve natural accuracy in PAT-trained models: at each training step, only inputs which are classified correctly without any perturbation are attacked. In addition to increasing natural accuracy, this also improves the speed of PAT since only some inputs from each batch must be attacked. In this ablation study, we compare attacking every input with Fast-LPA during training to only attacking the natural inputs which are already classified correctly. We find that the latter method achieves higher natural accuracy at the cost of some robust accuracy.

We choose not to add a projection step to the end of Fast-LPA during training because it slows down the attack, requiring many more passes through the network per training step. However, we tested self-bounded PAT with a projection step and found that it increased clean accuracy slightly but decreased robust accuracy significantly. We believe this is because not projecting increases the effective bound on the training attacks, leading to better robustness. To test this, we tried training without projection using a smaller bound ($\epsilon = 0.4$ instead of $\epsilon = 0.5$) and found the results closely matched the results when using projection at the larger bound. That is, PAT with projection at $\epsilon = 0.5$ is similar to PAT without projection at $\epsilon = 0.4$. These results are shown in Table 7.
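The selective-attack rule compared in this ablation (attack only the inputs that are classified correctly without perturbation) can be sketched as follows. This is an illustrative sketch, not the authors' training loop: the function names are hypothetical, the inputs are scalars, `toy_attack` is a placeholder for Fast-LPA, and keeping misclassified inputs in the batch unperturbed is an assumption about how the remaining inputs are handled.

```python
def correctly_classified(predictions, labels):
    """Boolean mask marking inputs whose clean prediction matches the label."""
    return [p == y for p, y in zip(predictions, labels)]


def perturb_batch(inputs, labels, predictions, attack):
    """Attack only correctly classified inputs; leave the rest unperturbed (assumption)."""
    mask = correctly_classified(predictions, labels)
    return [attack(x, y) if hit else x for x, y, hit in zip(inputs, labels, mask)]


def toy_attack(x, y):
    """Placeholder for the training attack: shifts the scalar input by a fixed amount."""
    return x + 0.1


inputs = [0.0, 1.0, 2.0]
labels = [0, 1, 1]
predictions = [0, 0, 1]  # the second input is misclassified before any attack
batch = perturb_batch(inputs, labels, predictions, toy_attack)
num_attacked = sum(correctly_classified(predictions, labels))
```

Only two of the three inputs are passed to the attack here, which is the source of the training speedup noted above.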

H EXPERIMENT DETAILS

For all experiments, we train ResNet-50 (He et al., 2016) with SGD for 100 epochs. We use 10 attack iterations for training and 200 for testing, except for PPGD and LPA, where we use 40 for testing since they are more expensive. Self-bounded PAT takes about 12 hours to train for CIFAR-10 on an Nvidia RTX 2080 Ti GPU, and about 5 days to train for ImageNet-100 on 4 GPUs. We implement PPGD, LPA, and PAT using PyTorch (Paszke et al., 2017).

We preprocess images after adversarial perturbation, but before classification, by standardizing them based on the mean and standard deviation of each channel for all images in the dataset. We use the default data augmentation techniques from the robustness library (Engstrom et al., 2019). The CIFAR-10 dataset can be obtained from https://www.cs.toronto.edu/~kriz/cifar.html. The ImageNet-100 dataset is a subset of the ImageNet Large Scale Visual Recognition Challenge (2012) (Russakovsky et al., 2015) including only every tenth class by WordNet ID order. It can be obtained from http://www.image-net.org/download-images.

