PERTURBATION TYPE CATEGORIZATION FOR MULTIPLE ℓ_p BOUNDED ADVERSARIAL ROBUSTNESS

Abstract

Despite recent advances in adversarial training based defenses, deep neural networks are still vulnerable to adversarial attacks outside the perturbation type they are trained to be robust against. Recent works have proposed defenses to improve the robustness of a single model against the union of multiple perturbation types. However, when evaluating the model against each individual attack, these methods still suffer significant trade-offs compared to the ones specifically trained to be robust against that perturbation type. In this work, we introduce the problem of categorizing adversarial examples based on their ℓ_p perturbation types. Based on our analysis, we propose PROTECTOR, a two-stage pipeline to improve the robustness against multiple perturbation types. Instead of training a single predictor, PROTECTOR first categorizes the perturbation type of the input, and then utilizes a predictor specifically trained against the predicted perturbation type to make the final prediction. We first theoretically show that adversarial examples created by different perturbation types constitute different distributions, which makes it possible to distinguish them. Further, we show that at test time the adversary faces a natural trade-off between fooling the perturbation type classifier and the succeeding predictor optimized with perturbation-specific adversarial training. This makes it challenging for an adversary to plant strong attacks against the whole pipeline. In addition, we demonstrate the realization of this trade-off in deep networks by adding random noise to the model input at test time, enabling enhanced robustness against strong adaptive attacks. Extensive experiments on MNIST and CIFAR-10 show that PROTECTOR outperforms prior adversarial training based defenses by over 5% when tested against the union of ℓ_1, ℓ_2, and ℓ_∞ attacks.

1. INTRODUCTION

There has been a long line of work studying the vulnerabilities of machine learning models to small changes in the input data. In particular, most existing works focus on ℓ_p bounded perturbations (Szegedy et al., 2013; Goodfellow et al., 2015). While the majority of prior work aims at achieving robustness against a single perturbation type (Madry et al., 2018; Kurakin et al., 2017; Tramèr et al., 2018; Dong et al., 2018; Zhang et al., 2019; Carmon et al., 2019), real-world deployment of machine learning models requires them to be robust against various imperceptible changes in the input, irrespective of the attack type. Prior work has shown that when models are trained to be robust against one perturbation type, such robustness typically does not transfer to attacks of a different type (Schott et al., 2018; Kang et al., 2019). As a result, recent works have proposed to develop models that are robust against the union of multiple perturbation types (Tramèr & Boneh, 2019; Maini et al., 2020). Specifically, these works consider adversaries limited by their ℓ_p distance from the original input for p ∈ {1, 2, ∞}. While these methods improve the overall robustness against multiple perturbation types, when evaluated against each individual perturbation type, the robustness of models trained by these methods is still considerably worse than that of models trained on a single perturbation type. Further, these methods are found to be sensitive to small changes in hyperparameters. In this work, we propose an alternative view that does not require a single predictor to be robust against a union of perturbation types. Instead, we propose to utilize a union of predictors to improve the overall robustness, where each predictor is specialized to defend against certain perturbation types. In particular, we introduce the problem of categorizing adversarial examples based on their perturbation types.
Based on this idea, we propose PROTECTOR, a two-stage pipeline that performs Perturbation Type Categorization for Robustness against multiple perturbations. Specifically, a perturbation type classifier first predicts the type of the attack. Then, among the second-level predictors, PROTECTOR selects the one that is the most robust to the predicted perturbation type to make the final prediction. We validate our approach from both theoretical and empirical aspects. First, we present a theoretical analysis to show that for benign samples with the same ground truth label, their distributions become highly distinct when perturbed with different types of perturbations, and thus can be separated. Further, we show that there exists a natural tension between attacking the top-level perturbation classifier and the second-level predictors: strong attacks against the second-level predictors make it easier for the perturbation classifier to predict the adversarial perturbation type, while fooling the perturbation classifier requires planting weaker (or less representative) attacks against the second-level predictors. As a result, even an imperfect perturbation classifier is sufficient to significantly improve the overall robustness of the model to multiple perturbation types. Empirically, we show that the perturbation type classifier generalizes well on classifying adversarial examples against different adversarially trained models. We then compare PROTECTOR to the state-of-the-art defenses against multiple perturbations on MNIST and CIFAR-10. PROTECTOR outperforms prior approaches by over 5% against the union of ℓ_1, ℓ_2, and ℓ_∞ attacks. While past work has focused on the worst-case metric against all attacks, on average prior methods suffer significant trade-offs against individual attacks. From the suite of 25 different attacks tested, the average improvement of PROTECTOR over all the attacks w.r.t. the state-of-the-art baseline defense is ∼15% on both MNIST and CIFAR-10.
In particular, by adding random noise to the model input at test time, we further increase the tension between attacking the top-level and second-level components, which brings additional improvement in robustness against adaptive attackers. Additionally, PROTECTOR provides a modular way to integrate and update defenses against a single perturbation type.

2. RELATED WORK

Adversarial examples. The realization of the existence of adversarial examples in deep neural networks has spurred active research on attack algorithms and defense proposals (Szegedy et al., 2013). Among different types of attacks (Madry et al., 2018; Hendrycks et al., 2019; Hendrycks & Dietterich, 2019; Bhattad et al., 2020), the most commonly studied ones constrain the adversarial perturbation within an ℓ_p region of radius ε_p around the original input. To improve model robustness in the presence of such adversaries, the majority of existing defenses utilize adversarial training (Goodfellow et al., 2015), which augments the training dataset with adversarial images. To date, different variants of the original adversarial training algorithm remain the most successful defenses against adversarial attacks (Carmon et al., 2019; Zhang et al., 2019; Wong et al., 2020; Rice et al., 2020). Other types of defenses include input transformation (Guo et al., 2018; Buckman et al., 2018) and network distillation (Papernot et al., 2016), but these were rendered ineffective under stronger adversaries (He et al., 2017; Carlini & Wagner, 2017a; Athalye et al., 2018; Tramer et al., 2020). Other works have explored the relation between randomizing the inputs and adversarial examples. Tabacof & Valle (2016) analyzed the change in adversarial robustness with varying levels of noise. Hu et al. (2019) evaluated the robustness of a data point to random noise to detect adversarial examples, whereas Cohen et al. (2019) utilized randomized smoothing for certified robustness to adversarial attacks.

Defenses against multiple perturbation types. Recent research has been drawn towards the goal of universal adversarial robustness. Since ℓ_p-norm bounded attacks are amongst the strongest attacks in the adversarial examples literature, defending against a union of such attacks is an important step towards this end goal. Schott et al. (2018); Kang et al.
(2019) showed that models trained for a given ℓ_p-norm bounded attack are not robust against attacks in a different ℓ_q region. Succeeding work has aimed at developing a single model that is robust against the union of multiple perturbation types. Schott et al. (2018) proposed the use of multiple variational autoencoders to achieve robustness to multiple ℓ_p attacks on the MNIST dataset. Tramèr & Boneh (2019) used simple aggregations of multiple adversaries to achieve non-trivial robust accuracy against the union of the ℓ_1, ℓ_2, and ℓ_∞ regions. Maini et al. (2020) proposed the MSD algorithm that takes gradient steps in the union of multiple ℓ_p regions to improve multiple-perturbation robustness. In a related line of work, Croce & Hein (2020a) proposed a method for provable robustness against all ℓ_p regions for p ≥ 1. Instead of presenting empirical results, they study the upper and lower bounds of certified robust test error on much smaller perturbation radii. Therefore, their work has a different focus, and is not directly comparable to the empirical defenses studied in our work.

Detection of adversarial examples. Multiple prior works have focused on detecting adversarial examples (Feinman et al., 2017; Lee et al., 2018; Ma et al., 2018; Cennamo et al., 2020; Fidel et al., 2019; Yin et al., 2019a;b). However, most of these defenses have been shown to be vulnerable in the presence of adaptive adversaries (Carlini & Wagner, 2017a; Tramer et al., 2020). In comparison, our work focuses on the more challenging problem of categorizing different perturbation types. However, we show that by establishing a trade-off between fooling the perturbation classifier and the individual ℓ_p-robust models, even an imperfect perturbation classifier is sufficient to make our pipeline robust.

3. PROTECTOR: PERTURBATION TYPE CATEGORIZATION FOR ROBUSTNESS

In this section, we discuss our proposed PROTECTOR approach, which performs perturbation type categorization to improve the model robustness against multiple perturbation types. We first illustrate the PROTECTOR pipeline in Figure 1, then discuss the details of each component. At a high level, PROTECTOR performs the classification task as a two-stage process. Given an input x, PROTECTOR first utilizes a perturbation classifier C_adv to predict its adversarial perturbation type. Then, based on the ℓ_p attack type predicted by C_adv, PROTECTOR uses the corresponding second-level predictor M_p to provide the final prediction, where M_p is specially trained to be robust against the ℓ_p attack. Formally, let f_θ be the PROTECTOR model; then the final prediction is:

f_θ(x) = M_p(x), where p = argmax C_adv(x)    (1)

Note that when the input is a benign image, it could be classified as any perturbation type by C_adv, since all second-level predictors should achieve a high test accuracy on benign images. As shown in Figure 1, although we consider the robustness against three attack types, i.e., ℓ_1, ℓ_2, and ℓ_∞ perturbations, unless otherwise specified, our perturbation classifier performs binary classification between {ℓ_1, ℓ_2} and ℓ_∞. As will be discussed in Section 6, using two second-level predictors achieves better overall robustness than using three. We hypothesize that compared to ℓ_∞ adversarial examples, ℓ_1 and ℓ_2 attacks are harder to separate, especially when facing an adaptive adversary which aims to attack the entire pipeline. To provide an intuitive illustration, we randomly sample 10K adversarial examples generated with PGD attacks on MNIST, and visualize the results of Principal Component Analysis (PCA) in Figure 2. We observe that the first two principal components for ℓ_1 and ℓ_2 adversarial examples largely overlap, while those for ℓ_∞ clearly come from a different distribution.
Note that this simple visualization by no means suggests that ℓ_1 and ℓ_2 adversarial examples are not separable; it merely serves as a motivation.
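The two-stage prediction rule in Equation 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `perturbation_classifier` and the `predictors` mapping are hypothetical stand-ins for C_adv and the specialized M_p models.

```python
def protector_predict(x, perturbation_classifier, predictors):
    """Two-stage prediction (Equation 1): pick the specialized model
    M_p for the perturbation type that C_adv assigns to the input."""
    scores = perturbation_classifier(x)   # e.g. {"l1l2": 0.2, "linf": 0.8}
    p = max(scores, key=scores.get)       # p = argmax C_adv(x)
    return predictors[p](x)               # final prediction made by M_p alone
```

Note that gradients do not flow from the final prediction back through C_adv here, which is precisely what complicates gradient-based adaptive attacks (Section 5.2).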

4. THEORETICAL ANALYSIS

In this section, we provide a theoretical justification of our PROTECTOR framework design. First, we formally illustrate the setup of robust classification against multiple ℓ_p perturbation types, where we consider models trained for a binary classification task. Based on this problem setting, in Theorem 1, we show the existence of a classifier that can separate adversarial examples belonging to different perturbation types. Moreover, in Theorem 2, we show that our PROTECTOR framework naturally offers a trade-off between fooling the perturbation classifier C_adv and the individual robust models M_p, making it extremely difficult for adversaries to stage attacks against the entire pipeline. Note that we focus on the simplified binary classification task for the convenience of theoretical analysis, but our PROTECTOR framework improves the robustness of models trained on real-world image classification benchmarks as well; we discuss the empirical examination in Section 6.

4.1. PROBLEM SETTING

Data distribution. We consider a dataset of inputs sampled from the union of two multivariate Gaussian distributions D, such that the input-label pairs (x, y) can be described as:

y ~u.a.r. {-1, +1}; x_0 ~ N(yα, σ²); x_1, ..., x_d ~i.i.d. N(yη, σ²)

where x = [x_0, x_1, ..., x_d] ∈ R^{d+1} and η = α/√d, such that the absolute value of the mean for any dimension is equal for inputs sampled from both the positive and the negative labels. This setting demonstrates the distinction between a feature x_0 that is strongly correlated with the input label, and d weakly correlated features that are (independently) normally distributed with mean yη and variance σ². For the purposes of this work, we assume that α/σ > 10 (x_0 is strongly correlated) and d > 100 (the remaining d features are weakly correlated, but together represent a strongly correlated feature). We adapt this problem setting from Ilyas et al. (2019), where they used a stochastic feature x_0 = y with probability p, as opposed to a normally distributed input feature as in our case. Our results hold in their setting as well. However, our setting better represents the true data distribution, where input features are seldom stochastically flipped. More discussion can be found in Appendix A.

Perturbation types. We focus our discussion on adversaries constrained within a fixed ℓ_p region of radius ε_p around the original input, for p ∈ S = {1, 2, ∞}. Such adversaries are frequently studied in existing work, primarily for finding the optimal first-order adversaries for different perturbation types. We define Δ_{p,ε} as the ℓ_p threat model of radius ε, and Δ_S = ∪_{p∈S} Δ_{p,ε_p}. For a model f_θ parametrized by θ, the objective of the adversary is to find the optimal perturbation δ*, such that:

δ* = argmax_{δ ∈ Δ_S} L(f_θ(x + δ), y)

where L(·, ·) is the cross-entropy loss.
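The data distribution above can be sampled directly; the sketch below is a plain NumPy rendering of the setup (the function name and default values of α, σ, and d are illustrative choices consistent with the stated assumptions α/σ > 10 and d > 100):

```python
import numpy as np

def sample_data(n, d=100, alpha=10.0, sigma=1.0, rng=None):
    """Sample (x, y) from the toy distribution of Section 4.1:
    y ~ uniform{-1, +1}, x_0 ~ N(y*alpha, sigma^2),
    x_1..x_d ~ N(y*eta, sigma^2) with eta = alpha / sqrt(d)."""
    rng = np.random.default_rng() if rng is None else rng
    eta = alpha / np.sqrt(d)
    y = rng.choice([-1, 1], size=n)                   # uniform labels
    mu = np.concatenate([[alpha], np.full(d, eta)])   # mean vector for y = +1
    x = y[:, None] * mu + sigma * rng.standard_normal((n, d + 1))
    return x, y
```

Here x_0 alone carries a strong signal, while the remaining d coordinates each carry a weak signal of total magnitude comparable to x_0.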
Based on the model design in Section 3, we focus on the separation of ℓ_1 and ℓ_∞ in the following theorems, but our proofs can also naturally be adapted to analyze the separability of other perturbation types.

4.2. SEPARABILITY OF ADVERSARIAL PERTURBATIONS

Consider a standard classifier M trained with the objective of correctly classifying the label of inputs x ∈ D. Since the original distribution of the input data for each label is known to us, we first aim to examine how adversaries confined within different perturbation regions modify the input. The goal of the adversary is to fool the label predictor M by finding the optimal perturbation δ_p for all p ∈ S. The theorem below shows that the distributions of adversarial inputs within different ℓ_p regions can be separated with a high accuracy.

Theorem 1 (Separability of perturbation types). Given a binary Gaussian classifier M trained on D, consider D_p^y to be the distribution of optimal adversarial inputs (for a class y) against M, within ℓ_p regions of radius ε_p, where ε_1 = α and ε_∞ = α/√d. The distributions D_p^y (p ∈ {1, ∞}) can be accurately separated by a binary Gaussian classifier C_adv with a misclassification probability P_e ≤ 10^{-24}.

The proof sketch is as follows. We first calculate the optimal weights of a binary Gaussian classifier M trained on D. Accordingly, for any input x ∈ D, we find the optimal adversarial perturbation δ_p for each p ∈ {1, ∞} against M. We discuss how these perturbed inputs x + δ_p also follow a normal distribution, with shifted means. Finally, for data points belonging to a given classification label, we show that C_adv is able to predict the correct perturbation type with a very low classification error. We present the formal proof in Appendix B.
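The intuition behind Theorem 1 can be made concrete for a linear classifier sign(w·x): the worst-case ℓ_∞ adversary moves every coordinate by ε against the label, whereas the worst-case ℓ_1 adversary spends its entire budget on the single coordinate with the largest |w_i|, so the two perturbed inputs have visibly different shapes. A minimal sketch under this standard first-order-adversary assumption (function names are illustrative):

```python
import numpy as np

def optimal_linf_attack(x, y, w, eps):
    """Worst-case l_inf perturbation against a linear classifier sign(w.x):
    shift every coordinate by eps in the direction that hurts label y."""
    return x - y * eps * np.sign(w)

def optimal_l1_attack(x, y, w, eps):
    """Worst-case l_1 perturbation: put the whole budget eps on the
    single coordinate with the largest |w_i|."""
    delta = np.zeros_like(x)
    i = np.argmax(np.abs(w))
    delta[i] = -y * eps * np.sign(w[i])
    return x + delta
```

The ℓ_∞ attack produces a dense, small-magnitude shift while the ℓ_1 attack produces a sparse, large-magnitude spike, which is exactly the distributional difference C_adv can exploit.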

4.3. ADVERSARIAL TRADE-OFF

In Section 4.2, we showed that the optimal perturbations corresponding to different perturbation types belong to distinct data distributions, and it is fairly easy to separate them using a simple classifier. However, in the white-box setting, the adversary has knowledge of both the perturbation classifier C_adv and the specialized robust models M_p at test time. Therefore, the adversary can adapt the attack to fool the entire pipeline, instead of individual models M_p alone. Note that there are some overlapping regions among different ℓ_p perturbation regions. For example, every adversary could set δ_p = 0 as a valid perturbation, and thus it is clearly not possible for the attack classifier C_adv to correctly classify the attack (for all p ∈ {1, 2, ∞}) in such a scenario. However, such a perturbation is not useful, because all the base models can correctly classify unperturbed inputs with a high probability. In the following theorem, we examine the robustness of our PROTECTOR pipeline in the presence of such strong dynamic adversaries.

Theorem 2 (Adversarial trade-off). Given a data distribution D, adversarially trained models M_{p,ε_p}, and an attack classifier C_adv that distinguishes perturbations of different ℓ_p attack types for p ∈ {1, ∞}, the probability of a successful attack by the worst-case adversary over the entire PROTECTOR pipeline is P_e < 0.01 for ε_1 = α + 2σ and ε_∞ = (α + 2σ)/√d.

Here, the worst-case adversary refers to an adaptive adversary that has full knowledge of the defense strategy, and makes the strongest adversarial decision given the perturbation constraints. In Appendix C.2, we discuss how ε_1 and ε_∞ are set so that the ℓ_1 and ℓ_∞ adversaries can fool the M_{∞,ε_∞} and M_{1,ε_1} models respectively with a high success rate. Our proof sketch is as follows. We first show that when trained on D, an adversarially robust model M_p can achieve robust accuracy of greater than 99% against the attack type it was trained for.
On the contrary, when subjected to attacks outside the trained perturbation region, such robust accuracy drops to under 2%. We then analyze the modified distributions of the inputs perturbed by different ℓ_p attacks. Based on this analysis, we construct a simple decision rule for the perturbation classifier C_adv. Finally, we compute the perturbation induced by the worst-case adversary. We show that there exists a trade-off between fooling the perturbation classifier C_adv (to allow the alternate M_{p,ε_p} model to make the final prediction for an ℓ_q attack, for p, q ∈ {1, ∞}, p ≠ q), and fooling the alternate M_{p,ε_p} model itself. Here, by "alternate" we mean that for an ℓ_q attack, the prediction is made by the M_{p,ε_p} model, where p, q ∈ {1, ∞} and p ≠ q. We provide an illustration of the trade-off in Figure 1b, and present the formal proof in Appendix C.

5. TRAINING AND INFERENCE

Having motivated PROTECTOR through a toy task in Section 4, we now scale the approach to deep neural networks for common image classification benchmarks. Specifically, following prior work on defending against multiple perturbation types, we evaluate on the MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky, 2012) datasets. We now discuss the training details, a strong adaptive white-box attack against PROTECTOR, and our inference procedure against such attacks.

5.1. TRAINING

To train our perturbation classifier C_adv, we create a dataset that includes adversarial examples of different perturbation types. Specifically, we perform ℓ_1, ℓ_2, and ℓ_∞ PGD attacks (Madry et al., 2018) against each of the two individual M_p models used in PROTECTOR. Thus the size of our dataset is 6 times that of the original MNIST and CIFAR-10 datasets respectively. For the MNIST dataset, we use the M_2 and M_∞ models in PROTECTOR, and we use the M_1 and M_∞ models for CIFAR-10. The choice is made based on the robustness of the {M_2, M_1} models against the {ℓ_1, ℓ_2} attacks respectively, as depicted in Table 2. As discussed in Section 3, to assign the ground truth label for training the perturbation classifier C_adv, we find that it is sufficient to assign the same label to ℓ_1 and ℓ_2 attacks. In other words, C_adv performs a binary classification between ℓ_1/ℓ_2 attacks and ℓ_∞ attacks. In contrast with prior defenses against multiple perturbation types (Tramèr & Boneh, 2019; Maini et al., 2020), which require adversarial training, we find that it is sufficient to train our PROTECTOR pipeline over a static dataset (constructed as mentioned above) to achieve high robustness. Therefore, the training of our perturbation classifier is fast and stable. Specifically, using a single P100 GPU, our perturbation classifier can be trained within 5 minutes on MNIST, and within an hour on CIFAR-10. On the other hand, training state-of-the-art models robust to a single perturbation type requires up to 2 days with the same amount of GPU power, and existing defenses against multiple perturbation types take thrice as long as the training time for a model robust to a single perturbation type. A key advantage of PROTECTOR's design is that it can build upon existing defenses against individual perturbation types. Specifically, we leverage the adversarially trained models developed in prior work (Zhang et al., 2019; Carmon et al., 2019) as the M_p models in our pipeline, and the CNN architecture of C_adv is also similar to that of a single M_p model. More details are deferred to Appendix D.
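The construction of C_adv's training set described above can be sketched as follows. This is a schematic, not the paper's code: `pgd_attack` is a hypothetical stand-in for the ℓ_p PGD attacks of Madry et al. (2018), and the default radii mirror the MNIST values used in the paper.

```python
def build_cadv_dataset(images, models, pgd_attack, eps=None):
    """Build the perturbation-classifier training set: attack every base
    model with every l_p PGD attack, so the dataset is 6x the original
    (2 models x 3 attack types). Label l_1/l_2 as class 0 and l_inf as
    class 1, the binary categorization of Section 3."""
    eps = eps or {"l1": 10.0, "l2": 2.0, "linf": 0.3}  # assumed MNIST radii
    xs, ys = [], []
    for model in models:                   # the two M_p models
        for p, e in eps.items():           # l_1, l_2, l_inf attacks
            xs.extend(pgd_attack(model, images, norm=p, eps=e))
            ys.extend([0 if p in ("l1", "l2") else 1] * len(images))
    return xs, ys
```

Because the dataset is static, C_adv is trained with ordinary supervised learning rather than adversarial training, which is what makes its training fast and stable.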

5.2. ADAPTIVE ATTACKS AGAINST THE PROTECTOR PIPELINE

To generate adversarial examples against PROTECTOR, the most straightforward approach is to optimize Equation 3 using existing attack algorithms. Since the final prediction of the pipeline only depends on a single M_p model, the pipeline does not allow gradient flow across the two levels, which makes it difficult for gradient-based adversaries to attack PROTECTOR. Therefore, besides this standard adaptive attack, in our evaluation we also consider a stronger adaptive adversary, which utilizes a combination of the predictions from all individual second-level M_p models, rather than only the prediction from the single M_p model with p = argmax C_adv(x). Specifically, we modify f_θ(x) in Equation 3 as follows:

c = softmax(C_adv(x)); f_θ(x) = Σ_{p∈S} c_p · M_p(x)    (4)

where c_p denotes the probability of the input x being classified as perturbation type p by C_adv. We also experiment with other strategies of aggregating the predictions of different components, e.g., tuning hyper-parameters to balance between attacking C_adv and each M_p model, but these alternative methods do not perform better. Note that Equation 4 is only used for the purpose of generating adversarial examples and performing gradient-based attack optimization. For consistency throughout the paper, we still use Equation 1 to compute the model prediction at inference (final forward propagation). We do not see any significant performance advantage for either choice at inference time, and briefly report a comparison on two attacks in Appendix H.4.

5.3. INFERENCE PROCEDURE AGAINST ADAPTIVE ADVERSARIES

Though training the perturbation classifier on a static dataset is sufficient to achieve robustness against existing attack approaches, we observe that the accuracy drops when PROTECTOR is presented with the stronger adaptive attacks discussed in Section 5.2. To improve the model robustness against such adversaries, we add random noise to the input before feeding it into PROTECTOR at test time. While Hu et al. (2019) suggest that adding random noise does not help defend against adversarial inputs, the trade-off described in Theorem 2 makes adversarial attacks against PROTECTOR, on the contrary, highly likely to fail when random noise is added. Intuitively, the trade-off between fooling the two stages of PROTECTOR confines the adversary to a very small region for generating successful adversarial attacks. Consider the illustrative example in Figure 3, where the input x with the true label y = 0 is subjected to an ℓ_∞ attack. We assume that the M_{∞,ε_∞} model is a perfect classifier for inputs within a fixed ℓ_∞ ball. The dotted line shows the decision boundary for the perturbation classifier C_adv, which correctly classifies inputs subjected to large ℓ_∞ perturbations as ℓ_∞ attacks (green), but can misclassify samples with smaller perturbations. When the adversary adds a large perturbation δ′, the prediction of M_1 for the resulting input x′ = x + δ′ becomes wrong, but the perturbation classifier also categorizes it as an ℓ_∞ attack, so the final prediction of PROTECTOR is still correct, since it is produced by the M_{∞,ε_∞} model instead. On the other hand, when the adversary adds a small perturbation δ to fool the perturbation classifier, adding a small amount of random noise can recover the correct prediction with a high probability. Note that every point on the boundary of the noise region (yellow circle) is correctly classified by the pipeline.
In this way, adding random noise exploits an adversarial trade-off for PROTECTOR to achieve a high accuracy against adversarial examples, in the absence of adversarial training. In our implementation, we sample random noise z ~ N(0, I) and add ẑ = ε_2 · z/‖z‖_2 to the model input.
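The test-time noise step can be sketched as follows (a minimal sketch; it assumes, per the reconstruction above, that the sampled Gaussian noise is rescaled to a fixed ℓ_2 magnitude before being added):

```python
import numpy as np

def add_inference_noise(x, eps2, rng=None):
    """Add isotropic noise of fixed l_2 norm before the forward pass
    (Section 5.3): z ~ N(0, I), z_hat = eps2 * z / ||z||_2."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(np.shape(x))
    return x + eps2 * z / np.linalg.norm(z)
```

Normalizing z to the sphere of radius eps2 makes the noise magnitude deterministic, so the perturbed adversarial input is pushed a known distance away from the narrow region where the adaptive attack succeeds.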

6. EXPERIMENTS

In this section, we present our experiments on MNIST and CIFAR-10 datasets. We will discuss the results for both the perturbation classifier C adv alone, and the entire PROTECTOR pipeline.

6.1. EXPERIMENTAL SETUP

Baselines. We compare PROTECTOR with the state-of-the-art defenses against multiple perturbation types, which consider the union of ℓ_1, ℓ_2, and ℓ_∞ adversaries (Tramèr & Boneh, 2019; Maini et al., 2020). For Tramèr & Boneh (2019), we compare two variants of adversarial training: (1) the MAX approach, where for each image, among different perturbation types, the adversarial sample that leads to the maximum increase of the model loss is augmented into the training set; and (2) the AVG approach, where adversarial examples for all perturbation types are included for training. We also evaluate the MSD algorithm proposed by Maini et al. (2020), which modifies the standard PGD attack to incorporate the union of multiple perturbation types within the steepest descent itself. In addition, we evaluate the M_1, M_2, and M_∞ models trained with ℓ_1, ℓ_2, and ℓ_∞ perturbations separately, as described in Appendix D.

Attack evaluation. We evaluate our methods with the strongest attacks in the adversarial learning literature, and with adaptive attacks specifically designed for PROTECTOR (Section 5.2). First, we utilize a comprehensive suite of both gradient-based and gradient-free attacks from the Foolbox library (Rauber et al., 2017). Further, we also evaluate our method against the AutoAttack library of Croce & Hein (2020c), which achieves the state-of-the-art adversarial error rates against multiple recently published models. In line with prior work (Tramèr & Boneh, 2019; Maini et al., 2020), the radii of the {ℓ_1, ℓ_2, ℓ_∞} perturbation regions are {10, 2, 0.3} for the MNIST dataset and {12, 0.5, 0.03} for the CIFAR-10 dataset. We present the full details of the attack algorithms in Appendix F. Following prior work (Tramèr & Boneh, 2019; Maini et al., 2020), for both MNIST and CIFAR-10, we evaluate the models on adversarial examples generated from the first 1000 images of the test set.
Our main evaluation metric is the accuracy on all attacks: for an input image, if any of the attack algorithms in our suite successfully fools the model, then the input is counted as a failure case.
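This worst-case metric can be sketched as follows (a minimal sketch; `model` and each attack in `attacks` are hypothetical callables standing in for the pipeline and the attack suite):

```python
def union_robust_accuracy(model, attacks, images, labels):
    """Worst-case ('all attacks') accuracy: an input counts as correct
    only if no attack in the suite fools the model on it."""
    correct = 0
    for x, y in zip(images, labels):
        if all(model(atk(model, x, y)) == y for atk in attacks):
            correct += 1
    return correct / len(images)
```

Note this is stricter than averaging per-attack accuracies: a single successful attack on an image is enough to count that image as a failure.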

6.2. EMPIRICAL PERTURBATION OVERLAP AND CHOICE OF ε_p

While we justify the choice of perturbation sizes in our theoretical proofs in Appendix B.4 and C.2, in this section we demonstrate the empirical agreement of the choices of perturbation sizes we make.

Table 1: Studying the empirical overlap of ℓ_p, ε_p attack perturbations in different ℓ_q, ε_q regions for (a) MNIST, (ε_1, ε_2, ε_∞) = (10, 2.0, 0.3); (b) CIFAR-10, (ε_1, ε_2, ε_∞) = (12, 0.5, 0.03). Each column represents the range (min to max) of the ℓ_q norm of perturbations generated using the ℓ_p PGD attack.

In particular, the accuracy is > 95% across all the individual test sets created. These results suggest two important findings that validate our results in Theorem 1: independent of (a) the model to be attacked, and (b) the algorithm for generating the optimal adversarial perturbation, the optimal adversarial images for a given ℓ_p region follow similar distributions. We present the full results in Appendix H.1.

6.4. RESULTS OF THE PROTECTOR PIPELINE

Overall results. In Table 2, we summarize the worst-case performance against all attacks within a given perturbation type for the MNIST and CIFAR-10 datasets. In particular, 'Ours' denotes the robustness of PROTECTOR against the adaptive attacks described in Section 5.2, and 'Ours*' denotes the robustness of PROTECTOR against standard attacks based on Equation 1. PROTECTOR retains a high accuracy on benign images, as opposed to past defenses that have to sacrifice benign accuracy for robustness to multiple perturbation types. In particular, the clean accuracy of PROTECTOR is over 6% higher than such baselines on CIFAR-10, and is similar to that of M_p models trained for a single perturbation type.

The effect of noise. As discussed in Section 5.3, though adding random noise is not required to defend against standard attacks, it is helpful in defending against the stronger adaptive adversary against our pipeline. Specifically, in Table 3, we present the results on adversarial examples generated by PGD-based algorithms, which are amongst the strongest gradient-based attacks in the literature. We observe a consistent improvement across all attacks, with accuracy increasing by up to 10%.

Different numbers of second-level M_p predictors. We also evaluate our PROTECTOR approach with three second-level predictors, i.e., M_1, M_2, and M_∞. As shown in Table 3, this alternative design considerably reduces the overall accuracy of the pipeline model. We hypothesize that this happens because the M_1 model is already reasonably robust against ℓ_2 attacks, as shown in Table 2b, while having both M_1 and M_2 models allows adaptive adversaries to find larger regions for fooling both C_adv and M_p, thus hurting the overall performance against adaptive adversaries.

7. CONCLUSION

In this work, we propose PROTECTOR, which performs perturbation type categorization towards achieving robustness against the union of multiple perturbation types. Based on a simplified problem setup, we theoretically demonstrate that adversarial inputs of different attack types naturally have different distributions and can be separated. We further establish the existence of a natural tension for any adversary trying to fool our model, between fooling the attack classifier and fooling the specialized robust predictors. Our empirical results on the MNIST and CIFAR-10 datasets complement our theoretical analysis. In particular, by posing an additional adversarial trade-off through the effect of random noise, our PROTECTOR pipeline outperforms existing defenses against multiple ℓp attacks by over 5%. Our work serves as a stepping stone towards the goal of universal adversarial robustness, by dissecting various adversarial objectives into individually solvable pieces and combining them via PROTECTOR. Our study opens up various exciting future directions, including the new problem of perturbation categorization, extending our approach to defend against attacks beyond ℓp adversarial examples, and defining sub-classes of perturbation types to further improve the overall adversarial robustness.

A PROBLEM SETTING: THEORETICAL ANALYSIS

In this section, we formally define the problem setting and motivate the distinctions made with respect to the problem studied by Ilyas et al. (2019). The classification problem consists of two tasks: (1) predicting the correct class label of an adversarially perturbed (or benign) image using an adversarially robust classifier M_p; and (2) predicting the type of adversarial perturbation that the input image was subjected to, using the attack classifier C_adv.

Setup. We consider inputs sampled from two multivariate Gaussian distributions, such that the input-label pairs (x, y) can be described as:

$$y \overset{u.a.r.}{\sim} \{-1, +1\}, \qquad x_0 \sim \mathcal{N}(y\alpha, \sigma^2), \qquad x_1, \ldots, x_d \overset{i.i.d.}{\sim} \mathcal{N}(y\eta, \sigma^2) \tag{5}$$

where the input $x \sim \mathcal{N}(y\mu, \Sigma) \in \mathbb{R}^{d+1}$; $\eta = \alpha/\sqrt{d}$ for some positive constant $\alpha$; $\mu = [\alpha, \eta, \ldots, \eta] \in \mathbb{R}_+^{d+1}$; and $\Sigma = \sigma^2 I \in \mathbb{R}_+^{(d+1)\times(d+1)}$. We can assume without loss of generality that the means of the two distributions have the same absolute value, since for any two distributions with means µ1, µ2 we can translate the origin to (µ1 + µ2)/2. This setting captures the distinction between an input feature x0 that is strongly correlated with the label and d weakly correlated features that are independently normally distributed with mean yη and variance σ². We adapt this setting from Ilyas et al. (2019), who used a stochastic feature x0 = y with probability p, as opposed to a normally distributed feature as in our case. All our findings hold in that setting as well; however, the chosen setting better represents a true data distribution, with some features strongly correlated with the label while others carry only a weak correlation.
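As a concrete illustration, the synthetic distribution above can be sampled as follows. This is a minimal numpy sketch; the function name and the default constants (α = 10, σ = 1, d = 100, chosen to satisfy the assumptions used later in Appendix C.3) are illustrative.

```python
import numpy as np

def sample_data(n, d=100, alpha=10.0, sigma=1.0, seed=0):
    """Draw n samples from the synthetic distribution of Appendix A:
    y ~ u.a.r. {-1, +1}; x_0 ~ N(y*alpha, sigma^2);
    x_1..x_d ~ N(y*eta, sigma^2) with eta = alpha / sqrt(d)."""
    rng = np.random.default_rng(seed)
    eta = alpha / np.sqrt(d)
    y = rng.choice([-1, 1], size=n)                   # label drawn u.a.r.
    mu = np.concatenate(([alpha], np.full(d, eta)))   # mu = [alpha, eta, ..., eta]
    x = y[:, None] * mu + sigma * rng.standard_normal((n, d + 1))
    return x, y
```

Note how x0 carries a much larger per-feature signal (α) than each weakly correlated feature (η = α/√d), while the d weak features together carry the same total signal.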

B SEPARABILITY OF PERTURBATION TYPES (THEOREM 1)

In this section, our goal is to evaluate whether the optimal perturbations confined within different ℓp balls have different distributions, and whether they are separable. We do so by developing a bound on the maximum error in classifying the perturbation types. The goal of the adversary is to fool a standard (non-robust) classifier M. C_adv aims to predict the perturbation type from the adversarial image alone, without access to the perturbation δ itself. First, in Appendix B.1 we define a binary Gaussian classifier trained on the given task. Given the weights of this classifier, we then derive the optimal adversarial perturbation for each of the ℓ1, ℓ2, ℓ∞ attack types in Appendix B.2. In Appendix B.3 we characterize the difference between the adversarial input distributions for different ℓp balls. Finally, we bound the error in classifying these adversarial input types in Appendix B.4 to conclude the proof of Theorem 1.

B.1 BINARY GAUSSIAN CLASSIFIER

We assume for the purposes of this work that we have enough input data to empirically estimate the parameters µ, σ of the input distribution. The multivariate Gaussian representing the input data is given by:

$$p(x \mid y = y_i) = \frac{1}{(2\pi)^{(d+1)/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - y_i\mu)^T \Sigma^{-1} (x - y_i\mu)\right), \quad \forall y_i \in \{-1, +1\} \tag{6}$$

We want to find $p(y = y_i \mid x)\ \forall y_i \in \{-1, +1\}$. From Bayesian decision theory, the optimal decision rule for separating the two distributions is given by:

$$p(y=1)\,p(x \mid y=1) \ \underset{y=-1}{\overset{y=1}{\gtrless}}\ p(y=-1)\,p(x \mid y=-1) \tag{7}$$

Therefore, for two Gaussian distributions $\mathcal{N}(\mu_1, \Sigma_1)$, $\mathcal{N}(\mu_2, \Sigma_2)$, the rule takes the form:

$$x^T A x - 2\,b^T x + c \ \underset{y=1}{\overset{y=-1}{\gtrless}}\ 0, \quad \text{where} \quad A = \Sigma_1^{-1} - \Sigma_2^{-1}, \quad b = \Sigma_1^{-1}\mu_1 - \Sigma_2^{-1}\mu_2,$$
$$c = \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2 + \log\frac{|\Sigma_1|}{|\Sigma_2|} - 2\log\frac{p(y=1)}{p(y=-1)} \tag{8}$$

Substituting (6) and (7) in (8), with $\mu_1 = \mu$, $\mu_2 = -\mu$ and $\Sigma_1 = \Sigma_2 = \sigma^2 I$, the quadratic term cancels and the optimal Bayesian decision rule for our problem reduces to:

$$x^T \mu \ \underset{y=-1}{\overset{y=1}{\gtrless}}\ 0 \tag{9}$$

which means that the label of the input can be predicted from the sign of $x^T \mu$ alone. We can therefore define the parameters $W \in \mathbb{R}^{d+1}$ of the optimal binary Gaussian classifier $M_W$, normalized such that $\|W\|_2 = 1$ (i.e., $W = \mu / \|\mu\|_2$, where $\|\mu\|_2 = \alpha\sqrt{2}$), as:

$$W_0 = \frac{1}{\sqrt{2}}, \qquad W_i = \frac{1}{\sqrt{2d}}\ \ \forall i \in \{1, \ldots, d\}, \qquad M_W(x) = x^T W \tag{10}$$

B.2 OPTIMAL ADVERSARIAL PERTURBATION AGAINST M_W

Now we calculate the optimal perturbation δ that an adversary adds to an input in order to fool our model. For this analysis, we only aim to fool a model trained on the standard classification objective as discussed in Section 4 (and not an adversarially robust model). The parameters of the model are defined in (10). The objective of any adversary δ ∈ ∆ is to maximize the loss of the label classifier M_W. We take the classification loss to be $\ell(x+\delta, y; M_W) = -y \cdot M_W(x+\delta) = -y\,(x+\delta)^T W$. The adversary seeks δ* such that:

$$\delta^* = \arg\max_{\delta \in \Delta} \ell(x + \delta, y; M_W) = \arg\max_{\delta \in \Delta} -y\,(x+\delta)^T W = \arg\max_{\delta \in \Delta} -y\,\delta^T W \tag{11}$$

We will now calculate the optimal perturbation in the ℓp balls for p ∈ {1, 2, ∞}.
For the following analyses, we restrict the perturbation region ∆ to the corresponding ℓp ball of radius ε_p. Since the objective is linear in δ, the optimal perturbation lies on the boundary of the respective ℓp ball, so the constraint can be rewritten as:

$$\delta^* = \arg\max_{\|\delta\|_p = \epsilon_p} -y\,\delta^T W \tag{12}$$

We use the following properties in the individual treatment of the ℓp balls:

$$\|\delta\|_p = \Big(\sum_i |\delta_i|^p\Big)^{1/p}, \qquad \partial_j \|\delta\|_p = \frac{1}{p}\Big(\sum_i |\delta_i|^p\Big)^{\frac{1}{p}-1} \cdot p\,|\delta_j|^{p-1}\,\mathrm{sgn}(\delta_j) = \left(\frac{|\delta_j|}{\|\delta\|_p}\right)^{p-1}\mathrm{sgn}(\delta_j) \tag{13}$$

p = 2. Using Lagrange multipliers to solve (12), we have:

$$\nabla_\delta\left(-y\,\delta^T W\right) = \lambda\,\nabla_\delta \|\delta\|_2 \quad \Longrightarrow \quad -y\,W = \frac{\lambda}{\|\delta\|_2}\,\delta \tag{14}$$

so δ is parallel to $-yW$. Normalizing to the boundary $\|\delta\|_2 = \epsilon_2$ and using $\|W\|_2 = 1$:

$$\delta_{\ell_2} = -y\,\epsilon_2 \frac{W}{\|W\|_2} = -y\,\epsilon_2\,W \tag{15}$$

p = ∞. Recall that the optimal perturbation is given by:

$$\delta^* = \arg\max_{\|\delta\|_\infty = \epsilon_\infty} -y\,\delta^T W = \arg\max_{\|\delta\|_\infty = \epsilon_\infty} -y \sum_{i=0}^{d} \delta_i W_i \tag{16}$$

Since $\|\delta\|_\infty = \epsilon_\infty$, we know that $\max_i |\delta_i| = \epsilon_\infty$, and (16) is maximized when $\delta_i = -y\,\epsilon_\infty\,\mathrm{sgn}(W_i)\ \forall i \in \{0, \ldots, d\}$. Further, since all entries of W are positive, the optimal perturbation is given by:

$$\delta_{\ell_\infty} = -y\,\epsilon_\infty \mathbf{1} \tag{17}$$

p = 1. We derive an analytical solution for the optimal perturbation $\delta_{\ell_1}$. Recall that the optimal perturbation is given by:

$$\delta^* = \arg\max_{\|\delta\|_1 = \epsilon_1} -y\,\delta_0 W_0 - y\sum_{i=1}^{d} \delta_i W_i \tag{18}$$

Since $\|\delta\|_1 = \epsilon_1$ and $W_0 > W_i\ \forall i \geq 1$, (18) is maximized by placing the entire budget on the coordinate with the largest weight:

$$\delta_{\ell_1}(0) = -y\,\epsilon_1, \qquad \delta_{\ell_1}(i) = 0\ \ \forall i \in \{1, \ldots, d\} \tag{19}$$

Combining the results. From the preceding discussion, it may be noted that the distribution of inputs within a given label shifts by a different amount δ depending on the perturbation type.
Moreover, if the mean and variance of the distribution of a given label are known (which implies that the corresponding true data label is also known), the optimal perturbation is independent of the input itself and depends only on the respective class statistics. (Note that the input is still important in order to identify the true class.)
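The closed-form perturbations above can be checked numerically: for a linear model, each closed form should achieve at least as much adversarial loss as any other feasible perturbation in the same ball. Below is a minimal numpy sketch; the function names are ours, not from the paper.

```python
import numpy as np

def optimal_perturbations(W, y, eps):
    """Closed-form maximizers of the linear adversarial loss -y * (delta @ W)
    over the eps-ball of each norm, as derived in Appendix B.2."""
    d1 = np.zeros_like(W)
    j = np.argmax(np.abs(W))                  # l1: whole budget on the largest |W_i|
    d1[j] = -y * eps * np.sign(W[j])
    d2 = -y * eps * W / np.linalg.norm(W)     # l2: along the direction of -yW
    dinf = -y * eps * np.sign(W)              # linf: sign pattern of -yW
    return {"l1": d1, "l2": d2, "linf": dinf}

def adv_loss(delta, W, y):
    """Adversarial objective -y * delta @ W from equation (12)."""
    return -y * delta @ W
```

A quick check is to sample random perturbations normalized onto each ball's boundary and verify that none of them beats the closed form.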

B.3 PERTURBATION CLASSIFICATION BY C adv

In this section, we aim to verify whether it is possible to accurately separate the optimal adversarial inputs crafted within different ℓp balls. For this discussion, we only consider the problem of classifying perturbation types into ℓ1 and ℓ∞, but the same analysis extends to any number of perturbation types. We consider the problem of predicting the correct attack label for inputs from the true class y = 1. Note that the original distribution is $X_{true} \sim \mathcal{N}(y\mu, \Sigma)$. Since the optimal perturbation $\delta_{\ell_p}$ is fixed for all inputs corresponding to a particular label, the distributions of perturbed inputs $X_{\ell_1}$ and $X_{\ell_\infty}$ under ℓ1 and ℓ∞ attacks respectively (for y = 1) are given by:

$$X_{\ell_1} \sim \mathcal{N}(\mu + \delta_{\ell_1}, \Sigma), \qquad X_{\ell_\infty} \sim \mathcal{N}(\mu + \delta_{\ell_\infty}, \Sigma) \tag{20}$$

We now evaluate the conditions under which we can separate the two Gaussian distributions with an acceptable worst-case error.

B.4 CALCULATING A BOUND ON THE ERROR

Classification error. A classification error occurs when a data vector x belongs to one class but falls in the decision region of the other class, i.e., the decision rule in (7) indicates the incorrect class (this can be understood through the existence of outliers):

$$P_e = \int P(\text{error} \mid x)\,p(x)\,dx = \int \min\left[p(y = \ell_1 \mid x)\,p(x),\ p(y = \ell_\infty \mid x)\,p(x)\right] dx \tag{21}$$

Perturbation size. We set the radius of the ℓ∞ ball to ε∞ = η and the radius of the ℓ1 ball to ε1 = α. We further extend the discussion about suitable perturbation sizes in Appendix C.2. These values ensure that the ℓ∞ adversary can make all the weakly correlated features meaningless by driving the expected value of the adversarial input below 0 ($\mathbb{E}[x_i + \delta_{\ell_\infty}(i)] < 0\ \forall i > 0$), while the ℓ1 adversary can make the strongly correlated feature x0 meaningless by driving its expected value below 0 ($\mathbb{E}[x_0 + \delta_{\ell_1}(0)] < 0$). However, neither of the two adversaries can flip all the features together.

Translating the axes. We can translate the axis of reference by $-\mu - \frac{\delta_{\ell_1} + \delta_{\ell_\infty}}{2}$ and define $\mu_{adv} = \frac{\delta_{\ell_1} - \delta_{\ell_\infty}}{2}$, such that:

$$X_{\ell_1} \sim \mathcal{N}(\mu_{adv}, \Sigma), \qquad X_{\ell_\infty} \sim \mathcal{N}(-\mu_{adv}, \Sigma) \tag{22}$$

We can once again combine this with the simplified Bayesian rule in (9) to obtain the classification rule:

$$x^T \mu_{adv} \ \underset{\ell_\infty}{\overset{\ell_1}{\gtrless}}\ 0 \tag{23}$$

Combining the optimal perturbation definitions in (17) and (19), we have $\mu_{adv} = \frac{\delta_{\ell_1} - \delta_{\ell_\infty}}{2} = \frac{1}{2}[-\epsilon_1 + \epsilon_\infty, \epsilon_\infty, \ldots, \epsilon_\infty]$. We further substitute $\epsilon_1 = \alpha$ and $\epsilon_\infty = \eta = \alpha/\sqrt{d}$. Notice that $\mu_{adv}(i) > 0\ \forall i > 0$. Without loss of generality, to simplify the further discussion we can flip the coordinate x0, since all dimensions are independent of each other. Therefore:

$$\mu_{adv} = \frac{\alpha}{2\sqrt{d}}\left[\sqrt{d} - 1,\ 1,\ \ldots,\ 1\right]$$
Consider a new variable $x_z$ such that:

$$x_z = x_0\left(1 - \frac{1}{\sqrt{d}}\right) + \frac{1}{\sqrt{d}}\sum_{i=1}^{d} x_i = \frac{2}{\alpha}\, x^T \mu_{adv} \tag{24}$$

Since each $x_i$ ($i \geq 0$) is independently distributed, the new feature $x_z \sim \mathcal{N}(\mu_z, \sigma_z^2)$, where

$$\mu_z = \alpha\left(1 - \frac{1}{\sqrt{d}}\right) + \frac{1}{\sqrt{d}}\sum_{i=1}^{d}\frac{\alpha}{\sqrt{d}} = 2\alpha - \frac{\alpha}{\sqrt{d}}, \qquad \sigma_z^2 = \sigma^2\left(\left(1 - \frac{1}{\sqrt{d}}\right)^2 + \sum_{i=1}^{d}\frac{1}{d}\right) = \sigma^2\left(2 + \frac{1}{d} - \frac{2}{\sqrt{d}}\right) \tag{25}$$

Therefore, the problem simplifies to calculating the probability that the meta-variable $x_z > 0$. For $\alpha/\sigma > 10$ and $d > 1$, the corresponding z-score satisfies $z > 10$, and from the z-table:

$$P_e \leq 10^{-24} \tag{26}$$

which suggests that the two distributions are significantly distinct and can easily be separated. This concludes the proof of Theorem 1.
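The separability argument above can be sanity-checked by Monte Carlo: sample from the two perturbed distributions and apply the linear rule of (23) after centering. Below is a sketch under the setup of Appendix A; the constants are illustrative choices satisfying α/σ ≥ 10.

```python
import numpy as np

def perturbation_type_error(n=100_000, d=100, alpha=10.0, sigma=1.0, seed=0):
    """Empirical error of the rule x @ mu_adv > 0 for separating
    l1- from linf-perturbed inputs of class y = +1 (Appendix B.3/B.4)."""
    rng = np.random.default_rng(seed)
    eta = alpha / np.sqrt(d)
    eps1, eps_inf = alpha, eta                      # budgets from Appendix B.4
    mu = np.concatenate(([alpha], np.full(d, eta)))
    d1 = np.zeros(d + 1); d1[0] = -eps1             # optimal l1 perturbation (y = +1)
    dinf = -eps_inf * np.ones(d + 1)                # optimal linf perturbation (y = +1)
    mu_adv = (d1 - dinf) / 2
    center = mu + (d1 + dinf) / 2                   # axis translation from Appendix B.4
    x1 = mu + d1 + sigma * rng.standard_normal((n, d + 1))
    xinf = mu + dinf + sigma * rng.standard_normal((n, d + 1))
    err1 = ((x1 - center) @ mu_adv <= 0).mean()     # X_l1 should project positive
    errinf = ((xinf - center) @ mu_adv > 0).mean()  # X_linf should project negative
    return (err1 + errinf) / 2
```

With these constants the two projected Gaussians are separated by several standard deviations, so the empirical error is essentially zero, in line with the bound on P_e.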

Note:

We can extend the same analysis to other ℓp balls as well, but we consider the case of ℓ1 and ℓ∞ for simplicity.

C ROBUSTNESS OF THE PROTECTOR PIPELINE (THEOREM 2)

In the previous section, we showed that it is indeed possible to distinguish between the distributions of inputs of a given class that were subjected to ℓ1 and ℓ∞ perturbations crafted against a standard classifier. We now develop a further understanding of the robustness of our two-stage pipeline in a dynamic attack setting with multiple labels to distinguish among. The first stage is a preliminary classifier C_adv that classifies the perturbation type, and the second stage consists of multiple models M_p, each specifically trained to be robust to perturbations of the input within the corresponding ℓp ball. First, in Appendix C.1, we calculate the optimal weights of a binary Gaussian classifier M_p trained on dataset D to be robust to adversaries within the ℓp ball, ∀p ∈ {1, ∞}. Based on the weights of the individual models, we fix the perturbation size ε_p to be only as large as is required to fool the alternate model with high probability. Here, by 'alternate' we mean that under an ℓq attack, the prediction should be made by the M_{p,ε_p} model, where p, q ∈ {1, ∞}, p ≠ q. In Appendix C.3 we calculate the robustness of the individual M_p models to ℓp adversaries, given the perturbation sizes ε_p defined in Appendix C.2. In Appendix C.4, we analyze the modified distributions of the perturbed inputs after different ℓp attacks; based on this analysis, we construct a simple decision rule for the perturbation classifier C_adv. Finally, in Appendix C.5 we determine the perturbation induced by the worst-case adversary that has complete knowledge of both C_adv and M_{p,ε_p} ∀p ∈ {1, ∞}. We show that there exists a trade-off between fooling the perturbation classifier (so that the alternate M_{p,ε_p} model makes the final prediction) and fooling the alternate M_{p,ε_p} model itself.

Perturbation size. We set the radius of the ℓ∞ ball to ε∞ = η + ζ∞ and the radius of the ℓ1 ball to ε1 = α + ζ1, where ζ_p are small positive constants that we calculate in Appendix C.2.
These values ensure that the ℓ∞ adversary can make all the weakly correlated features meaningless by driving the expected value of the adversarial input below 0 ($\mathbb{E}[x_i + \delta_{\ell_\infty}(i)] < 0\ \forall i > 0$), while the ℓ1 adversary can make the strongly correlated feature x0 meaningless by driving its expected value below 0 ($\mathbb{E}[x_0 + \delta_{\ell_1}(0)] < 0$). However, neither of the two adversaries can flip all the features together. The exact values of ζ_p determine the exact success probability of the attacks. We defer this calculation until we have computed the weights of the models M_p. For the following discussion, it may be assumed that ζ_p → 0 ∀p ∈ {1, ∞}.

C.1 BINARY GAUSSIAN CLASSIFIER M p

Extending the discussion in Appendix B.1, we now examine the learned weights of a binary Gaussian classifier M_p that is trained to be robust against perturbations within the corresponding ℓp ball of radius ε_p. The optimization problem for the classifier can be formulated as:

$$\min_W\ \mathbb{E}\left[-y\,x^T W\right] + \frac{1}{2}\lambda \|W\|_2^2 \tag{27}$$

where λ is tuned to make the ℓ2 norm of the optimal weights equal to one, $\|W^*\|_2 = 1$. Following the symmetry argument in Lemma D.1 of Tsipras et al. (2018), we have for the binary Gaussian classifier that:

$$W_i^* = W_j^* = W_M \quad \forall i, j \in \{1, \ldots, d\} \tag{28}$$

We deal with the cases p ∈ {∞, 1} in this section. For both cases, we consider existential solutions for the classifier M_p to simplify the discussion; this yields lower bounds on the performance of the optimal robust classifier. The robust objective under adversarial training can be written as:

$$\min_W \max_{\|\delta\|_p \leq \epsilon_p} \mathbb{E}\left[-y\left(W_0 (x_0 + \delta_0) + W_M \sum_{i=1}^{d}(x_i + \delta_i)\right)\right] + \frac{1}{2}\lambda\|W\|_2^2 \tag{29}$$

$$= \min_W\ -\left(W_0\,\alpha + d\, W_M \frac{\alpha}{\sqrt{d}}\right) + \frac{1}{2}\lambda\|W\|_2^2 + \max_{\|\delta\|_p \leq \epsilon_p} \mathbb{E}\left[-y\left(W_0 \delta_0 + W_M \sum_{i=1}^{d}\delta_i\right)\right]$$

Further, since the λ term only enforces $\|W^*\|_2 = 1$, we can simplify the optimization by substituting $W_0 = \sqrt{1 - d\,W_M^2}$:

$$\min_{W_M}\ -\left(\alpha\sqrt{1 - d\,W_M^2} + \sqrt{d}\,W_M\,\alpha\right) + \max_{\|\delta\|_p \leq \epsilon_p} \mathbb{E}\left[-y\left(\delta_0\sqrt{1 - d\,W_M^2} + W_M\sum_{i=1}^{d}\delta_i\right)\right] \tag{30}$$

p = ∞. As discussed in (17), the optimal ℓ∞ perturbation is $\delta_{\ell_\infty} = -y\,\epsilon_\infty \mathbf{1}$. The optimization then simplifies to:

$$\min_{W_M}\ (\epsilon_\infty - \alpha)\sqrt{1 - d\,W_M^2} + d\,W_M\left(\epsilon_\infty - \frac{\alpha}{\sqrt{d}}\right) \tag{31}$$

Recall that $\epsilon_\infty = \frac{\alpha}{\sqrt{d}} + \zeta_\infty$. To simplify the following discussion, we use the weights of a classifier trained to be robust against perturbations within the ℓ∞ ball of radius $\frac{\alpha}{\sqrt{d}}$. The optimal solution is then given by:

$$\lim_{\zeta_\infty \to 0} W_M = 0 \tag{32}$$

Therefore, the classifier weights are $W = [W_0, W_1, \ldots, W_d] = [1, 0, \ldots, 0]$. We also show later in Appendix C.3 that the model achieves greater than 99% accuracy against ℓ∞ adversaries for the chosen value of ζ∞.

p = 1

We consider an analytical solution for the optimal weights in this case. Recall from (19) that the optimal perturbation $\delta_{\ell_1}$ depends on the weight distribution of the classifier. Therefore, if $W_0 > W_M$, the optimization simplifies to:

$$\min_W\ W_0(\epsilon_1 - \alpha) - d\,W_M\frac{\alpha}{\sqrt{d}} + \frac{1}{2}\lambda\|W\|_2^2 \tag{33}$$

and if $W_M > W_0$:

$$\min_W\ -W_0\,\alpha - W_M\left(\sqrt{d}\,\alpha - \epsilon_1\right) + \frac{1}{2}\lambda\|W\|_2^2 \tag{34}$$

Recall that $\epsilon_1 = \alpha + \zeta_1$. Once again, to simplify the discussion that follows, we lower bound the robust accuracy of the classifier M_1 by considering the optimal solution when ζ1 = 0. The optimal solution is then given by:

$$\lim_{\zeta_1 \to 0} W_M = \frac{1}{\sqrt{d}}$$

so that for the robust classifier M_1 the weights are $W = [W_0, W_1, \ldots, W_d] = [0, \frac{1}{\sqrt{d}}, \ldots, \frac{1}{\sqrt{d}}]$. While this may not be the optimal solution for all values of ζ1, we are only interested in a lower bound on the final accuracy, and the classifier described by these weights simplifies the discussion hereon. We also show later in Appendix C.3 that the model achieves greater than 99% accuracy against ℓ1 adversaries for the chosen value of ζ1.

C.2 PERTURBATION SIZES FOR FOOLING M p MODELS

Now that we know the exact weights of the learned robust classifiers M_1 and M_∞, we can calculate the values ζ1 and ζ∞ that set the exact radii of the perturbation regions for the ℓ1 and ℓ∞ metrics. We set these radii in such a way that an ℓ1 adversary can fool the model M_∞ with probability ∼98% (corresponding to z = 2 in the z-table for normal distributions), and similarly, the success rate of ℓ∞ attacks against the M_1 model is ∼98%. Let $P_{p_1, p_2}$ denote the probability that model $M_{p_1}$ correctly classifies an adversarial input in the $\ell_{p_2}$ region. For $p_1 = \infty$ and $p_2 = 1$:

$$P_{\infty, 1} = P_{x \sim \mathcal{N}(y\mu, \Sigma)}\left[y \cdot M_\infty(x + \delta_{\ell_1}) > 0\right] = P_{x \sim \mathcal{N}(y\mu, \Sigma)}\left[y\,(x + \delta_{\ell_1})^T W > 0\right] \geq P_{x \sim \mathcal{N}(\mu, \Sigma)}\left[x_0 > \epsilon_1\right] \tag{36}$$

Setting $z = \frac{\epsilon_1 - \alpha}{\sigma} = \frac{\zeta_1}{\sigma} = 2$ gives $\zeta_1 = 2\sigma$, i.e., $\epsilon_1 = \alpha + 2\sigma$.

To simplify the discussion for the M_1 model, we define a meta-feature $x_M = \frac{1}{\sqrt{d}}\sum_{i=1}^{d} x_i$, which is distributed as $x_M \sim \mathcal{N}(y\eta\sqrt{d}, \sigma^2) = \mathcal{N}(y\alpha, \sigma^2)$. For $p_1 = 1$ and $p_2 = \infty$:

$$P_{1, \infty} = P_{x \sim \mathcal{N}(y\mu, \Sigma)}\left[y \cdot M_1(x + \delta_{\ell_\infty}) > 0\right] = P_{x \sim \mathcal{N}(y\mu, \Sigma)}\left[y\cdot\frac{1}{\sqrt{d}}\sum_{i=1}^{d}(x_i + \delta_{\ell_\infty}(i)) > 0\right] \geq P_{x \sim \mathcal{N}(\mu, \Sigma)}\left[x_M > \sqrt{d}\,\epsilon_\infty\right] \tag{37}$$

Setting $z = \frac{\sqrt{d}\,\epsilon_\infty - \alpha}{\sigma} = \frac{\sqrt{d}\,\zeta_\infty}{\sigma} = 2$ gives $\zeta_\infty = \frac{2\sigma}{\sqrt{d}}$, i.e., $\epsilon_\infty = \frac{\alpha + 2\sigma}{\sqrt{d}}$.

C.3 ROBUSTNESS OF INDIVIDUAL M_p MODELS

Additional assumptions. We add the following assumptions: (1) the dimensionality parameter d of the input data is at least 100; and (2) the ratio of the mean to the standard deviation of feature x0 is at least 10:

$$d \geq 100, \qquad \frac{\alpha}{\sigma} \geq 10 \tag{38}$$

We define $P_{p,p}$ as the probability that, for any given input $x \sim \mathcal{N}(y\mu, \Sigma)$, the classifier M_p outputs the correct label y for the input $x + \delta_{\ell_p}$.
p = ∞.

$$P_{\infty, \infty} = P_{x \sim \mathcal{N}(y\mu, \Sigma)}\left[y \cdot M_\infty(x + \delta_{\ell_\infty}) > 0\right] = P_{x \sim \mathcal{N}(y\mu, \Sigma)}\left[y\,(x_0 + \delta_{\ell_\infty}(0)) > 0\right] \geq P_{x \sim \mathcal{N}(\mu, \Sigma)}\left[x_0 > \epsilon_\infty\right] \tag{39}$$

with $z = \frac{\epsilon_\infty - \alpha}{\sigma} = \frac{\alpha}{\sigma}\left(\frac{1}{\sqrt{d}} - 1\right) + \frac{2}{\sqrt{d}}$. Using the assumptions in (38),

$$P_{\infty, \infty} \geq 0.999 \tag{40}$$

p = 1.

$$P_{1,1} = P_{x \sim \mathcal{N}(y\mu, \Sigma)}\left[y \cdot M_1(x + \delta_{\ell_1}) > 0\right] = P_{x \sim \mathcal{N}(y\mu, \Sigma)}\left[y\,(x_M + \delta_M) > 0\right] \geq P_{x \sim \mathcal{N}(\mu, \Sigma)}\left[x_M > \frac{\epsilon_1}{\sqrt{d}}\right]$$

with $z = \frac{\epsilon_1/\sqrt{d} - \alpha}{\sigma} = \frac{\alpha}{\sigma}\left(\frac{1}{\sqrt{d}} - 1\right) + \frac{2}{\sqrt{d}}$. Using the assumptions in (38),

$$P_{1,1} \geq 0.999$$
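Both bounds can be verified by simulation, using the robust weights derived in Appendix C.1 and the radii from Appendix C.2. A numpy sketch (constants illustrative, chosen to satisfy the assumptions in (38)):

```python
import numpy as np

def robust_model_accuracies(n=200_000, d=100, alpha=10.0, sigma=1.0, seed=0):
    """Monte Carlo estimates of P_inf_inf and P_1_1 (Appendix C.3) for
    M_inf with weights [1, 0, ..., 0] and M_1 with weights [0, 1/sqrt(d), ...]."""
    rng = np.random.default_rng(seed)
    eps_inf = (alpha + 2 * sigma) / np.sqrt(d)   # linf radius from Appendix C.2
    eps1 = alpha + 2 * sigma                     # l1 radius from Appendix C.2
    y = rng.choice([-1, 1], size=n)
    x0 = y * alpha + sigma * rng.standard_normal(n)
    # meta-feature x_M = (1/sqrt(d)) * sum_i x_i  ~  N(y*alpha, sigma^2)
    x_M = y * alpha + sigma * rng.standard_normal(n)
    # the worst-case linf attack against M_inf shifts x_0 by -y * eps_inf
    p_inf_inf = (y * (x0 - y * eps_inf) > 0).mean()
    # the worst-case l1 attack against M_1 shifts x_M by -y * eps1 / sqrt(d)
    p_1_1 = (y * (x_M - y * eps1 / np.sqrt(d)) > 0).mean()
    return p_inf_inf, p_1_1
```

With α/σ = 10 and d = 100, the decision margin is about 8.8 standard deviations in both cases, so both empirical accuracies comfortably exceed the 0.999 bound.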

C.4 DECISION RULE FOR C adv

We aim to provide a lower bound on the worst-case accuracy of the entire pipeline through the existence of a simple decision rule C_adv. For given perturbation budgets ε1 and ε∞, we first examine the range of values that the mean of the adversarial input distribution can take. Consider the scenarios described in Table 4 below, where $\mu^{adv}_0$ and $\mu^{adv}_M$ represent the new means of the distributions of the features x0 and x_M after the adversarial perturbation.

Table 4: Range of values the mean can take depending on the decision taken by the adversary.

Attack type   µ^adv_0 (y = 1)       µ^adv_0 (y = -1)        µ^adv_M (y = 1)                µ^adv_M (y = -1)
None          α                     -α                      η√d                            -η√d
ℓ∞            {α - ε∞, α + ε∞}      {-α - ε∞, -α + ε∞}      {η√d - √d·ε∞, η√d + √d·ε∞}     {-η√d - √d·ε∞, -η√d + √d·ε∞}
ℓ1            {α - ε1, α + ε1}      {-α - ε1, -α + ε1}      {η√d - ε1/√d, η√d + ε1/√d}     {-η√d - ε1/√d, -η√d + ε1/√d}

Note that any adversary that moves the perturbation away from the y-axis is uninteresting for our comparison, since irrespective of whether C_adv predicts the perturbation type correctly, either of the two second-level models naturally obtains a high accuracy on such inputs. Hence, we define the following decision rule, with all the remaining cases mapped to the ℓ1 perturbation type:

$$C_{adv}(x) = \begin{cases} 1, & \text{if } \big||x_0| - \alpha\big| < \epsilon_\infty + \frac{\alpha}{2} \\ 0, & \text{otherwise} \end{cases}$$

where output 1 corresponds to the classifier predicting the presence of an ℓ∞ perturbation in the input, while output 0 indicates that the classifier predicts the input to contain a perturbation of the ℓ1 type. In a black-box setting where the adversary has no knowledge of the classifier C_adv and can only attack M_p, it is easy to see that the proposed pipeline obtains a high adversarial accuracy against the union of ℓ1 and ℓ∞ perturbations. Note: (1) there exists a single model that can also achieve robustness against the union of ℓ1 and ℓ∞ perturbations; however, learning such a model may be more challenging in real data settings; and (2) the classifier need not be perfect.
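The decision rule and the routing step of the pipeline can be written directly in code. This is a sketch with illustrative constants following the setting α = 10, σ = 1, d = 100, for which ε∞ = (α + 2σ)/√d = 1.2; the function names are ours.

```python
import numpy as np

ALPHA, EPS_INF, D = 10.0, 1.2, 100  # illustrative values satisfying (38)

def c_adv(x):
    """Appendix C.4 rule: predict linf (1) when |x_0| stays near alpha,
    otherwise predict l1 (0)."""
    return 1 if abs(abs(x[0]) - ALPHA) < EPS_INF + ALPHA / 2 else 0

def pipeline(x):
    """Route to the specialized robust model chosen by C_adv:
    M_inf reads only x_0, while M_1 reads only the meta-feature x_M."""
    if c_adv(x) == 1:
        return 1 if x[0] > 0 else -1          # M_inf: sign of x_0
    x_M = x[1:].sum() / np.sqrt(D)
    return 1 if x_M > 0 else -1               # M_1: sign of x_M
```

For example, an ℓ1 attack that flips x0 pushes the input outside the acceptance band of `c_adv`, so the prediction is routed to M_1, which ignores x0 entirely.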

C.5 TRADE-OFF BETWEEN ATTACKING M p AND C adv

To obtain true robustness, the entire pipeline must be robust to adversarial attacks. More specifically, in this section we demonstrate the natural tension that exists between fooling the top-level attack classifier (by making an adversarial attack less representative of its natural distribution) and fooling the bottom-level adversarially robust models (which requires stronger attacks, pushing the perturbation back towards its natural distribution). The accuracy of the pipelined model f against any input-label pair (x, y) sampled from a distribution $\mathcal{N}(y\mu_{adv}, \Sigma)$ (where µ_adv incorporates the change in the input distribution owing to the adversarial perturbation) is given by:

$$P[f(x) = y] = P_{x \sim \mathcal{N}(\mu_{adv}, \Sigma)}[C_{adv}(x) = 1]\; P_{x \sim \mathcal{N}(\mu_{adv}, \Sigma)}[M_\infty(x) > 0 \mid C_{adv}(x) = 1]\ +\ \left(1 - P_{x \sim \mathcal{N}(\mu_{adv}, \Sigma)}[C_{adv}(x) = 1]\right) P_{x \sim \mathcal{N}(\mu_{adv}, \Sigma)}[M_1(x) > 0 \mid C_{adv}(x) = 0]$$

ℓ∞ adversary: To simplify the analysis, we consider loose lower bounds on the accuracy of the model f against the ℓ∞ adversary. Recall that the decision of the attack classifier depends only on the feature x0. Irrespective of the input features x_i ∀i > 0, it is always beneficial for the adversary to perturb each of them by -yε∞. However, the same does not apply to x0. Analyzing the scenario when the true label is y = 1: if the input x0 lies within α/2 - ε∞ of the mean α, then irrespective of the perturbation, the output of the attack classifier is C_adv = 1, and the M_∞ model then always correctly classifies these inputs. The overall robustness of the pipeline requires analysis of the case when x0 lies outside this range as well; however, to obtain a loose lower bound on the robust accuracy of the pipeline model f against ℓ∞ attacks, we simply assume that the adversary always succeeds in that case.
$$P[f(x) = y] \geq P_{x \sim \mathcal{N}(\mu_{adv}, \Sigma)}[C_{adv}(x) = 1]\; P_{x \sim \mathcal{N}(\mu_{adv}, \Sigma)}[M_\infty(x) > 0 \mid C_{adv}(x) = 1] \geq P_{x \sim \mathcal{N}(\mu, \Sigma)}\left[|x_0 - \alpha| \leq \frac{\alpha}{2} - \epsilon_\infty\right]$$

with $z = -\frac{\alpha}{2\sigma} + \frac{\alpha + 2\sigma}{\sigma\sqrt{d}}$ for each tail. Using the assumptions in (38),

$$P[f(x) = y] \gtrsim 0.99$$

ℓ1 adversary: It may be noted that a trivial way for the ℓ1 adversary to fool the attack classifier is to return a perturbation $\delta_{\ell_1} = 0$. In such a scenario, the classifier predicts that the adversarial image was subjected to an ℓ∞ attack, and the label prediction is hence made by the M_∞ model. But we know from (40) that the M_∞ model predicts benign inputs correctly with probability $P_{\infty,\infty} > 0.99$, defeating the adversarial objective of misclassification. To achieve misclassification over the entire pipeline, the ℓ1 adversary can succeed only when $x_0 \in \left(-\alpha - \frac{\alpha}{2} - \epsilon_1,\ -\alpha + \frac{\alpha}{2} + \epsilon_1\right)$, where it can fool the pipeline by ensuring that $C_{adv}(x) = 1$. In all other cases, irrespective of the perturbation, either $C_{adv} = 0$ or the feature x0 has the same sign as the label y. Since $P_{1,1} > 0.99$ for the M_1 model, for all the remaining inputs the model correctly predicts the label with probability greater than 0.99 (approximate lower bound). We formulate this trade-off to elaborate the robustness of the proposed pipeline.
$$P[f(x) = y] \geq 0.999\; P_{x \sim \mathcal{N}(\mu, \Sigma)}\left[x_0 < -\alpha - \frac{\alpha}{2} - \epsilon_1 \ \text{ or }\ x_0 > -\alpha + \frac{\alpha}{2} + \epsilon_1\right]$$

Using the assumptions in (38), $P[f(x) = y] \gtrsim 0.99$. This concludes the proof of Theorem 2, showing that an adversary can hardly stage successful attacks on the entire pipeline and faces a natural tension between attacking the label predictor and attacking the attack classifier. Finally, we emphasize that the accuracies shown are lower bounds on the actual robust accuracy; the objective of this analysis is not to find the optimal solution to the problem of multiple-perturbation adversarial training, but to expose the existence of the trade-off between attacking the two stages of the pipeline.

D MODEL ARCHITECTURE

Second-level M_p models. A key advantage of our PROTECTOR design is that we can build upon existing defenses against individual perturbation types. Specifically, for MNIST, we use the same CNN architecture as Zhang et al. (2019) for our M_p models, and we train these models using their proposed TRADES loss. For CIFAR-10, we use the same training setup and model architecture as Carmon et al. (2019), which is based on a robust self-training algorithm that utilizes unlabeled data to improve model robustness.

Perturbation classifier C_adv. For both the MNIST and CIFAR-10 datasets, the architecture of the perturbation classifier C_adv is similar to that of the individual M_p models. Specifically, for MNIST, we use the CNN architecture in Zhang et al. (2019) with four convolutional layers, followed by two fully-connected layers. For CIFAR-10, C_adv is a WideResNet (Zagoruyko & Komodakis, 2016) model with depth 28 and widening factor 10 (WRN-28-10).

E TRAINING DETAILS

E.1 SPECIALIZED ROBUST PREDICTORS M_p

MNIST. We use the Adam optimizer (Kingma & Ba, 2015) with a piece-wise linearly varying learning rate schedule (Smith, 2018) and a maximum learning rate of 10^-3. The base models M_1, M_2, M_∞ are trained using the TRADES algorithm for 20 iterations, with step sizes α_1 = 2.0, α_2 = 0.3, and α_∞ = 0.05 for the ℓ1, ℓ2, ℓ∞ attack types within perturbation radii ε_1 = 10.0, ε_2 = 2.0, and ε_∞ = 0.3 respectively.

CIFAR-10. The individual M_p models are trained to be robust against {ℓ∞, ℓ1, ℓ2} perturbations of radii {ε∞, ε1, ε2} = {0.03, 12.0, 0.5} respectively. We use a learning rate of 0.01 and the SGD optimizer for 10 epochs, with linear rate decay to 0.001 between the fourth epoch and the tenth epoch. The batch size is set to 100 for all experiments.

Creating the adversarial perturbation dataset. We create a static dataset of adversarially perturbed images and their corresponding attack labels for training the perturbation classifier C_adv. For generating adversarial images, we perform weak adversarial attacks that are fast to compute; in particular, we perform 10 iterations of the PGD attack. For MNIST, the attack step sizes are {α∞, α1, α2} = {0.05, 2.0, 0.3} respectively. For CIFAR-10, the attack step sizes are {α∞, α1, α2} = {0.005, 2.0, 0.1} respectively. Note that for the ℓ1 perturbation ball we perform the Sparse-ℓ1 (top-k) PGD attack introduced by Tramèr & Boneh (2019). We set k = 10; that is, we move by a step size of α1/k in each of the top 10 directions with respect to the magnitude of the gradient.
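The top-k Sparse-ℓ1 PGD step described above can be sketched as follows. This is a simplified numpy version: in particular, the projection back onto the ℓ1 ball is done by rescaling the perturbation rather than by an exact simplex projection, and pixel-range clipping is omitted.

```python
import numpy as np

def sparse_l1_step(x_adv, grad, x_orig, alpha1, eps1, k=10):
    """One top-k l1 PGD step (after Tramer & Boneh, 2019): move alpha1/k along
    the gradient sign in each of the k coordinates of largest |gradient|,
    then pull the total perturbation back inside the l1 ball of radius eps1."""
    flat = grad.reshape(-1)
    topk = np.argsort(np.abs(flat))[-k:]          # indices of k largest |grad| entries
    step = np.zeros_like(flat)
    step[topk] = (alpha1 / k) * np.sign(flat[topk])
    delta = x_adv + step.reshape(grad.shape) - x_orig
    l1 = np.abs(delta).sum()
    if l1 > eps1:
        delta *= eps1 / l1                        # simplified l1 projection by rescaling
    return x_orig + delta
```

Spreading the step over the k strongest coordinates, rather than a single one, keeps the attack sparse while avoiding the degenerate single-pixel updates of a pure steepest-descent ℓ1 step.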

F ATTACKS USED FOR EVALUATION

A description of all the attacks used for evaluating the models is presented here. From the Foolbox library (Rauber et al., 2017), apart from the ℓ1, ℓ2 and ℓ∞ PGD adversaries, we also evaluate the following attacks for different perturbation types. (1) For ℓ1 perturbations, we include the Salt & Pepper Attack (SAPA) (Rauber et al., 2017) and the Pointwise Attack (PA) (Schott et al., 2018). (2) For ℓ2 perturbations, we include the Gaussian noise attack (Rauber et al., 2017), the Boundary Attack (Brendel et al., 2018), DeepFool (Moosavi-Dezfooli et al., 2016), the DDN attack (Rony et al., 2019), and the C&W attack (Carlini & Wagner, 2017b).

Table 5: Vanilla model: empirical overlap of ℓp (ε_p) attack perturbations in different ℓq (ε_q) regions for (a) MNIST, (ε1, ε2, ε∞) = (10, 2.0, 0.3); (b) CIFAR-10, (ε1, ε2, ε∞) = (12, 0.5, 0.03). Each column represents the range (min - max) of the ℓq norm for perturbations generated using the ℓp PGD attack.

(3) For ℓ∞ perturbations, we include the FGSM attack (Goodfellow et al., 2015) and the Momentum Iterative Method (Dong et al., 2018). From the AutoAttack library of Croce & Hein (2020c), we make use of all three variants of the Adaptive PGD attack (APGD-CE, APGD-DLR, APGD-T) (Croce & Hein, 2020c), along with the targeted and standard versions of the Fast Adaptive Boundary Attack (FAB, FAB-T) (Croce & Hein, 2020b) and the Square Attack (Andriushchenko et al., 2020). We utilize the AA+ version for strong attacks.

Attack hyperparameters. For the attacks in the Foolbox and AutoAttack libraries, we use the default parameter settings in the strongest available mode (such as AA+). For the custom PGD attacks, we evaluate the models with 10 restarts and 200 iterations of the PGD attack. The step sizes of the {ℓ∞, ℓ1, ℓ2} PGD attacks are set as follows: for MNIST, {α∞, α1, α2} = {0.01, 1.0, 0.1} respectively; for CIFAR-10, {α∞, α1, α2} = {0.003, 1.0, 0.02} respectively. Further, similar to Tramèr & Boneh (2019) and Maini et al. (2020), we evaluate our models on the first 1000 images of the test sets of MNIST and CIFAR-10, since many of the attacks employed are extremely computationally expensive and slow to run. Specifically, on a single GPU, the entire evaluation of a single model against all the attacks discussed, with multiple restarts, would take nearly one month, which is not feasible.
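For reference, the multi-restart PGD protocol above (random restarts, fixed iteration budget, worst case over restarts) can be sketched for a toy linear model, where the gradient of the linear loss with respect to δ is constant. This is an illustrative sketch, not the evaluation code used in the paper.

```python
import numpy as np

def pgd_linf_robust_acc(W, X, y, eps, step, iters=200, restarts=10, seed=0):
    """Robust accuracy of the linear model sign(x @ W) under linf PGD with
    multiple random restarts; an input counts as robust only if no restart
    fools the model (worst case over restarts)."""
    rng = np.random.default_rng(seed)
    fooled = np.zeros(len(y), dtype=bool)
    for _ in range(restarts):
        delta = rng.uniform(-eps, eps, size=X.shape)   # random start inside the ball
        for _ in range(iters):
            grad = -y[:, None] * W                      # d/d_delta of -y * (x + delta) @ W
            delta = np.clip(delta + step * np.sign(grad), -eps, eps)
        fooled |= y * ((X + delta) @ W) <= 0            # keep the strongest restart
    return 1.0 - fooled.mean()
```

The same skeleton carries over to deep models, with the constant gradient replaced by backpropagation through the network at each iteration.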

G EMPIRICAL PERTURBATION OVERLAP

Following Section 6.2, we also present results on the perturbation overlap when we attack PROTECTOR with PGD attacks. To contrast these results with those of attacking a vanilla model, we repeat the corresponding table from the main paper for convenience. It is noteworthy that the presence of a perturbation classifier forces the adversaries to generate attacks that increase the norm of the perturbation in the alternate ℓq region. Secondly, we also observe that in the case of CIFAR-10, the ℓ2 PGD attack has a large overlap with the ℓ1 ball. However, recall that in the case of ℓ2 attacks on CIFAR-10, both base models M_1 and M_∞ were satisfactorily robust; hence, the attacker has no incentive to reduce the perturbation radius in an ℓq norm, since the perturbation classifier only performs a binary classification between ℓ1 and ℓ∞ attacks. The results can be observed in Tables 5 and 6.

H BREAKDOWN OF COMPLETE EVALUATION

In this section, we present the results of the perturbation type classifier C adv against transfer adversaries. We also present the breakdown results of the adversarial robustness of baseline approaches and our PROTECTOR pipeline against all the attacks that we tried, and also report the worst case performance against the union of all attacks.

H.1 ROBUSTNESS OF C adv

The results for the robustness of the perturbation classifier C_adv in the presence of adaptive adversaries are presented in Table 7. Note that C_adv transfers well across the board, achieving extremely high test accuracy even when the adversarial examples are generated against new models that C_adv did not see during training. Further, even when the adversarial attack is generated by a different algorithm, such as one from the AutoAttack library, the transfer success of C_adv still holds up. In particular, the obtained accuracy is > 95% across all the individual test sets created. The attack classification accuracy is in general highest against examples generated by attacking M_1 or M_∞ for CIFAR-10, and M_2 or M_∞ for MNIST. This is an expected consequence of how the static dataset for training the perturbation classifier C_adv is generated, as described in Section 5.1.

H.2 MNIST

In Table 8 , we provide a breakdown of the adversarial accuracy of all the baselines, individual M p models and the PROTECTOR method, with both the adaptive and standard attack variants on the MNIST dataset. PROTECTOR outperforms prior baselines by 6.9% on the MNIST dataset. It is important to note that PROTECTOR shows significant improvements against most attacks in the suite. Compared to the previous state-of-the-art defense against multiple perturbation types (MSD), if we compare the performance gain on each individual attack algorithm, the improvement is also significant, with an average accuracy increase of 15.5% on MNIST dataset. These results demonstrate that PROTECTOR considerably mitigates the trade-off in accuracy against individual attack types.

H.3 CIFAR-10

In Table 9, we provide a breakdown of the adversarial accuracy of all the baselines, the individual M_p models, and the PROTECTOR method, with both the adaptive and standard attack variants on the CIFAR-10 dataset. PROTECTOR outperforms prior baselines by 8.9%. Once again, note that PROTECTOR shows significant improvements against most attacks in the suite. Compared to the previous state-of-the-art defense against multiple perturbation types (MSD), if we compare the performance gain on each individual attack algorithm, the improvement is significant, with an average accuracy increase of 14.2% on the CIFAR-10 dataset. These results demonstrate that PROTECTOR considerably mitigates the trade-off in accuracy against individual attack types. Further, PROTECTOR also retains a higher accuracy on benign images, as opposed to past defenses that sacrifice benign accuracy for robustness to multiple perturbation types. In particular, the clean accuracy of PROTECTOR is over 6% higher than that of such existing defenses on CIFAR-10, and is close to that of M_p models trained for a single perturbation type. (Results are reported for the column 'Ours' and using the 'max' strategy (Equation 1) for the column 'Ours*'.) For consistency of our defense strategy irrespective of the attacker's strategy, the defender only utilizes predictions from the specialized model M_p corresponding to the most likely attack (Equation 1) to provide the final prediction (forward propagation only) for generated adversarial examples. In our evaluation, we found a negligible impact of changing this aggregation to the 'softmax' strategy. For example, we show representative results for the APGD (ℓ∞, ℓ2) attacks on the CIFAR-10 dataset in Table 10.
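The two aggregation strategies discussed above can be sketched as follows. The function names are ours; `p_attack` stands for C_adv's distribution over perturbation types and `logits_by_model` for the second-level models' outputs.

```python
import numpy as np

def aggregate_max(p_attack, logits_by_model):
    """'max' strategy (Equation 1): only the specialized model matching the
    most likely perturbation type produces the final prediction."""
    return logits_by_model[int(np.argmax(p_attack))]

def aggregate_softmax(p_attack, logits_by_model):
    """'softmax' strategy: average every model's softmaxed prediction,
    weighted by the attack-type probabilities from C_adv."""
    out = np.zeros_like(logits_by_model[0], dtype=float)
    for w, logits in zip(p_attack, logits_by_model):
        z = np.exp(logits - logits.max())   # numerically stable softmax
        out += w * z / z.sum()
    return out
```

When C_adv is confident, the two strategies agree on the predicted class, which is consistent with the negligible difference observed in Table 10.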



We will open-source the code, pre-trained models, and perturbation type datasets upon publication. We use the Sparse ℓ1 Descent (Tramèr & Boneh, 2019) for the PGD attack under the ℓ1 constraint.



Figure 1: An overview of our PROTECTOR pipeline. (a) The perturbation classifier C adv correctly categorizes representative attacks of different types. (b) An illustration of the trade-off in Theorem 2, where an adversarial example fooling C adv (the ℓ∞ sample marked in red) becomes weaker at attacking the second-level Mp models.

Figure 2: PCA for different types of adversarial examples on MNIST.
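A PCA projection like the one in Figure 2 can be reproduced with plain numpy via the SVD of centered perturbations. The sketch below uses synthetic stand-ins for the flattened perturbations (a real analysis would use delta = x_adv − x from actual ℓ∞ and ℓ2 attacks on MNIST); the sign-like versus diffuse patterns are an assumption chosen only to make the two types separable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for flattened MNIST perturbations of two types.
delta_inf = rng.choice([-0.3, 0.3], size=(100, 784))  # sign-like (l_inf-style)
delta_l2 = rng.normal(scale=0.1, size=(100, 784))     # diffuse (l_2-style)

X = np.vstack([delta_inf, delta_l2])
Xc = X - X.mean(axis=0)                  # center before PCA
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T                   # project onto the top-2 components
print(coords.shape)                      # -> (200, 2)
```

Plotting `coords` colored by attack type gives the kind of 2-D visualization shown in the figure.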

Figure 3: Illustration of the effect of random noise on generating adversarial examples. Note that the notion of small and large perturbations is only used to illustrate this scenario; in general, none of the perturbation regions subsumes the others.
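The test-time randomization described in the abstract (adding random noise to the model input before the pipeline runs) can be sketched as below. The Gaussian noise model, the `sigma` value, and the [0, 1] pixel range are assumptions for illustration; the paper's exact noise distribution and magnitude may differ.

```python
import numpy as np

def noisy_input(x, sigma=0.1, seed=None):
    """Perturb the input with Gaussian noise at test time, then clip
    back to the valid pixel range. sigma is an assumed hyperparameter."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(scale=sigma, size=x.shape)
    return np.clip(x + noise, 0.0, 1.0)

x = np.full((28, 28), 0.5)               # a gray MNIST-sized image
x_noisy = noisy_input(x, sigma=0.1, seed=0)
print(x_noisy.shape)                     # -> (28, 28)
```

Because the noise is resampled per query, an adaptive adversary cannot precisely anticipate the input the perturbation classifier will see, which is how the trade-off in Theorem 2 is realized in practice.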

$$p(y=1)\,p(x \mid y=1) \;\underset{\hat{y}=-1}{\overset{\hat{y}=1}{\gtrless}}\; p(y=-1)\,p(x \mid y=-1)$$
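This likelihood-ratio rule (predict y = 1 exactly when p(y=1)p(x|y=1) exceeds p(y=-1)p(x|y=-1)) can be checked on a small numeric example. The Gaussian class conditionals with means at +1 and −1 are an illustrative assumption, not part of the paper's setup.

```python
import numpy as np

def bayes_predict(x, prior1, mean1, mean_m1, sigma=1.0):
    """Predict y=1 iff p(y=1)p(x|y=1) > p(y=-1)p(x|y=-1),
    assuming Gaussian class conditionals with a shared sigma."""
    def gauss(x, mu):
        # Unnormalized Gaussian density; the shared normalizer cancels.
        return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))
    score_pos = prior1 * gauss(x, mean1)
    score_neg = (1.0 - prior1) * gauss(x, mean_m1)
    return 1 if score_pos > score_neg else -1

# With equal priors and class means at +1 and -1, the boundary is x = 0.
print(bayes_predict(0.7, 0.5, 1.0, -1.0))   # -> 1
print(bayes_predict(-0.7, 0.5, 1.0, -1.0))  # -> -1
```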

003, 12.0, 0.05} respectively. For CIFAR-10, the attack step sizes are {α∞, α1, α2} = {0.005, 2.0, 0.1}, respectively. The training of the individual Mp models directly follows Carmon et al. (2019).

E.2 PERTURBATION CLASSIFIER C adv

MNIST. We use the Adam optimizer with a learning rate of 0.01 for 10 epochs, decayed linearly to 0.001 between the fourth and the tenth epoch. The batch size is set to 100 for all experiments.
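The per-norm step sizes above plug into the PGD update, whose ascent direction depends on the attack norm. A minimal sketch of a single un-projected step for the ℓ∞ and ℓ2 cases is shown below (the ℓ1 case uses the Sparse ℓ1 Descent of Tramèr & Boneh (2019) instead and is omitted); `pgd_step` is a hypothetical helper, not the paper's code.

```python
import numpy as np

def pgd_step(x, grad, norm, alpha):
    """One un-projected PGD ascent step; the direction is the
    steepest-ascent direction under the given norm constraint."""
    if norm == "inf":
        return x + alpha * np.sign(grad)            # l_inf: sign of gradient
    if norm == "2":
        unit = grad / (np.linalg.norm(grad) + 1e-12)  # l_2: unit direction
        return x + alpha * unit
    raise ValueError(f"unsupported norm: {norm}")

g = np.array([0.5, -2.0, 0.1])
x0 = np.zeros(3)
print(pgd_step(x0, g, "inf", 0.005))  # -> [ 0.005 -0.005  0.005]
```

A full attack iterates this step and projects back onto the εp ball after each update.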

Worst-case accuracies against different ℓp attacks: (a) MNIST; (b) CIFAR-10. Ours represents PROTECTOR against the adaptive attack strategy (Section 5.2), and Ours* is the standard setting.

To examine the performance of perturbation type classification, we evaluate C adv on a dataset of adversarial examples generated against the six models we use as baseline defenses in our experiments. Note that C adv is only trained on adversarial examples against the two Mp models that are part of PROTECTOR. We observe that C adv transfers well across the board. First, C adv generalizes to adversarial examples against new models; i.e., it preserves a high accuracy even when the adversarial examples are generated against models unseen by C adv during training. Further, C adv also generalizes to new attack algorithms. As discussed in Section 5.1, we only include PGD adversarial examples in the training set for C adv; nevertheless, on adversarial examples generated by the AutoAttack library, the classification accuracy of C adv still holds up.

Effect of different design choices on the CIFAR-10 dataset against PGD-based attacks. PROTECTOR(n) means that the number of specialized robust predictors Mp in the pipeline is n.

Even though we evaluate PROTECTOR against a stronger adaptive adversary, in terms of the all-attacks accuracy, PROTECTOR still outperforms all baselines by 6.9% on MNIST and 8.9% on CIFAR-10. Compared to the previous state-of-the-art defense against multiple perturbation types (MSD), the accuracy gain on ℓ∞ attacks is especially notable, i.e., greater than 15%. In particular, if we compare the performance gain on each individual attack algorithm, as shown in Appendices H.2 and H.3 for MNIST and CIFAR-10 respectively, the improvement is also significant, with an average accuracy increase of 15.5% on MNIST and 14.2% on CIFAR-10. These results demonstrate that PROTECTOR considerably mitigates the trade-off in accuracy against individual attack types.

Empirical overlap of ℓp attack perturbations in different ℓq regions for (a) MNIST, (ε1, ε2, ε∞) = (10, 2.0, 0.3); (b) CIFAR-10, (ε1, ε2, ε∞) = (12, 0.5, 0.03). Each column represents the range (min - max) of the ℓq norm of perturbations generated using the ℓp PGD attack.
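The norm ranges summarized in this table can be computed directly. The sketch below uses uniformly sampled ℓ∞-bounded perturbations (ε∞ = 0.03, as in the CIFAR-10 setting) as a hypothetical stand-in for real PGD outputs, then measures their ℓ1, ℓ2, and ℓ∞ norms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical l_inf-bounded perturbations on CIFAR-10-sized inputs;
# the actual table uses perturbations produced by PGD attacks.
eps_inf = 0.03
deltas = rng.uniform(-eps_inf, eps_inf, size=(50, 3 * 32 * 32))

for q in (1, 2, np.inf):
    norms = np.linalg.norm(deltas, ord=q, axis=1)  # per-example l_q norm
    print(f"l_{q}: min={norms.min():.3f} max={norms.max():.3f}")
```

By construction every ℓ∞ norm here is at most ε∞ = 0.03, while the induced ℓ1 and ℓ2 norms are far larger, which is exactly the kind of separation between regions that the table quantifies.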

Perturbation type classification accuracy for different perturbation types. Note that the perturbation classifier C adv is only trained on adversarial examples against two Mp models. Each column represents the model used to create transfer-based attacks via the attack type in the corresponding row. The reported accuracy is an aggregate over 1000 randomly sampled attacks of the ℓ∞, ℓ2, and ℓ1 types for the corresponding algorithms (and datasets).

Attack-wise breakdown of adversarial robustness on the MNIST dataset. Ours represents the PROTECTOR method against the adaptive attack strategy described in Section 5.2, and Ours* represents the standard attack setting.

Attack-wise breakdown of adversarial robustness on the CIFAR-10 dataset. Ours represents the PROTECTOR method against the adaptive attack strategy described in Section 5.2, and Ours* represents the standard attack setting.

Comparison between using a 'softmax'-based aggregation of predictions from different specialized models versus using the prediction from the model corresponding to the most likely attack (only at inference time). Results are presented for APGD ℓ2 and ℓ∞ attacks on the CIFAR-10 dataset.

AGGREGATING PREDICTIONS FROM DIFFERENT Mp AT INFERENCE

In all our experiments in this work, the adversary constructs adversarial examples using the softmax-based adaptive strategy for aggregating predictions from the different Mp models, as described in Equation
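The softmax-based aggregation referred to here can be sketched as a mixture of the specialized models' class distributions, weighted by C adv's perturbation-type probabilities. This is an illustrative reading of the strategy, with toy inputs; the exact formulation is given by the equation in the main text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def softmax_aggregate(type_probs, logits_per_model):
    """Weight each specialized model's class distribution by C adv's
    probability for the corresponding perturbation type."""
    class_probs = np.stack([softmax(l) for l in logits_per_model])
    return type_probs @ class_probs     # mixture over the Mp models

type_probs = np.array([0.7, 0.3])       # hypothetical C adv output
logits = [np.array([2.0, 0.0]),         # logits from M_inf (toy values)
          np.array([0.0, 1.0])]         # logits from M_1 (toy values)
p = softmax_aggregate(type_probs, logits)
print(p)                                # a valid class distribution
```

Because this mixture is differentiable in every model's output, it is the natural target for an adaptive attacker, whereas the defender's 'max' routing uses only the single most likely Mp.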

