PERTURBATION TYPE CATEGORIZATION FOR MULTIPLE ℓp BOUNDED ADVERSARIAL ROBUSTNESS

Abstract

Despite recent advances in adversarial training based defenses, deep neural networks are still vulnerable to adversarial attacks outside the perturbation type they are trained to be robust against. Recent works have proposed defenses to improve the robustness of a single model against the union of multiple perturbation types. However, when evaluated against each individual attack, these methods still suffer significant trade-offs compared to models specifically trained to be robust against that perturbation type. In this work, we introduce the problem of categorizing adversarial examples based on their ℓp perturbation types. Based on our analysis, we propose PROTECTOR, a two-stage pipeline to improve the robustness against multiple perturbation types. Instead of training a single predictor, PROTECTOR first categorizes the perturbation type of the input, and then utilizes a predictor specifically trained against the predicted perturbation type to make the final prediction. We first theoretically show that adversarial examples created by different perturbation types constitute different distributions, which makes it possible to distinguish them. Further, we show that at test time the adversary faces a natural trade-off between fooling the perturbation type classifier and fooling the succeeding predictor optimized with perturbation-specific adversarial training. This makes it challenging for an adversary to plant strong attacks against the whole pipeline. In addition, we demonstrate the realization of this trade-off in deep networks by adding random noise to the model input at test time, enabling enhanced robustness against strong adaptive attacks. Extensive experiments on MNIST and CIFAR-10 show that PROTECTOR outperforms prior adversarial training based defenses by over 5% when tested against the union of ℓ1, ℓ2, and ℓ∞ attacks.¹

1. INTRODUCTION

There has been a long line of work studying the vulnerabilities of machine learning models to small changes in the input data. In particular, most existing works focus on ℓp bounded perturbations (Szegedy et al., 2013; Goodfellow et al., 2015). While the majority of prior work aims at achieving robustness against a single perturbation type (Madry et al., 2018; Kurakin et al., 2017; Tramèr et al., 2018; Dong et al., 2018; Zhang et al., 2019; Carmon et al., 2019), real-world deployment of machine learning models requires them to be robust against various imperceptible changes in the input, irrespective of the attack type. Prior work has shown that when models are trained to be robust against one perturbation type, such robustness typically does not transfer to attacks of a different type (Schott et al., 2018; Kang et al., 2019). As a result, recent works have proposed to develop models that are robust against the union of multiple perturbation types (Tramèr & Boneh, 2019; Maini et al., 2020). Specifically, these works consider adversaries limited by their ℓp distance from the original input for p ∈ {1, 2, ∞}. While these methods improve the overall robustness against multiple perturbation types, when evaluated against each individual perturbation type, models trained by these methods are still considerably less robust than those trained on a single perturbation type. Further, these methods have been found to be sensitive to small changes in hyperparameters. In this work, we propose an alternative view that does not require a single predictor to be robust against a union of perturbation types. Instead, we propose to utilize a union of predictors to improve the overall robustness, where each predictor is specialized to defend against certain perturbation types. In particular, we introduce the problem of categorizing adversarial examples based on their perturbation types.
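To make the threat model concrete, an ℓp bounded adversary may add to the input any perturbation δ with ‖δ‖p ≤ ε. The following is a minimal illustrative sketch (the function names are ours, not from any paper or library) of keeping a perturbation inside the ε-ball for p = ∞ and p = 2:

```python
import math

# Illustrative sketch of l_p bounded threat models: project a perturbation
# delta back into the epsilon-ball. Names here are illustrative only.

def project_linf(delta, eps):
    """l_inf ball: clip every coordinate to [-eps, eps]."""
    return [max(-eps, min(eps, d)) for d in delta]

def project_l2(delta, eps):
    """l_2 ball: rescale delta if its Euclidean norm exceeds eps."""
    norm = math.sqrt(sum(d * d for d in delta))
    return list(delta) if norm <= eps else [d * eps / norm for d in delta]

delta = [0.5, -0.4, 0.1]
print(project_linf(delta, 0.25))  # [0.25, -0.25, 0.1]
print(project_l2(delta, 0.25))    # rescaled so the l_2 norm equals 0.25
```

The key point is that the three ε-balls have very different geometry (e.g. ℓ∞ clips coordinates independently, while ℓ2 rescales the whole vector), which is why robustness to one type does not transfer to another.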
Based on this idea, we propose PROTECTOR, a two-stage pipeline that performs Perturbation Type Categorization for Robustness against multiple perturbations. First, a perturbation type classifier predicts the type of the attack. Then, among the second-level predictors, PROTECTOR selects the one that is most robust to the predicted perturbation type to make the final prediction. We validate our approach from both theoretical and empirical perspectives. First, we present a theoretical analysis showing that for benign samples with the same ground truth label, the resulting distributions become highly distinct when the samples are perturbed with different types of attacks, and thus can be separated. Further, we show that there exists a natural tension between attacking the top-level perturbation classifier and the second-level predictors: strong attacks against the second-level predictors make it easier for the perturbation classifier to predict the adversarial perturbation type, while fooling the perturbation classifier requires planting weaker (or less representative) attacks against the second-level predictors. As a result, even an imperfect perturbation classifier is sufficient to significantly improve the overall robustness of the model to multiple perturbation types. Empirically, we show that the perturbation type classifier generalizes well to classifying adversarial examples crafted against different adversarially trained models. We further compare PROTECTOR to the state-of-the-art defenses against multiple perturbations on MNIST and CIFAR-10. PROTECTOR outperforms prior approaches by over 5% against the union of ℓ1, ℓ2, and ℓ∞ attacks. While past work has focused on the worst-case metric across all attacks, on average these methods suffer significant trade-offs against individual attacks. Across the suite of 25 different attacks tested, PROTECTOR's average improvement over the state-of-the-art baseline defense is ∼15% on both MNIST and CIFAR-10.
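The two-stage routing described above can be sketched as follows. All components below are toy stand-ins for illustration (the actual top-level classifier and second-level predictors are trained networks):

```python
# Toy sketch of a two-stage PROTECTOR-style pipeline: a top-level classifier
# predicts the perturbation type, then the matching specialized predictor
# makes the final prediction. Every component here is a stand-in callable.

def protector_predict(x, type_classifier, predictors):
    """type_classifier: x -> a key of `predictors` (e.g. 'l1', 'l2', 'linf');
    predictors: dict mapping each perturbation type to a robust predictor."""
    ptype = type_classifier(x)
    return predictors[ptype](x)

# Stand-in components (illustrative rule and constant predictors):
type_classifier = lambda x: "linf" if max(abs(v) for v in x) < 0.5 else "l2"
predictors = {
    "linf": lambda x: "class_A",
    "l2": lambda x: "class_B",
    "l1": lambda x: "class_C",
}
print(protector_predict([0.1, -0.2], type_classifier, predictors))  # class_A
print(protector_predict([0.9, 0.0], type_classifier, predictors))   # class_B
```

One design consequence of this structure is modularity: any single second-level predictor can be swapped for a stronger defense against its perturbation type without retraining the rest of the pipeline.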
In particular, by adding random noise to the model input at test time, we further increase the tension between attacking the top-level and second-level components, yielding additional robustness against adaptive attackers. Additionally, PROTECTOR provides a modular way to integrate and update defenses against a single perturbation type.
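A hedged sketch of such test-time randomization is given below. We assume Gaussian noise and a simple majority vote over noisy copies; the exact noise distribution, scale, and aggregation rule are design choices not prescribed by the description above:

```python
import random

# Sketch of test-time input randomization: predict on several noisy copies
# of the input and take a majority vote over the predicted labels.
# The Gaussian noise and vote aggregation are illustrative assumptions.

def randomized_predict(x, predict_fn, sigma=0.1, n_samples=8, seed=0):
    rng = random.Random(seed)
    votes = {}
    for _ in range(n_samples):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        label = predict_fn(noisy)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# With a stand-in predictor that thresholds the first coordinate:
predict_fn = lambda x: int(x[0] > 0.0)
print(randomized_predict([1.0, 0.0], predict_fn))  # 1
```

Intuitively, randomization forces the adaptive attacker to succeed in expectation over the noise rather than at a single carefully chosen input point.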

2. RELATED WORK

Adversarial examples. The realization that adversarial examples exist in deep neural networks has spurred active research on attack algorithms and defense proposals (Szegedy et al., 2013). Among different types of attacks (Madry et al., 2018; Hendrycks et al., 2019; Hendrycks & Dietterich, 2019; Bhattad et al., 2020), the most commonly studied ones constrain the adversarial perturbation within an ℓp region of radius εp around the original input. To improve model robustness in the presence of such adversaries, the majority of existing defenses utilize adversarial training (Goodfellow et al., 2015), which augments the training dataset with adversarial images. To date, variants of the original adversarial training algorithm remain the most successful defenses against adversarial attacks (Carmon et al., 2019; Zhang et al., 2019; Wong et al., 2020; Rice et al., 2020). Other types of defenses include input transformations (Guo et al., 2018; Buckman et al., 2018) and network distillation (Papernot et al., 2016), but these were rendered ineffective under stronger adversaries (He et al., 2017; Carlini & Wagner, 2017a; Athalye et al., 2018; Tramer et al., 2020).
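For concreteness, the inner step of adversarial training can be approximated with a single signed-gradient (FGSM-style) perturbation. Below is a toy sketch using a numeric gradient on a stand-in loss, not any paper's training code:

```python
# Toy FGSM-style step: move x by eps in the sign of the loss gradient,
# i.e. toward higher loss. The gradient is estimated numerically here so
# the sketch stays self-contained; real training uses autodiff.

def fgsm_step(x, loss_fn, eps, h=1e-5):
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        grad.append((loss_fn(xp) - loss_fn(xm)) / (2 * h))
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Stand-in loss with minimum at (1, -2); the step moves x away from it.
loss = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
x_adv = fgsm_step([0.0, 0.0], loss, eps=0.1)
print(x_adv)  # approximately [-0.1, 0.1], where the loss is higher
```

Adversarial training then labels `x_adv` with the clean example's label and includes it in the training batch; stronger variants iterate this step (PGD) with projection back into the ε-ball.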



¹ We will open-source the code, pre-trained models, and perturbation type datasets upon publication.



Other works have explored the relation between input randomization and adversarial examples. Tabacof & Valle (2016) analyzed how adversarial robustness changes with varying levels of noise. Hu et al. (2019) evaluated the robustness of a data point to random noise to detect adversarial examples, whereas Cohen et al. (2019) utilized randomized smoothing to obtain certified robustness to adversarial attacks.

Defenses against multiple perturbation types. Recent research has moved towards the goal of universal adversarial robustness. Since ℓp-norm bounded attacks are amongst the strongest attacks in the adversarial examples literature, defending against a union of such attacks is an important step towards this end goal. Schott et al. (2018) and Kang et al. (2019) showed that models trained against a given ℓp-norm bounded attack are not robust against attacks in a different ℓq region. Subsequent work has aimed at developing a single model that is robust against the union of multiple perturbation types. Schott et al. (2018) proposed the use of multiple variational autoencoders to achieve robustness to multiple ℓp attacks on the MNIST dataset. Tramèr & Boneh (2019) used simple aggregations of multiple adversaries to achieve non-trivial robust accuracy against the union of the ℓ1, ℓ2, and ℓ∞ regions. Maini et al. (2020) proposed the MSD algorithm, which takes gradient steps in the union of multiple ℓp regions to improve multiple-perturbation robustness. In a related line of work, Croce & Hein (2020a)

