PERTURBATION TYPE CATEGORIZATION FOR MULTIPLE ℓp BOUNDED ADVERSARIAL ROBUSTNESS

Abstract

Despite recent advances in adversarial training based defenses, deep neural networks are still vulnerable to adversarial attacks outside the perturbation type they are trained to be robust against. Recent works have proposed defenses to improve the robustness of a single model against the union of multiple perturbation types. However, when evaluated against each individual attack, these methods still suffer significant trade-offs compared to models specifically trained to be robust against that perturbation type. In this work, we introduce the problem of categorizing adversarial examples based on their ℓp perturbation types. Based on our analysis, we propose PROTECTOR, a two-stage pipeline to improve the robustness against multiple perturbation types. Instead of training a single predictor, PROTECTOR first categorizes the perturbation type of the input, and then utilizes a predictor specifically trained against the predicted perturbation type to make the final prediction. We first theoretically show that adversarial examples created by different perturbation types constitute different distributions, which makes it possible to distinguish them. Further, we show that at test time the adversary faces a natural trade-off between fooling the perturbation type classifier and fooling the succeeding predictor optimized with perturbation-specific adversarial training. This makes it challenging for an adversary to mount strong attacks against the whole pipeline. In addition, we demonstrate the realization of this trade-off in deep networks by adding random noise to the model input at test time, enabling enhanced robustness against strong adaptive attacks. Extensive experiments on MNIST and CIFAR-10 show that PROTECTOR outperforms prior adversarial training based defenses by over 5% when tested against the union of ℓ1, ℓ2, and ℓ∞ attacks.1
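The two-stage design described above can be illustrated with a minimal sketch. All names here are hypothetical stand-ins: `dummy_type_classifier` plays the role of the perturbation type classifier, and `make_dummy_predictor` plays the role of an expert trained with perturbation-specific adversarial training; in the actual pipeline both would be deep networks.

```python
import random

# Hypothetical perturbation types considered by the pipeline.
PERTURBATION_TYPES = ["l1", "l2", "linf"]

def make_dummy_predictor(ptype):
    """Stand-in for a predictor adversarially trained against one perturbation type."""
    def predict(x):
        return f"label_from_{ptype}_expert"
    return predict

def dummy_type_classifier(x):
    """Stage 1 stand-in: categorize the perturbation type of the input.

    A real classifier would be trained on adversarial examples of each type;
    here we simply return a fixed type for illustration.
    """
    return "l2"

def protector_pipeline(x, noise_scale=0.01, rng=None):
    """Two-stage inference with test-time input randomization.

    Small random noise is added to the input before both stages, reflecting
    the paper's use of test-time noise to resist adaptive attacks.
    """
    rng = rng or random.Random(0)
    x_noisy = [xi + rng.uniform(-noise_scale, noise_scale) for xi in x]
    ptype = dummy_type_classifier(x_noisy)            # stage 1: categorize
    experts = {t: make_dummy_predictor(t) for t in PERTURBATION_TYPES}
    return experts[ptype](x_noisy)                    # stage 2: expert predicts

print(protector_pipeline([0.1, 0.5, 0.9]))  # -> label_from_l2_expert
```

The key point of the design is the routing step: an adaptive adversary must either evade the type classifier (and be handed to a mismatched but still adversarially trained expert) or attack the expert matched to its own perturbation type.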

1. INTRODUCTION

There has been a long line of work studying the vulnerabilities of machine learning models to small changes in the input data. In particular, most existing works focus on ℓp bounded perturbations (Szegedy et al., 2013; Goodfellow et al., 2015). While the majority of prior work aims at achieving robustness against a single perturbation type (Madry et al., 2018; Kurakin et al., 2017; Tramèr et al., 2018; Dong et al., 2018; Zhang et al., 2019; Carmon et al., 2019), real-world deployment of machine learning models requires them to be robust against various imperceptible changes in the input, irrespective of the attack type. Prior work has shown that when models are trained to be robust against one perturbation type, such robustness typically does not transfer to attacks of a different type (Schott et al., 2018; Kang et al., 2019). As a result, recent works have proposed to develop models that are robust against the union of multiple perturbation types (Tramèr & Boneh, 2019; Maini et al., 2020). Specifically, these works consider adversaries limited by their ℓp distance from the original input for p ∈ {1, 2, ∞}. While these methods improve the overall robustness against multiple perturbation types, when evaluated against each individual perturbation type, the robustness of models trained by these methods is still considerably worse than that of models trained on a single perturbation type. Further, these methods are found to be sensitive to small changes in hyperparameters. In this work, we propose an alternative view that does not require a single predictor to be robust against a union of perturbation types. Instead, we propose to utilize a union of predictors to improve the



1 We will open-source the code, pre-trained models, and perturbation type datasets upon publication.

