INEQUALITY PHENOMENON IN ℓ∞-ADVERSARIAL TRAINING, AND ITS UNREALIZED THREATS

Abstract

The appearance of adversarial examples has drawn attention from both academia and industry. Along the attack-defense arms race, adversarial training is regarded as the most effective defense against adversarial examples. However, we find that inequality phenomena occur during ℓ∞-adversarial training: a few features dominate the prediction made by the adversarially trained model. We systematically evaluate these inequality phenomena through extensive experiments and find that they become more pronounced when adversarial training is performed with increasing adversarial strength (measured by ϵ). We hypothesize that these inequality phenomena make the ℓ∞-adversarially trained model less reliable than the standard-trained model when the few important features are influenced. To validate our hypothesis, we propose two simple attacks that perturb important features with either noise or occlusion. Experiments show that the ℓ∞-adversarially trained model can be easily attacked when a few important features are influenced. Our work sheds light on the limited practicality of ℓ∞-adversarial training.

1. INTRODUCTION

The discovery of adversarial examples in deep neural networks (DNNs) poses significant threats to deep learning-based applications such as autonomous driving and face recognition. Before DNN-based applications can be deployed safely and securely in real-world scenarios, we must defend against adversarial examples. Since the emergence of adversarial examples, several defensive strategies have been proposed (Guo et al., 2018; Prakash et al., 2018; Mummadi et al., 2019; Akhtar et al., 2018). By retraining on adversarial samples generated in each training loop, adversarial training (Goodfellow et al., 2015; Zhang et al., 2019; Madry et al., 2018b) is regarded as the most effective defense against adversarial attacks. The most prevalent form is ℓ∞-adversarial training, which uses adversarial samples whose perturbation is ℓ∞-bounded by ϵ. Numerous works have been devoted to the theoretical and empirical understanding of adversarial training (Andriushchenko & Flammarion, 2020; Allen-Zhu & Li, 2022; Kim et al., 2021). For example, Ilyas et al. (2019) proposed that an adversarially trained model (robust model for short) learns robust features from adversarial examples and discards non-robust ones. Engstrom et al. (2019) also proposed that adversarial training forces the model to become invariant to features to which humans are also invariant. Therefore, adversarial training yields feature representations in robust models that are more comparable to those of humans. As theoretically validated by Chalasani et al. (2020), ℓ∞-adversarial training suppresses the significance of redundant features, and the robust model therefore has sparser and better-behaved feature representations than the standard-trained model. In general, previous research indicates that robust models have sparse feature representations and views such sparsity as advantageous because it is more human-aligned.
Several works investigate this property of robust models and attempt to transfer such feature representations to a standard-trained model using various methods (Ross & Doshi-Velez, 2018; Salman et al., 2020; Deng et al., 2021). However, contrary to the claim of previous work that such sparse feature representation is an advantage, we find that this sparseness also indicates inequality phenomena (see Section 3.1 for a detailed explanation) that may pose unanticipated threats to ℓ∞-robust models. During ℓ∞-adversarial training, the model not only suppresses redundant features (Chalasani et al., 2020) but also suppresses the importance of other features, including robust ones. The degree of suppression is proportional to the adversarial attack budget (measured by ϵ). Hence, given the input images, only a handful of features dominate the prediction of an ℓ∞-robust model. Intuitively, standard-trained models make decisions based on a variety of features, and some redundant features serve as a "bulwark" when a few crucial features are accidentally distorted. In the case of an ℓ∞-robust model, however, the decision is primarily determined by a small number of features, so the prediction is susceptible to change when these significant features are modified (see Figure 1). As shown in Figure 1, an ℓ∞-robust model recognizes a street sign using very few regions of the sign. Even with very small occlusions, the robust model cannot recognize the street sign if we obscure the region that the model considers most important (but which is well recognized by humans and the standard-trained model). Even if an autonomous vehicle is deployed with a robust model that achieves high adversarial robustness against worst-case adversarial examples, it will still be susceptible to small occlusions. Thus, the applicability of such a robust model is debatable.

Figure 1: An ℓ∞-robust model fails to recognize a street sign with small occlusions.
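The attribution-guided occlusion described above can be sketched in a few lines. The following is a minimal illustration, not the paper's exact attack: we stand in for a trained classifier with a toy linear scorer (so gradient×input attribution reduces to w·x), and the patch size and number of patches are hypothetical choices. The "unequal" weight layout mimics a robust model whose prediction is dominated by a few pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a classifier: a linear scorer f(x) = <w, x>.
# A real attack would use a trained network and an attribution method
# such as gradient x input; both are assumptions of this sketch.
H = W = 16
weights = np.full((H, W), 0.01)   # most pixels barely contribute ...
weights[4:8, 4:8] = 1.0           # ... while a few dominate ("inequality")

def score(image):
    return float((weights * image).sum())

def attribution(image):
    # Gradient x input for a linear scorer is simply w * x.
    return np.abs(weights * image)

def occlude_top_patches(image, attr, patch=4, n_patches=2, fill=0.0):
    """Occlude the aligned patch x patch regions with the highest total
    attribution, mimicking an attribution-guided occlusion attack."""
    img = image.copy()
    cand = [(attr[i:i + patch, j:j + patch].sum(), i, j)
            for i in range(0, H - patch + 1, patch)
            for j in range(0, W - patch + 1, patch)]
    cand.sort(reverse=True)
    for _, i, j in cand[:n_patches]:
        img[i:i + patch, j:j + patch] = fill
    return img

x = rng.uniform(0.5, 1.0, size=(H, W))
x_occ = occlude_top_patches(x, attribution(x))
print(score(x), score(x_occ))  # occluding the few dominant pixels collapses the score
```

Because almost all of the score mass sits in one small region, occluding just two small patches destroys the prediction, while the same occlusion on a model with evenly spread weights would barely move it.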
Given feature attribution maps that attribute the importance of each pixel, we occlude the image's most important pixels with small patches. The resultant image successfully fools the robust model. (We note that prior works (Tsipras et al.) showed that feature attribution maps of robust models are perceptually aligned; for clarity, we strongly suggest the readers check Appendix A.2.) In this work, we name the phenomenon that only a few features are extremely crucial for a model's recognition the "inequality phenomenon". We study this inequality from two aspects: 1) global inequality, characterized by the dominance of a small number of pixels; and 2) regional inequality, characterized by the tendency of pixels deemed significant by the model to cluster in particular regions. We analyze these phenomena on ImageNet- and CIFAR10-trained models with various architectures. Based on our findings, we further devise attacks that expose the vulnerabilities resulting from such inequality. Experiments demonstrate that, under the premise that human observers can recognize the resulting images, ℓ∞-robust models are significantly more susceptible than standard-trained models: they are vulnerable to occlusion and noise with error rates of 100% and 94% respectively, whereas standard-trained models are only affected at 30.1% and 34.5%. In summary, our contributions are as follows:

• We identify the occurrence of the inequality phenomenon during ℓ∞-adversarial training. We design correlative indices and assess the inequality phenomena from both global and regional perspectives, systematically evaluating them through extensive experiments on a broad range of datasets and models.

• We identify unrealized threats posed by these inequality phenomena: ℓ∞-robust models are much more vulnerable than standard-trained ones under inductive noise or occlusion. In this sense, during ℓ∞-adversarial training, adversarial robustness is achieved at the expense of another, more practical robustness.

• Our work provides an intuitive understanding of the weakness of the ℓ∞-robust model's feature representation from a novel perspective. Moreover, our work sheds light on the limitations and the hardness of ℓ∞-adversarial training.
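Global inequality of an attribution map can be quantified with a standard inequality index. The sketch below uses a Gini coefficient over per-pixel attribution values as one natural choice; the paper's correlative indices may be defined differently, so treat this purely as an illustration of the idea.

```python
import numpy as np

def gini(values):
    """Gini coefficient of non-negative values: 0 means perfectly equal
    attribution; values approaching 1 mean a single pixel holds all the mass."""
    v = np.sort(np.asarray(values, dtype=float).ravel())
    n = v.size
    if v.sum() == 0:
        return 0.0
    # Standard identity for the Gini coefficient on sorted values.
    idx = np.arange(1, n + 1)
    return float((2 * idx - n - 1) @ v / (n * v.sum()))

# A uniform attribution map is maximally equal ...
print(gini(np.ones(100)))      # 0.0
# ... while attribution concentrated on one pixel is maximally unequal.
spike = np.zeros(100)
spike[0] = 1.0
print(gini(spike))             # 0.99
```

Under this index, the observation in the paper corresponds to the Gini coefficient of a robust model's attribution maps growing with the training budget ϵ, i.e., attribution mass concentrating on ever fewer pixels.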

