INTRIGUING CLASS-WISE PROPERTIES OF ADVERSARIAL TRAINING

Abstract

Adversarial training is one of the most effective approaches to improve model robustness against adversarial examples. However, previous works mainly focus on the overall robustness of the model, and an in-depth analysis of the role of each class in adversarial training is still missing. In this paper, we provide the first detailed class-wise diagnosis of adversarial training on six widely used datasets, i.e., MNIST, CIFAR-10, CIFAR-100, SVHN, STL-10 and ImageNet. Surprisingly, we find that there are remarkable robustness discrepancies among classes, demonstrating the following intriguing properties: 1) Many examples from a certain class can only be maliciously attacked to some specific semantically similar classes, and these examples no longer have adversarial counterparts within the bounded ε-ball if we re-train the model without those specific classes; 2) The robustness of each class is positively correlated with the norm of its classifier weight in deep neural networks; 3) Stronger attacks are usually more powerful against vulnerable classes. Finally, we propose an attack to better understand the defense mechanism of some state-of-the-art models from the class-wise perspective. We believe these findings can contribute to a more comprehensive understanding of adversarial training as well as further improvement of adversarial robustness.

1. INTRODUCTION

The existence of adversarial examples (Szegedy et al., 2014) reveals the vulnerability of deep neural networks, which greatly hinders the practical deployment of deep learning models. Adversarial training (Madry et al., 2018) has been demonstrated by Athalye et al. (2018) to be one of the most successful defense methods. Some researchers (Zhang et al., 2019; Wang et al., 2019b; Carmon et al., 2019; Song et al., 2019) have further improved adversarial training through various techniques. Although these efforts have advanced adversarial training, the performance of robust models is still far from satisfactory, and new perspectives are needed to break the current dilemma.

We notice that focusing on the differences among classes has achieved great success in research on noisy labels (Wang et al., 2019a) and long-tailed data (Kang et al., 2019), whereas researchers in the adversarial community mainly concentrate on overall robustness. A question is then naturally raised: How does each class perform in an adversarially robust model?

To explore this question, we conduct extensive experiments on six datasets commonly used in adversarial training, i.e., MNIST (LeCun et al., 1998), CIFAR-10 & CIFAR-100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), STL-10 (Coates et al., 2011) and ImageNet (Deng et al., 2009); the pipeline for adversarial training and evaluation follows Madry et al. (2018) and Wong et al. (2019). Figure 1 plots the robustness of each class at different epochs on the test set, where the shaded area in each sub-figure represents the robustness gap between classes across epochs. Considering the large number of classes in CIFAR-100 and ImageNet, we randomly sample 12 classes for a clearer illustration, and the number of classes in each robustness interval is shown in Appendix A. From Figure 1, we surprisingly find that there are recognizable robustness gaps between classes on all datasets.
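The class-wise evaluation described above amounts to tallying robust accuracy separately per class, rather than a single aggregate number. A minimal sketch of this bookkeeping (function name and toy labels are illustrative, not taken from the paper's code):

```python
import numpy as np

def class_wise_robustness(y_true, y_adv_pred, num_classes):
    """Per-class robust accuracy: the fraction of examples of each class
    that remain correctly classified under attack."""
    y_true = np.asarray(y_true)
    y_adv_pred = np.asarray(y_adv_pred)
    acc = np.zeros(num_classes)
    for c in range(num_classes):
        mask = y_true == c
        acc[c] = (y_adv_pred[mask] == c).mean() if mask.any() else np.nan
    return acc

# Toy example: three classes with unequal robustness under attack.
y_true = [0, 0, 1, 1, 2, 2]
y_adv  = [0, 0, 1, 2, 2, 0]   # class 0 fully robust; classes 1 and 2 half robust
acc = class_wise_robustness(y_true, y_adv, 3)
gap = np.nanmax(acc) - np.nanmin(acc)  # the class-wise robustness gap
```

The `gap` quantity here corresponds to the shaded areas in Figure 1: the spread between the most and least robust classes.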
Specifically, for SVHN, CIFAR-10, STL-10 and CIFAR-100, the class-wise robustness gaps are obvious and the largest gaps can reach 40%-50% (Figure 1(b)-1(e)). Even on the simplest dataset, MNIST, where the model achieves more than 95% overall robustness, the largest class-wise robustness gap is still 6% (Figure 1(a)). Motivated by the above discovery, we naturally raise the following three questions to better investigate the class-wise properties of the robust model: 1) Are there any relations among these different classes, given that they perform differently? 2) Are there any factors related to the above phenomenon? 3) Is the class-wise performance related to the strength of the attack? We conduct extensive analysis on the obtained robust models and gain the following insights:

• Many examples from a certain class can only be maliciously flipped to some specific classes. Once we remove those specific classes and re-train the model, these examples no longer have adversarial counterparts within the bounded ε-ball.

• The robustness of each class is near-monotonically related to the norm of its classifier weight in deep neural networks.
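The weight-norm property above relates each class's robustness to the L2 norm of that class's row in the network's final linear layer. A minimal numpy sketch of how these per-class norms can be read off (the function name and toy weight matrix are illustrative, not from the paper):

```python
import numpy as np

def per_class_weight_norms(W):
    """L2 norm of each class's row of the final classifier layer.
    W has shape [num_classes, feature_dim]; row c scores class c."""
    return np.linalg.norm(W, axis=1)

# Toy 2-class, 2-feature classifier: class 1 carries a much larger norm,
# which, per the property above, would suggest higher robustness for it.
W = np.array([[0.5, 0.0],
              [3.0, 4.0]])
norms = per_class_weight_norms(W)
```

In practice one would extract `W` from the trained model's last fully connected layer and compare the resulting norms against the per-class robust accuracies.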

Figure 1: Class-wise robustness at different epochs on the test set.

• In both white-box and black-box settings (Dong et al., 2020), stronger attacks are usually more effective against vulnerable classes (i.e., classes whose robustness is lower than the overall robustness).

Furthermore, we propose a simple but effective attack called the Temperature-PGD attack. It gives us a deeper understanding of how variants of Madry's model work, especially for robust models with obvious improvements on vulnerable classes (Wang et al., 2019b; Pang et al., 2020). We hope our work draws the attention of future researchers to the robustness discrepancies among classes.
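The Temperature-PGD attack is only named in this section, not defined. The sketch below assumes one natural reading: standard PGD ascent on a cross-entropy loss whose logits are divided by a temperature T, demonstrated on a toy linear softmax model with a hand-derived gradient. All names, hyperparameters, and the temperature placement are assumptions for illustration, not the paper's actual attack.

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # numerical stability
    e = np.exp(z)
    return e / e.sum()

def temperature_pgd(x, y, W, eps, alpha, steps, T):
    """PGD ascent on a linear softmax classifier with temperature-scaled
    logits. For loss = -log softmax(W @ x / T)[y], the input gradient is
    W.T @ (p - onehot(y)) / T, where p = softmax(W @ x / T)."""
    x0, x_adv = x.copy(), x.copy()
    onehot = np.zeros(W.shape[0])
    onehot[y] = 1.0
    for _ in range(steps):
        p = softmax(W @ x_adv / T)
        grad = W.T @ (p - onehot) / T
        x_adv = x_adv + alpha * np.sign(grad)           # signed ascent step
        x_adv = x0 + np.clip(x_adv - x0, -eps, eps)     # project into eps-ball
    return x_adv

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))       # toy 3-class, 4-feature classifier
x = rng.normal(size=4)
x_adv = temperature_pgd(x, y=0, W=W, eps=0.1, alpha=0.02, steps=10, T=0.5)
```

A low temperature (T < 1) sharpens the softmax and rescales the loss gradient, which is one plausible way such an attack could redistribute attack strength across classes; the paper's own construction may differ.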

