INTRIGUING CLASS-WISE PROPERTIES OF ADVERSARIAL TRAINING

Abstract

Adversarial training is one of the most effective approaches to improving model robustness against adversarial examples. However, previous works mainly focus on the overall robustness of the model, and an in-depth analysis of the role each class plays in adversarial training is still missing. In this paper, we provide the first detailed class-wise diagnosis of adversarial training on six widely used datasets, i.e., MNIST, CIFAR-10, CIFAR-100, SVHN, STL-10 and ImageNet. Surprisingly, we find that there are remarkable robustness discrepancies among classes, which exhibit the following intriguing properties: 1) Many examples from a certain class can only be maliciously attacked into a few specific semantically similar classes, and these examples no longer have adversarial counterparts within the bounded ε-ball if we re-train the model without those specific classes; 2) The robustness of each class is positively correlated with the norm of its classifier weights in deep neural networks; 3) Stronger attacks are usually more effective against vulnerable classes. Finally, we propose an attack to better understand the defense mechanisms of some state-of-the-art models from the class-wise perspective. We believe these findings contribute to a more comprehensive understanding of adversarial training as well as to further improvement of adversarial robustness.

1. INTRODUCTION

The existence of adversarial examples (Szegedy et al., 2014) reveals the vulnerability of deep neural networks, which greatly hinders the practical deployment of deep learning models. Adversarial training (Madry et al., 2018) has been demonstrated to be one of the most successful defense methods by Athalye et al. (2018). Some researchers (Zhang et al., 2019; Wang et al., 2019b; Carmon et al., 2019; Song et al., 2019) have further improved adversarial training through various techniques. Although these efforts have advanced adversarial training, the performance of robust models is still far from satisfactory, so new perspectives are needed to break the current dilemma. We notice that focusing on the differences among classes has achieved great success in research on noisy labels (Wang et al., 2019a) and long-tailed data (Kang et al., 2019), while researchers in the adversarial community mainly concentrate on overall robustness. This raises a question:

How is the performance of each class in the adversarially robust model?

To explore this question, we conduct extensive experiments on six datasets commonly used in adversarial training, i.e., MNIST (LeCun et al., 1998), CIFAR-10 & CIFAR-100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), STL-10 (Coates et al., 2011) and ImageNet (Deng et al., 2009); the pipeline of adversarial training and evaluation follows Madry et al. (2018) and Wong et al. (2019). Figure 1 plots the robustness of each class at different epochs on the test set, where the shaded area in each sub-figure represents the robustness gap between classes across epochs. Given the large number of classes in CIFAR-100 and ImageNet, we randomly sample 12 classes for clearer presentation, and the number of classes in each robustness interval is shown in Appendix A. From Figure 1, we surprisingly find recognizable robustness gaps between classes on all datasets. Specifically, for SVHN, CIFAR-10, STL-10 and CIFAR-100, the class-wise robustness gaps are obvious, and the largest gaps can reach 40%-50% (Figure 1(b)-1(e)). For
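The class-wise diagnosis above reduces to attacking each test example and aggregating attack outcomes per ground-truth class. The following is only an illustrative sketch, not the paper's actual pipeline (which trains deep networks with PGD following Madry et al. (2018)): it runs a one-step FGSM-style ℓ∞ attack on a toy multinomial logistic-regression model and reports per-class robust accuracy together with the gap between the most and least robust class. The function names and toy setup here are ours, chosen for exposition.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fgsm_attack(X, y, W, b, eps):
    """One-step l_inf attack: perturb each input by eps in the direction
    of the sign of the cross-entropy loss gradient w.r.t. the input."""
    P = softmax(X @ W + b)                  # (n, C) class probabilities
    P[np.arange(len(y)), y] -= 1.0          # d(cross-entropy)/d(logits)
    grad_X = P @ W.T                        # chain rule back to the inputs
    return X + eps * np.sign(grad_X)        # ascend the loss, stay in the ball

def class_wise_robustness(X, y, W, b, eps, num_classes):
    """Per-class robust accuracy: the fraction of examples of each class
    still classified correctly after the attack, plus the largest gap."""
    X_adv = fgsm_attack(X, y, W, b, eps)
    correct = (X_adv @ W + b).argmax(axis=1) == y
    acc = np.array([correct[y == c].mean() for c in range(num_classes)])
    return acc, acc.max() - acc.min()
```

In this toy form, a class whose examples sit close to the decision boundary loses accuracy under the same eps budget while a well-separated class does not, which is exactly the kind of per-class discrepancy the shaded areas in Figure 1 visualize for deep networks.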

