INTRIGUING CLASS-WISE PROPERTIES OF ADVERSARIAL TRAINING

Abstract

Adversarial training is one of the most effective approaches to improve model robustness against adversarial examples. However, previous works mainly focus on the overall robustness of the model, and an in-depth analysis of the role each class plays in adversarial training is still missing. In this paper, we provide the first detailed class-wise diagnosis of adversarial training on six widely used datasets, i.e., MNIST, CIFAR-10, CIFAR-100, SVHN, STL-10 and ImageNet. Surprisingly, we find that there are remarkable robustness discrepancies among classes, demonstrating the following intriguing properties: 1) Many examples from a certain class can only be maliciously attacked to some specific, semantically similar classes, and these examples no longer have adversarial counterparts within the bounded ε-ball if we re-train the model without those specific classes; 2) The robustness of each class is positively correlated with the norm of its classifier weight in deep neural networks; 3) Stronger attacks are usually more powerful on vulnerable classes. Finally, we propose an attack to better understand the defense mechanism of some state-of-the-art models from the class-wise perspective. We believe these findings can contribute to a more comprehensive understanding of adversarial training as well as further improvement of adversarial robustness.

1. INTRODUCTION

The existence of adversarial examples (Szegedy et al., 2014) reveals the vulnerability of deep neural networks, which greatly hinders the practical deployment of deep learning models. Adversarial training (Madry et al., 2018) has been demonstrated to be one of the most successful defense methods by Athalye et al. (2018). Some researchers (Zhang et al., 2019; Wang et al., 2019b; Carmon et al., 2019; Song et al., 2019) have further improved adversarial training through various techniques. Although these efforts have promoted the progress of adversarial training, the performance of robust models is far from satisfactory, and new perspectives are needed to break the current dilemma.

We notice that focusing on the differences among classes has achieved great success in the research of noisy labels (Wang et al., 2019a) and long-tailed data (Kang et al., 2019), while researchers in the adversarial community mainly concentrate on the overall robustness. A question is then raised: How does each class perform in an adversarially robust model?

To explore this question, we conduct extensive experiments on six datasets commonly used in adversarial training, i.e., MNIST (LeCun et al., 1998), CIFAR-10 & CIFAR-100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), STL-10 (Coates et al., 2011) and ImageNet (Deng et al., 2009); the pipeline of adversarial training and evaluation follows Madry et al. (2018) and Wong et al. (2019). Figure 1 plots the robustness of each class at different epochs on the test set, where the shaded area in each sub-figure represents the robustness gap between classes across epochs. Considering the large number of classes in CIFAR-100 and ImageNet, we randomly sample 12 classes for better illustration, and the number of classes in each robustness interval is shown in Appendix A. From Figure 1, we surprisingly find that there are recognizable robustness gaps between different classes for all datasets.
Specifically, for SVHN, CIFAR-10, STL-10 and CIFAR-100, the class-wise robustness gaps are obvious and the largest gaps can reach 40%-50% (Figure 1(b)-1(e)). For ImageNet, since the model uses the three-stage training method (Wong et al., 2019), its class-wise robustness gap increases with the training epoch, finally reaching up to 80% (Figure 1(f)). Even for the simplest dataset, MNIST, on which the model achieves more than 95% overall robustness, the largest class-wise robustness gap is still 6% (Figure 1(a)).

Motivated by the above discovery, we naturally raise the following three questions to better investigate the class-wise properties of the robust model: 1) Are there any relations among these different classes, given that they perform differently? 2) Are there any factors related to the above phenomenon? 3) Is the class-wise performance related to the strength of the attack? We conduct extensive analysis on the obtained robust models and gain the following insights: • Many examples from a certain class can only be maliciously flipped to some specific classes. As long as we remove those specific classes and re-train the model, these examples no longer have adversarial counterparts within the bounded ε-ball. • The robustness of each class is nearly monotonically related to the norm of its classifier weight in deep neural networks. • In both white-box and black-box settings (Dong et al., 2020), stronger attacks are usually more effective on vulnerable classes (i.e., classes whose robustness is lower than the overall robustness). Furthermore, we propose a simple but effective attack called the Temperature-PGD attack. It gives us a deeper understanding of how variants of Madry's model work, especially robust models with obvious improvement on vulnerable classes (Wang et al., 2019b; Pang et al., 2020). Thus, our work draws the attention of future researchers to the robustness discrepancies among classes.
2. RELATED WORK

Adversarial training. Adversarial training (Madry et al., 2018) is often formulated as a min-max optimization problem. The inner maximization applies the Projected Gradient Descent (PGD) attack to craft adversarial examples, and the outer minimization uses these examples as augmented data to train the model. Subsequent works have been proposed to further improve adversarial training, including introducing regularization terms (Zhang et al., 2019). Since adversarial training is more time-consuming than standard training, several methods (Shafahi et al., 2019; Wong et al., 2019) have been proposed to accelerate the adversarial training process.

Exploring the properties of adversarial training. Many researchers try to understand adversarial training from different perspectives. Schmidt et al. (2018) find that more data can improve adversarial training. Tsipras et al. (2019) demonstrate that adversarial robustness may be inherently at odds with natural accuracy. Zhang & Zhu (2019) visualize the features of the robust model. Xie & Yuille (2019) explore its scalability. The work of Ortiz-Jimenez et al. (2020) is most relevant to ours. The difference is that they focus on the distance from each example to the decision boundary, while we provide new insights on the role of different classes in adversarial training.

3. EXPLORING THE PROPERTIES AMONG DIFFERENT CLASSES IN ADVERSARIAL TRAINING

Inspired by the phenomenon in Figure 1, we further conduct a class-wise analysis to explore the properties among different classes for a better understanding of adversarial training. Datasets. We use six benchmark datasets in adversarial training to obtain the corresponding robust models, i.e., MNIST (LeCun et al., 1998), CIFAR-10 & CIFAR-100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), STL-10 (Coates et al., 2011) and ImageNet (Deng et al., 2009). Table 1 highlights that the classes of CIFAR-10 and STL-10 can be grouped into two superclasses: Transportation and Animals. Similarly, CIFAR-100 contains 20 superclasses, each with 5 subclasses. See Appendix B for more details of all datasets.

Table 1: Superclasses in CIFAR-10 and STL-10.

Dataset  | Transportation                               | Animals
CIFAR-10 | Airplane (0)¹, Automobile (1), Ship (8), Truck (9) | Bird (2), Cat (3), Deer (4), Dog (5), Frog (6), Horse (7)
STL-10   | Airplane (0), Car (2), Ship (8), Truck (9)   | Bird (1), Cat (3), Deer (4), Dog (5), Horse (6), Monkey (7)

¹ The number in brackets is the numeric label of the class in the dataset.

For the ImageNet dataset, the pipeline of adversarial training follows Wong et al. (2019), while the training methods for the other datasets follow Madry et al. (2018). The detailed experimental settings are: MNIST setup. Following Zhang et al. (2019), we use a four-layer CNN as the backbone. In the training phase, we adopt the SGD optimizer (Zinkevich et al., 2010) with momentum 0.9, weight decay 2 × 10^-4 and an initial learning rate of 0.01, which is divided by 10 at the 55th, 75th and 90th epoch (100 epochs in total). Both the training and testing attackers are 40-step PGD (PGD40) with random start, maximum perturbation ε = 0.3 and step size α = 0.01. CIFAR-10 & CIFAR-100 setup. Like Wang et al. (2019b) and Zhang et al. (2019), we use ResNet-18 (He et al., 2016) as the backbone.
In the training phase, we use the SGD optimizer with momentum 0.9, weight decay 2 × 10^-4 and an initial learning rate of 0.1, which is divided by 10 at the 75th and 90th epoch (100 epochs in total). The training and testing attackers are PGD10/PGD20 with random start, maximum perturbation ε = 0.031 and step size α = 0.007. SVHN & STL-10 setup. All settings are the same as for CIFAR-10 & CIFAR-100, except that the initial learning rate is 0.01. ImageNet setup. Following Shafahi et al. (2019) and Wong et al. (2019), we use ResNet-50 (He et al., 2016) as the backbone. Specifically, in the training phase, we use the SGD optimizer with momentum 0.9 and weight decay 2 × 10^-4. A three-stage learning rate schedule is used, the same as Wong et al. (2019). The training attacker is FGSM (Goodfellow et al., 2015) with random start and maximum perturbation ε = 0.007, and the testing attacker is PGD50 with random start, maximum perturbation ε = 0.007 and step size α = 0.003. For ImageNet, the model at the 14th epoch is used to evaluate robustness, as in Wong et al. (2019). For the other datasets, the model at the 75th epoch is used, as in Madry et al. (2018). These settings are fixed for all experiments unless otherwise stated. This paper mainly focuses on the adversarial robustness of the model; comparisons of the class-wise performance between the robust model and the standard model are given in Appendix C to save space.
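As a concrete reference for the inner maximization used throughout these setups, the following is a minimal numpy sketch of l∞ PGD with random start. It uses a toy linear softmax classifier so that the input gradient has the closed form W^T(p − onehot(y)), standing in for backpropagation through a deep network; the function and variable names are ours, and the hyperparameters mirror the CIFAR-10 setup (ε = 0.031, α = 0.007, 10 steps).

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pgd_attack(x, y, W, b, eps=0.031, alpha=0.007, steps=10, rng=None):
    """l_inf PGD with random start on a linear softmax classifier f(x) = Wx + b.

    For the cross-entropy loss, the gradient w.r.t. the input is
    W^T (p - onehot(y)), so each step moves x along sign(grad) and then
    projects the perturbation back into the eps-ball via clipping.
    """
    rng = np.random.default_rng(rng)
    delta = rng.uniform(-eps, eps, size=x.shape)      # random start
    for _ in range(steps):
        p = softmax(W @ (x + delta) + b)
        p[y] -= 1.0                                    # p - onehot(y)
        grad = W.T @ p                                 # dL/dx for cross-entropy
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)  # projection
    return x + delta
```

For a deep network the same loop applies, with the closed-form gradient replaced by autograd through the model.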

3.1. THE RELATIONS AMONG DIFFERENT CLASSES

We first systematically investigate the relations among different classes under robust models. Figure 2 shows the confusion matrices of robustness between classes on all six datasets. The X-axis and Y-axis represent the predicted classes and the ground-truth classes, respectively. The grids on the diagonal represent the robustness of each class, while the off-diagonal grids represent the fraction of examples from one class (Y-axis) that are misclassified as another class (X-axis).

Definition 2. (Robust Example) An example is a robust example if it has no adversarial counterpart within the bounded ε-ball, i.e., it is always correctly classified by the model.

Definition 3. (Homing Property) Given an adversarial example x from class i which is misclassified as the confound class j by a model, this example satisfies the homing property if it becomes a robust example after we re-train the model with confound class j removed.

To explore the above question, we conduct extensive experiments on the popular dataset CIFAR-10 as a case study. The results are reported in Figure 3.

3.2. THE RELATION BETWEEN CLASS-WISE ROBUSTNESS AND NORM OF CLASSIFIER WEIGHT

Results Analysis. From Figure 4, we find that for most classes, robustness is positively correlated with the norm of the classifier weight, i.e., higher (lower) robustness corresponds to a higher (lower) norm of classifier weight. For example, the per-class robustness decreases as the norm of the classifier weight decreases across classes in CIFAR-100, as shown in Figure 4(c). We also check this kind of correlation in standard training, but the experimental results show no significant correlation between the accuracy of each class and the corresponding norm of its classifier weight. The main reason might be that the datasets in standard training are sufficient for most classes to be well trained, while adversarial training always requires abundant data (Schmidt et al., 2018); hence the insufficient adversarial data cannot guarantee that the classifier is well trained, leading to the above observations.
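One way to make this "positively correlated" claim quantitative is a rank correlation between the per-class robustness and the per-class weight norms. Below is a minimal numpy sketch of the Spearman rank correlation (our own analysis aid, not code from the paper; it assumes no ties for simplicity):

```python
import numpy as np

def spearman_corr(a, b):
    """Spearman rank correlation: the Pearson correlation of the rank vectors.

    Applied to (per-class robustness, per-class classifier weight norm), a
    value near +1 indicates the nearly monotonic relation described above.
    Assumes no tied values, so ranks are simply double-argsort positions.
    """
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))
```

For instance, feeding in the ten CIFAR-10 per-class robustness values and the ten l2-norms of the classifier weights would summarize Figure 4 in a single number.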

Conclusion and Suggestion.

The robustness of each class is nearly monotonically related to the norm of its classifier weight. Inspired by this property, we believe that balancing the norm of the classifier weight of each class is a possible way to alleviate the robustness discrepancies among classes, thereby improving overall model robustness.

3.3. THE CLASS-WISE ROBUSTNESS UNDER DIFFERENT ATTACKS

The above analysis mainly concentrates on the performance under the PGD attack. In this section, we investigate the class-wise robustness of state-of-the-art robust models against various popular attacks on the CIFAR-10 dataset. The defense methods we choose include Madry training (Madry et al., 2018), TRADES (Zhang et al., 2019), MART (Wang et al., 2019b) and RST (Carmon et al., 2019). We re-train WideResNet-32-10 (Zagoruyko & Komodakis, 2016) following Madry et al. (2018); for the other defense methods we directly use the models released by the authors. White-box attacks include FGSM (Goodfellow et al., 2015), PGD (Madry et al., 2018) and CW∞ (Carlini & Wagner, 2017), where the implementation of CW∞ follows Carmon et al. (2019). Black-box attacks include a transfer-based and a query-based attack (Dong et al., 2020): the former uses a standardly trained WideResNet-32-10 as the substitute model to craft adversarial examples, and the latter uses NAttack (Li et al., 2019). See Appendix E for the hyperparameters of all attacks. Results Analysis. As shown in Table 2, for all models and attacks there are remarkable robustness gaps between different classes, which further verifies our discovery in Figure 1. We then compare different attacks from a class-wise perspective. Interestingly, stronger attacks in white-box settings are usually more effective on vulnerable classes, e.g., comparing FGSM and PGD, the robustness reduction of vulnerable classes (e.g., class 3) is obviously larger than that of robust classes (e.g., class 1). In black-box settings, the main advantage of the query-based attack over the transfer-based attack is also concentrated in vulnerable classes. Additionally, class 1 and class 3 are always the most robust and the most vulnerable class in all settings, which suggests that the relative robustness of each class may have a strong correlation with the dataset itself.

Conclusion and Suggestion.

Unbalanced robustness is commonly observed when state-of-the-art models defend against popular attacks, and stronger attacks are usually more powerful on vulnerable classes, e.g., class 3. We hope that future attack or defense works will report class-wise robustness results to better characterize the proposed methods.

4. TEMPERATURE-PGD ATTACK

Some previous works have improved significantly on the vulnerable classes, such as MART (Wang et al., 2019b) in Table 2 and HE (Pang et al., 2020) in Table 5. MART assigns higher weights to misclassified examples (a large fraction of which come from vulnerable classes) in its regularization term. HE uses the re-scaled coefficient (s in their paper) to make misclassified examples provide a larger gradient for the model (more details can be found in Appendix G). Since misclassified examples receive higher weights during training, the classification boundary of each class becomes more complex, and the variance of the predicted probabilities of an example x over the classes i (i ∈ C) becomes smaller, i.e., the output distribution becomes smoother, as shown in Figure 5(b)-5(d) (MART). The vanilla PGD attacker may not be able to find the most effective direction in such a smooth probability distribution. Figure 5(a) and Figure 5(e) report the mean of the variance of the output probability distribution for each class. Specifically, the calculation for class 0 in Madry's model is as follows: first, we calculate the variance of the prediction distribution of each example with ground-truth class 0 in CIFAR-10, and then average these variances. This value reflects the smoothness of the output distribution of each class. Combining the information from Table 2, Figure 5 and Table 5, the stronger model has a lower variance than the weaker model, and the vulnerable class has a lower variance than the robust class (HE-LIS and HE-CMP are our improved methods based on HE, which boost the robustness of vulnerable classes under the vanilla PGD attack; see Appendix G for details). In order to find an effective direction in the extremely smooth distribution, we propose to use a temperature factor to change the smooth probability distribution, so as to create virtual power in the possible adversarial direction.
For a better formulation, we assume that the DNN is f, the input sample is x, and the number of classes in the dataset is C. Then the softmax probability of sample x corresponding to class i (i ∈ C) is

S(f(x))_i = exp(f(x)_i / T) / Σ_{k=1}^{C} exp(f(x)_k / T).    (1)

Using this modified softmax function, the adversarial perturbation crafted at the t-th step is

δ_{t+1} = Π_ε(δ_t + α · sign(∇ L_CE(S(f(x + δ_t)), y))),    (2)

where Π_ε is the projection operation, which ensures that δ stays in the ε-ball, L_CE is the cross-entropy loss, α is the step size, and y is the ground-truth class. Figure 5(f)-5(h) gives a good example of how Temperature-PGD works. This method allows us to arbitrarily control the absolute value of the gradient, and the idea of creating virtual power may also be used in targeted attacks (Gowal et al., 2019), i.e., manually re-scaled gradients may reach the adversarial point faster in a limited number of iterations. Table 3 compares model robustness under the Temperature-PGD attack with the vanilla PGD results in Table 2 (entries marked "-"/"+" denote robustness reduction/improvement relative to the corresponding PGD entry of Table 2; e.g., MART drops by 3.9% overall to 54.4%, and RST by 2.0% to 60.9%). We find this method reduces the overall robustness of the state-of-the-art models by 1.8%-3.9%, and the robustness of the vulnerable class (i.e., class 4) can be reduced by 4.2%-10.2%, which is consistent with our previous findings. Since the output of Madry's model is relatively more certain (Figure 5(b)-5(d)), the effect of Temperature-PGD on it is not obvious. See Appendix F for more ablation studies.
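To make Eq. (1)-(2) concrete, the following is a minimal numpy sketch of Temperature-PGD on a toy linear classifier. For a linear model, the gradient of the temperature-scaled cross-entropy w.r.t. the input is (1/T) · W^T(p_T − onehot(y)), where p_T is the softmax of Eq. (1), so the sharpened distribution directly re-scales the attack direction. Function names, the linear stand-in, and the default 1/T value are ours.

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Softmax with temperature T (Eq. 1): smaller T (larger 1/T) sharpens
    an otherwise smooth output distribution."""
    z = z / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def temperature_pgd(x, y, W, b, inv_T=10.0, eps=0.031, alpha=0.007, steps=20, seed=0):
    """Temperature-PGD (Eq. 2) on a toy linear classifier f(x) = Wx + b.

    Identical to vanilla PGD except that the cross-entropy gradient is taken
    through the temperature-scaled softmax, re-scaling ('creating virtual
    power in') the gradient of near-uniform output distributions.
    """
    rng = np.random.default_rng(seed)
    T = 1.0 / inv_T
    delta = rng.uniform(-eps, eps, size=x.shape)   # random start
    for _ in range(steps):
        p = softmax_T(W @ (x + delta) + b, T=T)
        p[y] -= 1.0                                # p_T - onehot(y)
        grad = (W.T @ p) / T                       # (1/T) * W^T (p_T - onehot)
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)  # projection
    return x + delta
```

Setting inv_T = 1 recovers vanilla PGD; larger values implement the ablations of Appendix F.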
In general, Temperature-PGD is a powerful tool for evaluating defenses that explicitly or implicitly use instance-wise information to improve model robustness. More importantly, it gives researchers a new perspective on how variants of Madry's model work. We speculate that the robustness improvement of many current state-of-the-art models may be due to this phenomenon. For example, label-smoothing-based defenses (Goibert & Dohmatob, 2019; Cheng et al., 2020) may not be able to defend against the Temperature-PGD attack, since these methods explicitly flatten the distribution of predicted probabilities.

5. CONCLUSION

In this paper, we conduct a class-wise investigation of adversarial training based on the observation that there are recognizable robustness gaps between classes, and reveal three intriguing properties among classes in the robust model: 1) Group-based relations between classes are commonly observed, and many adversarial examples satisfy the homing property. 2) The robustness of each class is positively correlated with the norm of its classifier weight. 3) Stronger attacks are usually more effective on vulnerable classes, and we propose an attack to better understand the defense mechanism of some state-of-the-art models from the class-wise perspective. Based on these properties, we propose three promising guidelines for future improvements in model robustness: 1) Deal with different classes by groups. 2) Balance the norm of the classifier weight corresponding to each class. 3) Pay more attention to classes suffering from poor robustness. We believe these findings can promote the progress of adversarial training and help to build state-of-the-art robust models.

A THE NUMBER OF CLASSES IN EACH ROBUSTNESS INTERVAL OF CIFAR-100 AND IMAGENET

Figure 6: Number of classes per robustness interval on adversarial test images: (a) CIFAR-100 (75th epoch); (b) ImageNet (14th epoch). Each panel plots the number of classes (y-axis) falling into each robustness interval (x-axis).

Due to the large number of classes in CIFAR-100 and ImageNet, we randomly sample 12 classes for analysis in the main paper. For the sake of experimental completeness, the number of classes in different robustness intervals is shown in Figure 6. Clearly, the robustness of the classes is spread over multiple intervals, which is consistent with the results shown in Figure 1.
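For reference, the per-interval counts of Figure 6 can be reproduced from the per-class robustness values with a small helper. This is a numpy sketch using our own bin convention (first bin closed at 0, half-open (a, b] elsewhere); the function name is ours:

```python
import numpy as np

def classes_per_interval(class_robustness, n_bins=10):
    """Histogram of per-class robustness in [0, 1] (cf. Figure 6): counts how
    many classes fall into each of n_bins equal-width robustness intervals."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # searchsorted with side='left' assigns a value equal to an edge to the
    # bin ending at that edge, i.e. intervals are (a, b]; 0 goes to bin 0.
    idx = np.searchsorted(edges, class_robustness, side="left") - 1
    idx = np.clip(idx, 0, n_bins - 1)
    return np.bincount(idx, minlength=n_bins)
```

Feeding in the 100 CIFAR-100 per-class robustness values would yield the counts plotted in Figure 6(a).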

B INTRODUCTION TO THE DATASETS USED IN THE EXPERIMENT

A variety of datasets are used for research on adversarial training. Here, we introduce in detail the six datasets used in our experiments. MNIST. MNIST (LeCun et al., 1998) is a handwritten digit dataset containing the digits 0 to 9. The dataset consists of 60,000 training images and 10,000 test images, with 6,000 and 1,000 images per digit, respectively. All images have a fixed size (28×28 pixels) with pixel values in [0, 1], and the digits are centered in the image. This dataset is widely used in adversarial training (Madry et al., 2018; Zhang et al., 2019; Wang et al., 2019b; Carmon et al., 2019). CIFAR-10 & CIFAR-100. CIFAR-10 & CIFAR-100 (Krizhevsky et al., 2009) are labeled subsets of the 80 Million Tiny Images dataset (Torralba et al., 2008). CIFAR-10 consists of 50,000 training images and 10,000 test images in 10 classes, with 5,000 and 1,000 images per class, respectively. CIFAR-100 has the same total number of images as CIFAR-10, but it has 100 classes; thus CIFAR-100 has only 500 training images and 100 test images per class. All images in these two datasets are 32×32 three-channel color images. As mentioned in Section 3, CIFAR-10 can be grouped into 2 superclasses and CIFAR-100 into 20 superclasses. CIFAR-10 is the most popular dataset for adversarial training (Dong et al., 2020), and almost all proposed methods are evaluated on it. CIFAR-100 is more challenging than CIFAR-10; Shafahi et al. (2019) and Song et al. (2019) evaluate their defense methods on CIFAR-100. SVHN. SVHN (Netzer et al., 2011) is similar in flavor to MNIST, and both contain 10 digits. SVHN contains 73,257 labeled digits for training, 26,032 labeled digits for testing and over 600,000 unlabeled digit images for semi-supervised or unsupervised training. To maintain an MNIST-like style, SVHN crops the images to a size of 32×32; as a result, many of the images contain some distractors at the sides.
Because SVHN is collected from house numbers in the real world, its data distribution is more complicated than that of MNIST.

F ABLATION EXPERIMENT OF TEMPERATURE-PGD ATTACK

The data in Table 4 is similar to Table 3, except that 1/T differs. Combined with the results of Table 3, we find that the robustness of vulnerable classes (e.g., class 4) in TRADES, MART and RST decreases significantly under all 1/T settings. When 1/T is set to 2, the overall robustness of the model trained by Madry et al. (2018) is likewise reduced, by 0.26%, with the most significant decrease of 2.8% in class 4. Furthermore, the decline in overall robustness of Madry's model is indeed smaller than that of the other models. One possible explanation is that these improved robust models may obfuscate gradients (Athalye et al., 2018) in vulnerable classes; a theoretical analysis is left for the future. According to the above analysis, and borrowing the ideas of curriculum learning (Bengio et al., 2009), we believe the gradient should be gradually enlarged (by increasing 1/T) during adversarial training, instead of using a fixed 1/T as in Pang et al. (2020). Specifically, we propose the following two schedules to adjust 1/T. Linear Interpolation Schedule (LIS). We use a simple linear interpolation schedule to adjust 1/T, so the 1/T of the n-th epoch is

1/T_n = 1/T_0 + (n / n_FI) · (1/T_FI − 1/T_0).

In our implementation, the initial temperature factor is 1/T_0 = 1 and the final temperature factor of the interpolation is 1/T_FI = 75, where the subscript FI is short for final interpolation. We use the above methods to train ResNet-18 on CIFAR-10 and choose the vanilla PGD20 attack to evaluate model robustness. For a fair comparison, we also re-train ResNet-18 following Madry et al. (2018) and Pang et al. (2020). As shown in Table 5, our methods HE-LIS and HE-CMP further boost the model robustness. In particular, the improvement on vulnerable classes (e.g., class 3 and class 4) is impressive. This is an exciting result because the robustness gaps between classes are largely reduced.
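The LIS schedule above is a one-line function; a sketch in Python, with the constants from our implementation as defaults:

```python
def inverse_temperature(n, inv_T0=1.0, inv_TFI=75.0, n_FI=100):
    """Linear Interpolation Schedule (LIS):
    1/T_n = 1/T_0 + (n / n_FI) * (1/T_FI - 1/T_0),
    linearly ramping the temperature factor from 1/T_0 at epoch 0 up to
    1/T_FI at the final interpolation epoch n_FI."""
    return inv_T0 + (n / n_FI) * (inv_TFI - inv_T0)
```

At each epoch n of training, the returned value would be plugged into the temperature-scaled softmax before computing the loss.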



As introduced in Table 1, classes 0, 1, 8 and 9 belong to the superclass Transportation and classes 2, 3, 4, 5, 6 and 7 belong to the superclass Animals in CIFAR-10. Similarly, classes 0, 2, 8 and 9 belong to Transportation and classes 1, 3, 4, 5, 6 and 7 belong to Animals in STL-10.




Figure 1: Class-wise robustness at different epochs in test set

Figure 2: Confusion matrix of robustness in test set

Figure 3: Misclassified and Homing Confusion matrix of CIFAR-10

Figure 4: The correlation between the robustness of each class and its norm of classifier weight

Figure 5: Analysis of class-wise predicted probability distribution: (a) Class-wise variance of predicted probability of Madry, TRADES and MART. (b)-(d) The output probability distribution (Madry: 1/T=1 and MART: 1/T=1) change of image 127 (ground truth class 3) in the process of generating adversarial examples (iteration step 1, 10 and 20). (e) Class-wise variance of predicted probability of HE, HE-LIS and HE-CMP. (f)-(h) The output probability distribution (HE: 1/T=1 and HE: 1/T=10) change of image 46 (ground truth class 3) in the process of generating adversarial examples (iteration step 1, 10 and 20).

Figure 7: Class-wise accuracy in standard training and class-wise robustness in adversarial training

Figure 7 shows the class-wise accuracy in standard training and the class-wise robustness in adversarial training. The slashed part in each sub-figure represents the largest gap in accuracy/robustness among different classes, and the classes in brackets are those with the highest and lowest accuracy/robustness. Note that natural test images are evaluated for standard training, while adversarial test images are used for adversarial training. Results Analysis. As illustrated in Figure 7, the relative order of accuracy/robustness across classes is almost the same in standard training and adversarial training, but the class-wise gap is enlarged in adversarial training. For example, on CIFAR-10 and SVHN, the largest class-wise accuracy gaps in standard training are 10.7% (Figure 7(b)) and 5.4% (Figure 7(d)), but these gaps are enlarged to 52.8% (Figure 7(b)) and 37.0% (Figure 7(d)) in adversarial training. On more complex datasets, such as STL-10, CIFAR-100 and ImageNet, standard models also have imbalanced accuracy between classes, but the gaps in adversarial training are still larger. Conclusion. The performance discrepancies among classes in the robust model are larger than those in the standard model, e.g., the largest gap on CIFAR-10 is enlarged from 10.7% in the standard model to 52.8% in the robust model.

Figure 9: The relation between the probability and gradient for an adversarial example

The final epoch of the interpolation is equal to the total training epoch, n_FI = n_tot = 100. The training pipeline follows Pang et al. (2020), but we do not use angular margins. Other hyperparameters are the same as in Section 3. Control Maximum Probability (CMP). At each epoch, we can accurately calculate the required 1/T according to Equation (3) to control the maximum probability P_max of all examples at the n-th epoch. Since Equation (3) is a nonlinear function, Powell's dogleg method (Powell, 1970) is used to solve it. The final epoch of the interpolation is n_FI = 90, while the total training epoch is n_tot = 100. For the last 10 epochs, the maximum probability is always controlled at 1. Other settings are the same as for the Linear Interpolation Schedule.
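The CMP step of solving a nonlinear equation for the scale can be illustrated with a toy root-finder. The sketch below finds the scale s (i.e., 1/T) at which the maximum softmax probability of a given logit vector hits a target value; it uses plain bisection in place of Powell's dogleg method, purely for illustration, and all names are ours:

```python
import math

def max_prob(s, logits):
    """Maximum softmax probability when the logits are scaled by s (= 1/T)."""
    m = max(logits)
    e = [math.exp(s * (z - m)) for z in logits]
    return max(e) / sum(e)

def solve_scale(logits, target, lo=1e-6, hi=1e4, iters=200):
    """Bisection for the scale s such that max_prob(s, logits) == target.

    For distinct logits, max_prob is monotonically increasing in s (from 1/C
    toward 1), so bisection suffices here; the paper instead applies Powell's
    dogleg method to the corresponding equation for the full model.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if max_prob(mid, logits) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The target would follow the CMP schedule, reaching a maximum probability of 1 (in practice, a value arbitrarily close to 1) for the last 10 epochs.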

For a better demonstration, we assume that the feature dimension of the penultimate layer is d and the total number of classes is C. Then the parameters of the last classifier can be represented by a classifier weight W = {w_i} ∈ R^{d×C} and a classifier bias b = {b_i} ∈ R^{1×C}, where w_i ∈ R^d is the classifier weight corresponding to class i. Similar to Kang et al. (2019), we calculate the l2-norm ||w_i||_2 of the classifier weight corresponding to each class i (i ∈ C). To clearly show the relation between the robustness of a class i and its corresponding classifier weight w_i, we normalize its robustness and the l2-norm of w_i respectively, and report the results in Figure 4.
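The per-class norms and their normalization can be sketched in a few lines of numpy. The min-max rescaling to [0, 1] is our assumption for the unspecified normalization (it requires at least two distinct norms); the function name is ours:

```python
import numpy as np

def normalized_weight_norms(W):
    """Given the final-layer weight W of shape (d, C), return the l2-norm
    ||w_i||_2 of each class's classifier weight (one per column), min-max
    rescaled to [0, 1] as assumed for Figure 4. Requires non-identical norms."""
    norms = np.linalg.norm(W, axis=0)
    return (norms - norms.min()) / (norms.max() - norms.min())
```

In a framework like PyTorch, W would be the transposed weight of the final fully-connected layer.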

Adversarial robustness (%) (under popular attacks) on CIFAR-10.

Relative robustness (%) between Temperature-PGD20 and vanilla PGD20.

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. In ICML, 2019.

Tianyuan Zhang and Zhanxing Zhu. Interpreting adversarially trained convolutional neural networks. In ICML, pp. 7502-7511, 2019.

Uesato et al. (2019) and Zhai et al. (2019) further show that such unlabeled data can be leveraged to improve adversarial robustness. ImageNet. ImageNet (Deng et al., 2009) is a high-resolution image dataset with 1000 classes. It contains 1,281,167 training images, 50,000 validation images and 100,000 test images. Since the test set has no labels, the validation set is often used to evaluate model robustness. Training a robust model on ImageNet is intolerably slow, so using this dataset to evaluate model robustness usually requires some accelerated training techniques (Shafahi et al., 2019; Wong et al., 2019).

G ANALYSIS OF THE HYPERSPHERE EMBEDDING (HE) DEFENSE

Pang et al. (2020) combine feature normalization (FN) (Ranjan et al., 2017), weight normalization (WN) (Guo & Zhang, 2017) and angular margins (AM) (Liu et al., 2016) to propose a defense method that can boost model robustness. In their paper, they argue that using FN and WN to constrain embeddings on the hypersphere is the key to the robustness improvement, so they call their method hypersphere embedding (HE). However, we find that the factor for scaling WN (the temperature factor in our paper) is the real key to improving robustness. We try to understand this from a point of view they overlooked. Our analysis mainly focuses on FN and WN, following Pang et al. (2020). For a better formulation, we assume the input sample is x, the extracted feature of an example in the penultimate layer is z ∈ R^d, the number of classes in the dataset is C, the scale factor is s (corresponding to the 1/T in our paper), and the parameter of the last classifier is W = {w_i} ∈ R^{d×C}, where w_i ∈ R^d is the classifier weight corresponding to class i. The FN operation is z̃ = z / ||z||_2, the WN operation is w̃_i = w_i / ||w_i||_2, and the probability P_i of sample x corresponding to class i (i ∈ C) after the softmax function S is

P_i = exp(s · w̃_i^T z̃) / Σ_{k=1}^{C} exp(s · w̃_k^T z̃).    (3)

The gradient of the cross-entropy loss L_CE with respect to w̃_i is

∇_{w̃_i} L_CE = s (P_i − 1[i = y]) z̃,    (4)

where y is the ground-truth class. Equations (3) and (4) suggest that we can change P_i by adjusting s (equivalently, T) and finally control ∇_{w̃_i} L_CE: e.g., if the maximum probability of an example corresponds to class l, then enlarging s pushes P_l toward 1 and sharpens the whole distribution. For a better demonstration, suppose that there is a three-class classification task (Figure 9).
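The FN/WN forward pass and the gradient in Eq. (4) can be checked numerically with a short numpy sketch (all names ours; the gradient is taken w.r.t. the normalized weight w̃_i, treating it as a free variable as in the analysis above):

```python
import numpy as np

def he_probs(z, W, s):
    """HE-style softmax (Eq. 3): feature normalization, weight normalization,
    then a scale factor s (the 1/T of this paper) on the cosine logits."""
    z_t = z / np.linalg.norm(z)                       # FN: z~ = z / ||z||_2
    W_t = W / np.linalg.norm(W, axis=0, keepdims=True)  # WN per column
    logits = s * (W_t.T @ z_t)                        # s * cos(theta_i)
    e = np.exp(logits - logits.max())
    return e / e.sum(), z_t, W_t

def he_grad_wrt_wtilde(z, W, s, y):
    """Gradient of L_CE w.r.t. the normalized weights (Eq. 4):
    column i equals s * (P_i - 1[i = y]) * z~."""
    p, z_t, _ = he_probs(z, W, s)
    onehot = np.eye(W.shape[1])[y]
    return s * np.outer(z_t, p - onehot)              # shape (d, C)
```

Because the gradient scales linearly with s, a larger scale factor directly enlarges the weight updates driven by misclassified examples, which is the mechanism discussed above.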

