ELIMINATING CATASTROPHIC OVERFITTING VIA ABNORMAL ADVERSARIAL EXAMPLES REGULARIZATION

Abstract

Single-step adversarial training (SSAT) has been shown to defend against iterative-step adversarial attacks while achieving both efficiency and robustness. However, SSAT suffers from catastrophic overfitting (CO) under strong adversaries: the classifier's decision boundaries become highly distorted, and robust accuracy against iterative-step adversarial attacks suddenly drops from its peak to nearly 0% within a few epochs. In this work, we find that some adversarial examples generated on a network trained by SSAT exhibit anomalous behaviour: although the training data are generated by the inner maximization process, the loss of some adversarial examples decreases instead. We call these abnormal adversarial examples. Furthermore, optimizing the network on these abnormal adversarial examples further accelerates the distortion of the decision boundaries, and correspondingly, the number of abnormal adversarial examples increases sharply with CO. These observations motivate us to eliminate CO by hindering the generation of abnormal adversarial examples. Specifically, we design a novel method, Abnormal Adversarial Examples Regularization (AAER), which explicitly regularizes the number and the output variation of abnormal adversarial examples to hinder the model from generating them. Extensive experiments demonstrate that our method can eliminate CO and further boost adversarial robustness under strong adversaries.

1. INTRODUCTION

In recent years, Deep Neural Networks (DNNs) have performed impressively in various fields, such as autonomous driving (Litman, 2017), face recognition (Sharif et al., 2016) and medical imaging diagnosis (Buch et al., 2018). However, DNNs were found to be vulnerable to adversarial examples (Szegedy et al., 2013). Although these adversarial examples are imperceptible to the human eye, they can lead to completely different predictions from DNNs. To this end, many adversarial defense methods have been proposed, such as verification and provable defenses (Katz et al., 2017), preprocessing techniques (Guo et al., 2017), detection algorithms (Metzen et al., 2017) and adversarial training (AT) (Goodfellow et al., 2014). Among them, AT is considered one of the most effective methods against adversarial attacks (Athalye et al., 2018). However, standard iterative-step AT significantly increases the computational overhead due to multiple forward and backward propagation steps. Therefore, some works attempt to improve vanilla single-step adversarial training (SSAT) to defend against iterative-step adversarial attacks while maintaining both efficiency and robustness. Unfortunately, a serious problem - catastrophic overfitting (CO) - occurs with stronger adversaries (Wong et al., 2020). This strange phenomenon means that the robust accuracy of the model against iterative-step adversarial attacks suddenly drops from its peak to nearly zero within a few epochs, as shown in Figure 1. This intriguing phenomenon has been widely investigated and has led to many works attempting to resolve CO. Recently, Kim et al. (2021) pointed out that networks in which CO occurs are generally accompanied by highly distorted decision boundaries. However, the interaction between distorted decision boundaries and CO remains unclear. In this work, we delve into the dynamic interaction between CO and decision boundary distortion.
Specifically, we find that some adversarial examples generated on a network with distorted decision boundaries exhibit anomalous behavior: although all training samples are generated by the inner maximization process, the loss of some adversarial examples decreases instead. We refer to these training samples as abnormal adversarial examples. To make matters worse, the decision boundary distortion is further exacerbated by directly optimizing the classifier on these abnormal adversarial examples, and the number of abnormal adversarial examples increases sharply as a result, which leads to a vicious circle between the number of abnormal adversarial examples and the decision boundary distortion. All these atypical findings raise a question:

Can CO be prevented by hindering the generation of abnormal adversarial examples?

To answer the above question, we design a novel method, Abnormal Adversarial Examples Regularization (AAER), which incorporates a regularizer that prevents CO by suppressing generated abnormal adversarial examples. Specifically, AAER consists of two key components: (i) the number and (ii) the output variation of abnormal adversarial examples. The first component (i) counts the samples by dividing the training batch into groups of normal and abnormal adversarial examples according to the anomalous loss-decrease behavior. The second component (ii) covers the prediction confidence and logits variation, and calculates the differences in these two variations between the two groups of samples via cross-entropy and Euclidean distance, respectively. AAER then explicitly regularizes the number and output variation of abnormal adversarial examples through these two components to hinder the model from generating them. Extensive experiments show that our method eliminates CO and further improves adversarial robustness. It is worth noting that our method does not involve any extra generation or backward propagation process, which gives it a substantial advantage in computational overhead. Our major contributions are summarized as follows:
• We find that some training samples exhibit anomalous loss variation during the inner maximization process. Besides, the number of abnormal adversarial examples increases sharply with CO, and the decision boundary distortion is further exacerbated by directly optimizing the classifier on these abnormal adversarial examples.
• Based on the observed effect, we propose a novel method - Abnormal Adversarial Examples Regularization (AAER) - which explicitly regularizes the number of abnormal adversarial examples and their anomalous output variation to hinder the generation of abnormal adversarial examples. Extensive experiments demonstrate that our method can prevent CO and automatically adapt to different noise magnitudes without hyperparameter tuning.
• We evaluate the effectiveness of our method across different adversarial budgets, adversarial attacks, datasets and network architectures, showing that our proposed method consistently achieves state-of-the-art robust accuracy in SSAT and can obtain comparable robustness to standard iterative-step AT with only negligible computational overhead.

2. RELATED WORK

2.1. ADVERSARIAL TRAINING

Adversarial training has been demonstrated to be one of the most effective methods for defending against adversarial attacks (Athalye et al., 2018). AT is generally formulated as a min-max optimization problem (Madry et al., 2017): the inner maximization tries to generate the strongest adversarial examples to maximize the loss, while the outer minimization optimizes the network to minimize the loss on those adversarial examples:

min_θ E_{(x,y)∼D} [ max_{δ∈∆} ℓ(x + δ, y; θ) ],    (1)

where (x, y) is a training sample from the distribution D, ℓ(x, y; θ) is the loss function parameterized by θ, and δ is the perturbation confined within the budget ϵ under an L_p-norm distance: ∆ = {δ : ∥δ∥_p ≤ ϵ}. The common threat models are L_1, L_2 and L_∞; in this work we choose L_∞ as our threat model. Since the inner maximization is an NP-hard problem, AT uses simple gradient ascent to generate perturbations as a local approximate solution.

Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014) is a single-step adversarial attack that uses the sign of the gradient to find the perturbation, as shown in Eq. (2):

δ_FGSM = ϵ · sign(∇_x ℓ(x, y; θ)).    (2)

Fast adversarial training (RS-FGSM) (Wong et al., 2020) adds uniform random initialization η before generating the perturbation and uses the larger step size α = 1.25 · ϵ:

η ∼ Uniform(−ϵ, ϵ),  δ_RS-FGSM = α · sign(∇_{x+η} ℓ(x + η, y; θ)).    (3)

Iterative Fast Gradient Sign Method (I-FGSM) (Kurakin et al., 2018) is an iterative-step version of FGSM that uses multiple gradient steps to find stronger perturbations. With a smaller step size α = ϵ/T and T iterations, I-FGSM can be formulated as:

δ^t_{I-FGSM} = δ^{t−1} + α · sign(∇_{x+δ^{t−1}} ℓ(x + δ^{t−1}, y; θ)).    (4)

Projected Gradient Descent (PGD) (Madry et al., 2017) adds uniform random initialization on top of I-FGSM.
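As a concrete reference for the attacks above, here is a minimal PyTorch sketch (ours, not the original authors' code; clipping of x + δ to the valid pixel range is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    # Single-step attack: one signed-gradient step of size eps (Eq. 2).
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return eps * grad.sign()

def rs_fgsm(model, x, y, eps, alpha=None):
    # RS-FGSM (Wong et al., 2020): uniform random start, step alpha = 1.25*eps (Eq. 3),
    # with the final perturbation projected back onto the L_inf ball.
    alpha = 1.25 * eps if alpha is None else alpha
    eta = torch.empty_like(x).uniform_(-eps, eps)
    x_adv = (x + eta).clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    return (eta + alpha * grad.sign()).clamp(-eps, eps)

def pgd(model, x, y, eps, steps, alpha=None):
    # I-FGSM (Eq. 4) with a uniform random start = PGD (Madry et al., 2017).
    alpha = eps / steps if alpha is None else alpha
    delta = torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv = (x + delta).clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
    return delta.detach()
```

All three return the perturbation δ rather than the adversarial example, so the caller applies x + δ and any pixel-range clipping.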

2.2. CATASTROPHIC OVERFITTING

Since Wong et al. (2020) discovered the CO phenomenon, a line of work has tried to explore and mitigate this problem. Vivek & Babu (2020b) empirically showed that adding a dropout layer after all non-linear layers can avoid early overfitting to FGSM. de Jorge et al. (2022) found that augmenting the perturbations by increasing the noise initialization magnitude and removing the perturbation clipping can eliminate CO. Li et al. (2022) prevented CO by constraining training samples to a carefully extracted subspace to avoid abrupt gradient growth. Other works attempt to prevent CO by strengthening the inner maximization process. Kim et al. (2021) assumed that CO is caused by the fixed FGSM perturbation magnitude and reduced the step size for misclassified adversarial examples. Golgooni et al. (2021) argued that small gradients play a key role in CO and ignored small gradient information to avoid huge weight updates. Huang et al. (2022) discovered that fitting instances with a larger gradient norm is more likely to cause CO and learned an instance-adaptive step size inversely proportional to the gradient norm. Park & Lee (2021) leveraged the gradients of the latent representation as a latent adversarial perturbation to compensate for local linearity. Similar to our work, some methods add a regularization term to the loss to explicitly prevent CO. Andriushchenko & Flammarion (2020) found that PGD and FGSM perturbations become orthogonal when CO occurs, and hence proposed a regularization term to encourage gradient alignment. Vivek & Babu (2020a) proposed a regularization term that mitigates CO by harnessing properties that differentiate a robust model from a pseudo-robust one. Sriramanan et al. (2021) introduced a relaxation term that smooths the loss surface to find more suitable gradient directions for the attack. Chen et al. (2021) demonstrated that negative high-order terms lead to a perturbation-loss distortion phenomenon that causes CO, and proposed a regularization term to flatten the loss surface.

3. PROPOSED APPROACH

In this section, we first define abnormal adversarial examples and show how their number changes during CO (Section 3.1). We further analyze the output variation of normal and abnormal adversarial examples and find that they exhibit significantly different magnitudes of output variation after CO (Section 3.2). Based on our observations, we propose a novel regularization term, AAER, which uses the number and output variation of abnormal adversarial examples to explicitly suppress the generation of these anomalous training samples and thereby eliminate CO (Section 3.3).

3.1. DEFINING AND COUNTING ABNORMAL ADVERSARIAL EXAMPLES

Adversarial training employs the most adversarial data to reduce the sensitivity of the network's output w.r.t. adversarial perturbations of the natural data. Therefore, we expect the inner maximization process to generate effective adversarial examples that maximize the classification loss. However, Kim et al. (2021) showed that the decision boundaries of the classifier become highly distorted when CO occurs. After adding an adversarial perturbation generated on such a distorted classifier, the classification loss of some training samples is atypically reduced. As shown in Figure 2, some samples (blue) mislead the model or move closer to the decision boundary after the inner maximization process, while some other samples (red) move farther from the decision boundary after adding the perturbation generated by the distorted classifier; we call the latter abnormal adversarial examples. These abnormal adversarial examples generally fail to mislead the classifier. Thus, we define abnormal adversarial examples by the following formula:

δ = α · sign(∇_{x+η} ℓ(x + η, y; θ)),  δ is abnormal iff ℓ(x + η, y; θ) > ℓ(x + η + δ, y; θ).    (5)

As shown in Figure 3 (left), before CO occurs the number of abnormal adversarial examples is small, whereas during CO it increases sharply and stays at a very large number. Given this observation, we infer that there is a close correlation between the number of abnormal adversarial examples and the CO phenomenon, which prompts us to ask (Q1): can CO be mitigated by reducing the number of abnormal adversarial examples?
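The criterion of Eq. (5) can be sketched as a boolean mask over a mini-batch; `abnormal_mask` is a hypothetical helper name of ours:

```python
import torch
import torch.nn.functional as F

def abnormal_mask(model, x, y, eps, alpha):
    """Flag abnormal adversarial examples (Eq. 5): samples whose per-sample
    loss *decreases* after the single-step adversarial perturbation."""
    eta = torch.empty_like(x).uniform_(-eps, eps)          # random init
    x_init = (x + eta).clone().detach().requires_grad_(True)
    loss_init = F.cross_entropy(model(x_init), y, reduction="none")
    grad = torch.autograd.grad(loss_init.sum(), x_init)[0]
    delta = alpha * grad.sign()                            # single-step ascent
    with torch.no_grad():
        loss_adv = F.cross_entropy(model(x + eta + delta), y, reduction="none")
    return loss_init.detach() > loss_adv                   # True -> abnormal
```

The `reduction="none"` argument keeps per-sample losses so that the comparison is done example-by-example rather than on the batch mean.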

3.2. OUTPUT VARIATION OF NORMAL AND ABNORMAL ADVERSARIAL EXAMPLES

The above observations indicate that CO and the number of abnormal adversarial examples are closely related. In this part, we further analyze the output variation of normal and abnormal adversarial examples during CO. Specifically, we divide the outputs into two categories, prediction confidence and logits, and use the cross-entropy loss to calculate the prediction confidence variation during the inner maximization process:

ℓ(x + η + δ, y; θ) − ℓ(x + η, y; θ).    (6)

From Figure 3 (middle), we can observe that the prediction confidence variation of normal adversarial examples is greater than 0, indicating that the perturbation leads to worse predictions. However, the variation of abnormal adversarial examples is atypically negative, meaning that the perturbation has the opposite effect to what we expect. Furthermore, we analyze the prediction confidence variation of abnormal adversarial examples during training. Before the occurrence of CO, their prediction confidence variation is slightly below zero, and the negative impact on all training samples (blue line) is not significant. When CO occurs, their prediction confidence variation drops rapidly, slumping 27-fold at the 10th epoch, and detrimentally affects the prediction confidence of all training samples. Furthermore, we compare the magnitude of the output change between normal and abnormal adversarial examples, using the squared Euclidean (L2) distance to calculate the logits variation during the inner maximization process:

∥f_θ(x + η + δ) − f_θ(x + η)∥²₂,    (7)

where f_θ is the DNN classifier parameterized by θ and ∥·∥²₂ is the squared L2 distance. The magnitude of the logits variation of normal and abnormal adversarial examples is shown in Figure 3 (right).
We can observe that the logits variation magnitude of abnormal adversarial examples increases dramatically during CO, becoming 16 times larger than before CO. A single gradient-ascent step can thus produce a drastic change in the output logits, which generally happens on highly distorted decision boundaries. Additionally, we observe that the logits variation magnitude of the normal adversarial examples (green line) increases one epoch later than that of the abnormal ones, which indicates that the decision boundary distortion mainly lies in the abnormal adversarial examples; in other words, directly optimizing the network on these abnormal adversarial examples further exacerbates the distortion. Moreover, we compare the magnitude of the logits variation for normal and abnormal adversarial examples. From Figure 3 (right), we can observe that the logits variation magnitudes of normal and abnormal adversarial examples are similar before CO. However, there is a significant difference between the two types of examples after CO: the logits variation magnitude of abnormal adversarial examples is 4 times that of normal ones at the 10th epoch. These significant differences in both prediction confidence and logits variation between normal and abnormal adversarial examples inspire us to ask (Q2): can CO be mitigated by constraining the output variation of abnormal adversarial examples?
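Both variation measures of Eqs. (6) and (7) can be computed from two forward passes; a minimal sketch (the helper name `output_variation` is ours):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def output_variation(model, x_init, x_adv, y):
    """Per-sample output variation during the inner maximization:
    confidence variation = loss difference (Eq. 6), and
    logits variation = squared L2 distance between logits (Eq. 7)."""
    logits_init, logits_adv = model(x_init), model(x_adv)
    conf_var = (F.cross_entropy(logits_adv, y, reduction="none")
                - F.cross_entropy(logits_init, y, reduction="none"))
    logit_var = (logits_adv - logits_init).pow(2).sum(dim=1)
    return conf_var, logit_var
```

A negative `conf_var` entry marks an abnormal adversarial example; combining this with the abnormal mask reproduces the per-group statistics plotted in Figure 3.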

3.3. ABNORMAL ADVERSARIAL EXAMPLES REGULARIZATION TERM

We answer these two questions through three optimization objectives. To answer Q1, the first part (i) uses Eq. (5) to divide the training samples into normal and abnormal adversarial examples and then penalizes the number of abnormal adversarial examples. To answer Q2, the second part (ii) and the third part (iii) constrain the output variation of abnormal adversarial examples. Specifically, the second part (ii) calculates the prediction confidence variation of abnormal adversarial examples and penalizes this variation, which should not decrease during the inner maximization process:

(1/n) Σ_{j=1}^{n} [ ℓ(x_j^Abnormal + η, y_j; θ) − ℓ(x_j^Abnormal + η + δ, y_j; θ) ],    (8)

where n is the number of abnormal adversarial examples. The third part (iii) calculates the logits variation of normal and abnormal adversarial examples. Since the logits variation only measures the magnitude of the change and is not related to the label, there is no clear target value for the optimization. Therefore, we use the logits variation of normal adversarial examples as the standard and explicitly bring the logits variation of abnormal examples closer to it. To avoid the network merely increasing the logits variation of normal adversarial examples instead of reducing that of abnormal ones, we use the max function to limit the minimum value to 0:

max( (1/n) Σ_{j=1}^{n} ∥f_θ(x_j^Abnormal + η + δ) − f_θ(x_j^Abnormal + η)∥²₂ − (1/(m−n)) Σ_{k=1}^{m−n} ∥f_θ(x_k^Normal + η + δ) − f_θ(x_k^Normal + η)∥²₂ , 0 ),    (9)

where m is the number of training samples and max(·, ·) is the max function. Based on the above analysis, we design a novel regularization term, AAER, which suppresses abnormal adversarial examples via (i) their number, (ii) their prediction confidence variation and (iii) their logits variation, ultimately preventing CO:

AAER = (n/m) · (λ1 · Eq. (8) + λ2 · Eq. (9)),    (10)

where λ1 and λ2 are hyperparameters that control the strength of the regularization term. AAER effectively hinders the generation of abnormal adversarial examples, which are highly correlated with a distorted classifier, thereby encouraging the training of smoother classifiers that better defend against adversarial attacks. Furthermore, the strength of AAER depends on the product of the number and the output variation of abnormal adversarial examples, which more comprehensively and flexibly reflects the degree of classifier distortion. The algorithm is summarized in Algorithm 1. Note that we employ an increasing step size α to stabilize the optimization objective and avoid the model training crashing in the early stages.
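Putting the three parts together, the AAER term can be sketched as follows. This is a minimal illustration of Eqs. (8)-(10) under the assumption that the logits before and after the single-step perturbation are already available; `aaer_loss` is a hypothetical name of ours:

```python
import torch
import torch.nn.functional as F

def aaer_loss(logits_init, logits_adv, y, lam1, lam2):
    """Sketch of the AAER term (Eq. 10). logits_init = f(x + eta),
    logits_adv = f(x + eta + delta)."""
    loss_init = F.cross_entropy(logits_init, y, reduction="none")
    loss_adv = F.cross_entropy(logits_adv, y, reduction="none")
    abnormal = loss_init > loss_adv                       # Eq. (5)
    if not abnormal.any():
        return logits_adv.new_zeros(())                   # no abnormal examples
    n, m = abnormal.float().sum(), float(y.numel())
    # (ii) anomalous confidence decrease of abnormal examples (Eq. 8)
    conf_pen = (loss_init[abnormal] - loss_adv[abnormal]).mean()
    # (iii) excess logits variation of abnormal vs. normal examples (Eq. 9)
    logit_var = (logits_adv - logits_init).pow(2).sum(dim=1)
    ref = (logit_var[~abnormal].mean() if (~abnormal).any()
           else logit_var.new_zeros(()))
    logit_pen = torch.clamp(logit_var[abnormal].mean() - ref, min=0.0)
    return (n / m) * (lam1 * conf_pen + lam2 * logit_pen)  # Eq. (10)
```

Because the term is weighted by n/m, it vanishes automatically when no abnormal adversarial examples are generated, so well-behaved training is left untouched.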

4. EXPERIMENT

In this section, we conduct extensive experiments to verify the effectiveness of AAER, covering the experimental settings (Section 4.1), performance evaluations (Section 4.2), ablation studies (Section 4.3) and a time complexity study (Section 4.4).

Algorithm 1 Abnormal Adversarial Examples Regularization (AAER)

Input: network f_θ, epochs T, mini-batches M, perturbation radius ϵ, step size α, initialization term η.
1: for t = 1 … T do
2:   for i = 1 … M do
3:     α_t = (t/T) · α
4:     δ = α_t · sign(∇_{x+η} ℓ(x_i + η, y_i; θ))
5:     CEloss = (1/m) Σ_{i=1}^{m} ℓ(x_i + η + δ, y_i; θ)
6:     AAERloss = Eq. (10)
7:     θ = θ − ∇_θ (CEloss + (t/T) · AAERloss)
8:   end for
9: end for
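A hypothetical PyTorch rendering of Algorithm 1 is given below (function names are ours; the AAER term of Eq. (10) is folded inline via the abnormal mask of Eq. (5), and pixel-range clipping is omitted):

```python
import torch
import torch.nn.functional as F

def train_aaer(model, loader, epochs, eps, alpha_max, lam1, lam2, lr=0.2):
    """Sketch of Algorithm 1 with the increasing step size and the t/T
    warm-up factor on the regularizer."""
    opt = torch.optim.SGD(model.parameters(), lr=lr,
                          momentum=0.9, weight_decay=5e-4)
    for t in range(1, epochs + 1):
        alpha = (t / epochs) * alpha_max                # line 3
        for x, y in loader:
            eta = torch.empty_like(x).uniform_(-eps, eps)
            x0 = (x + eta).detach().requires_grad_(True)
            grad = torch.autograd.grad(
                F.cross_entropy(model(x0), y), x0)[0]
            delta = alpha * grad.sign()                 # line 4
            logits0, logits1 = model(x + eta), model(x + eta + delta)
            ce = F.cross_entropy(logits1, y)            # line 5
            # line 6: AAER term (Eq. 10)
            l0 = F.cross_entropy(logits0, y, reduction="none")
            l1 = F.cross_entropy(logits1, y, reduction="none")
            ab = l0 > l1                                # abnormal mask, Eq. (5)
            if ab.any():
                n, m = ab.float().sum(), float(y.numel())
                conf = (l0[ab] - l1[ab]).mean()
                lv = (logits1 - logits0).pow(2).sum(1)
                ref = lv[~ab].mean() if (~ab).any() else lv.new_zeros(())
                reg = (n / m) * (lam1 * conf
                                 + lam2 * torch.clamp(lv[ab].mean() - ref, min=0.0))
            else:
                reg = ce.new_zeros(())
            opt.zero_grad()
            (ce + (t / epochs) * reg).backward()        # line 7
            opt.step()
    return model
```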

4.1. EXPERIMENT SETTING

Baselines. We compare our method with other SSAT methods including RS-FGSM (Wong et al., 2020), ATTA (Zheng et al., 2020), FreeAT (Shafahi et al., 2019), N-FGSM (de Jorge et al., 2022), Grad Align (Andriushchenko & Flammarion, 2020), ZeroGrad and MultiGrad (Golgooni et al., 2021). We also compare with the iterative-step AT methods PGD-2 and PGD-10 (Madry et al., 2017) as references for ideal performance. To accommodate different adversarial budgets, we use PGD-10 with two step sizes, 2/255 and ϵ/10. We report natural and robust accuracy using the hyperparameters from each method's official repository (except for FreeAT, where we do not divide the number of epochs by m, to keep the same number of training epochs). It is worth noting that we do not use early stopping (Wong et al., 2020), as this technique can restore the robustness of all methods. Datasets and Model Architectures. We report results on the benchmark datasets CIFAR-10/100 (Krizhevsky et al., 2009) and use random cropping and horizontal flipping for data augmentation. We use the PreActResNet-18 (He et al., 2016) and WideResNet-34 (Zagoruyko & Komodakis, 2016) architectures on these datasets. The WideResNet-34 training results are available in Appendix B. We also report the settings and results of our method on SVHN (Netzer et al., 2011) and Tiny-ImageNet in Appendix E. Attack Methods and Learning Rate Schedule. To report robust accuracy, we attack all methods using the standard PGD adversarial attack with α = ϵ/4, 10 restarts and 50 attack steps. We also evaluate our method under Auto Attack in Appendix C. We use the cyclical learning rate schedule (Smith, 2017) with 30 epochs, reaching the maximum learning rate (0.2 in our experiments) after half of the epochs (15) on CIFAR-10/100. Setup for Our Proposed Method.
In this work, we use the SGD optimizer with momentum 0.9 and weight decay 5 × 10^-4. We choose L_∞ as the threat model and set the gradient ascent step size α = 1.5 · ϵ. We set η = Uniform(−ϵ, ϵ) for random initialization; the η setting for previous initialization can be found in Appendix A. The best λ settings are shown in Appendix D. It is worth noting that our method can also achieve robustness without tuning hyperparameters across different adversarial budgets; the results with a universal λ are in Appendix D.
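The cyclical schedule described above can be sketched as follows; this is a simplified triangular profile peaking at the midpoint, and the step granularity (epoch vs. iteration) is our assumption:

```python
def cyclic_lr(step, total_steps, lr_max=0.2):
    """Triangular cyclical schedule (Smith, 2017): linear warm-up to
    lr_max at the midpoint, then linear decay back to 0."""
    half = total_steps / 2
    if step <= half:
        return lr_max * step / half
    return lr_max * (total_steps - step) / half
```

With 30 epochs and lr_max = 0.2, the rate rises to 0.2 at epoch 15 and returns to 0 at epoch 30.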

4.2. PERFORMANCE EVALUATION

In this part, we report the experimental results of our method under two settings: AAER_RC (AAER with random initialization and clipped perturbations) and AAER_RUC (AAER with random initialization and unclipped perturbations). The unclipped technique was proposed by de Jorge et al. (2022), who argued that clipping performed after taking a gradient ascent step may make adversarial samples no longer effective. Results for AAER with previous initialization are available in Appendix A.

CIFAR10 Results.

In Table 1, we present an evaluation of the proposed methods against the competing baselines on the CIFAR-10 dataset. First, we can observe that RS-FGSM, ATTA and FreeAT suffer from CO under strong adversaries. We can also observe an interesting phenomenon: some weakly robust methods recover partial robustness at the large noise magnitude 32/255. Table 1 shows that our proposed methods significantly improve robust accuracy and achieve superior performance across different noise magnitudes.

4.3. ABLATION STUDY

In this part, we investigate the impacts of AAER_RC with 16/255 noise magnitude using PreactResNet-18 on the CIFAR-10 dataset under the L_∞ threat model. Optimization Objectives. To verify the effectiveness of our proposed method, we show the change in the three optimization objectives during training in Figure 3. We can observe that the number, prediction confidence variation and logits variation of abnormal adversarial examples are well constrained by AAER throughout training. We also tried simply ignoring abnormal adversarial examples and training only on normal ones. Unfortunately, this does not work, because the abnormal adversarial examples are not the cause of the decision boundary distortion but rather co-occur with it; thus, ignoring them cannot repair existing distortion. λ Selection. We investigate the effect of different values of λ1 and λ2 on natural and robust performance. From Figure 4 (left), we can observe that the effect of λ1 does not seem to be significant. However, it acts as a buffer that prevents AAER from changing too drastically, and we chose λ1 = 8.0 as the preferred setting. From Figure 4 (right), we can observe that the model successfully prevents CO whenever λ2 is non-zero, which shows that our method effectively eliminates CO. Under the same experimental setting, with λ2 varying from 0 to 10.0, choosing λ2 = 5.0 achieves the best robustness.

4.4. TIME COMPLEXITY STUDY

We show the time complexity of different AT methods in Table 2. We can observe that the running time for one epoch of AAER is almost equal to that of RS-FGSM. In contrast, Grad Align and PGD-10 are 2.3 and 4.6 times slower than our method, respectively.

5. CONCLUSION

In this paper, we find that abnormal adversarial examples exhibit anomalous behaviour, i.e., they move farther from the decision boundaries after adding the perturbations generated by the inner maximization process. We empirically show that catastrophic overfitting is closely related to abnormal adversarial examples by analyzing their number and output variation during model training. Motivated by this, we propose a novel and effective method, Abnormal Adversarial Examples Regularization (AAER), which eliminates catastrophic overfitting through a regularizer that suppresses the generation of abnormal adversarial examples. Our approach successfully resolves catastrophic overfitting at different noise magnitudes and achieves state-of-the-art performance with low computational overhead in various settings.

A EXPERIMENT WITH PREVIOUS INITIALIZATION

Most works build perturbations from zero or random initialization, but Zheng et al. (2020) and Liu et al. (2021) found that perturbations are highly transferable between models from adjacent epochs, so they used perturbations from the previous epoch to intensify the effect of the perturbation, formalized as follows:

η_t = (η_{t−1} + δ_{t−1}) · β,

where t is the epoch, η_{t−1} and δ_{t−1} are saved from the previous epoch, and β is a hyperparameter controlling the strength of the initialization. In this part, we show the effect of our method with previous initialization via AAER_PC (AAER with previous initialization and clipped perturbations) and AAER_PUC (AAER with previous initialization and unclipped perturbations). We set β = 0.5 for the previous initialization experiments and report the results on CIFAR-10/100 in Table 3 and Table 4. From Table 3 and Table 4, we can observe that our method with previous initialization still achieves high robustness, even exceeding random initialization in some settings. However, previous initialization has some negative effect on natural accuracy. β Selection. The hyperparameter β determines the strength of the previous initialization perturbations, and the effect of different β values on test accuracy is shown in Figure 5. When β is 0, which is equivalent to zero initialization, increasing β leads to higher natural accuracy; when β is greater than 0.5, increasing β decreases model robustness. Therefore, we set β = 0.5 to achieve the best trade-off between natural and robust test accuracy. Data Augmentation Technique. We note that Zheng et al. (2020) proposed a data augmentation technique, ATTA, which applies different augmentations at each epoch. We add this technique to AAER_PC, as shown in Table 3 and Table 4.
We can observe that ATTA does not improve, or only slightly improves, our method's accuracy, while the training time increases significantly from 30.5 s to 43.1 s. Therefore, AAER does not use the ATTA data augmentation technique.
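The previous-initialization rule η_t = (η_{t−1} + δ_{t−1}) · β above can be sketched as follows (the optional clipping to the ϵ-ball is our assumption, not stated in the text):

```python
import torch

def previous_init(eta_prev, delta_prev, beta=0.5, eps=None):
    """ATTA-style previous initialization (Zheng et al., 2020):
    reuse the perturbation from the previous epoch, scaled by beta."""
    eta = (eta_prev + delta_prev) * beta
    if eps is not None:
        eta = eta.clamp(-eps, eps)   # optional projection (our assumption)
    return eta
```

With β = 0, this reduces to zero initialization, matching the β-selection discussion above.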

B EXPERIMENT WITH WIDERESNET ARCHITECTURE

We also compare the performance of our method using WideResNet-34, which is more complex than PreActResNet-18. The settings are the same as for PreActResNet-18, and we report the results on CIFAR-10/100 in Table 5 and Table 6. From Table 5 and Table 6, we can observe that our method still achieves high robustness on this architecture, although PGD-10 AT seems to better exploit the complex network to achieve higher natural and robust accuracy. It is worth noting, however, that complex networks better reflect the efficiency of our method in terms of training time, while our method achieves comparable robustness.

C EXPERIMENT WITH AUTO ATTACK

In Table 7 and Table 8, we can observe that our method still boosts adversarial robustness under different adversarial attacks. Surprisingly, the unclipped AAER achieves higher robustness with 32/255 noise magnitude under Auto Attack, which differs slightly from the result under the PGD-50-10 attack.

D EXPERIMENT WITH UNIVERSAL λ

It is worth noting that, unlike other SSAT methods (such as Grad Align (Andriushchenko & Flammarion, 2020) and ZeroGrad (Golgooni et al., 2021)), our method can achieve robustness without tuning hyperparameters across different adversarial budgets; the universal λ settings are shown in Table 10. For CIFAR-10, we set λ1 = 8.0 and λ2 = 5.0 for AAER with clipped perturbations, and λ1 = 6.5 and λ2 = 5.0 for AAER with unclipped perturbations. For CIFAR-100, we set λ1 = 7.5 and λ2 = 3.0, and λ1 = 6.5 and λ2 = 2.5, for AAER with clipped and unclipped perturbations, respectively. We report the universal λ results on CIFAR-10/100 in Table 11 and Table 12. We can observe that the results using a universal λ still achieve high robustness on both datasets. The absence of hyperparameter tuning provides our method with unparalleled generality and adaptability. Table 11: CIFAR10: Accuracy of universal AAER with different noise magnitudes using PreactResNet-18 under the L_∞ threat model. The top number is the natural accuracy while the bottom number is the PGD-50-10 accuracy. The results are averaged over 3 random seeds and reported with the standard deviation.



Figure 1: Model robust test accuracy with different noise magnitudes. The red and green lines show the defence against the FGSM and PGD-7-1 adversarial attacks, respectively. The dashed and solid lines correspond to noise magnitudes of 8/255 and 16/255, respectively. The dashed black lines correspond to the 10th epoch, the point at which CO occurs.

Figure 2: Visualization of the classifier decision boundary and training samples. The left panel shows that the training samples generated on the normal decision boundary all belong to the normal adversarial examples (blue), which can mislead the classifier. The middle panel shows that some training samples generated on the distorted decision boundary cannot mislead the classifier; we call these abnormal adversarial examples (red).

Figure 3: The number, prediction confidence variation and logits variation of normal/abnormal adversarial examples and training samples. The left, middle and right panels show the number, prediction confidence variation and logits variation, respectively. The green/red and blue lines represent normal/abnormal adversarial examples and all training samples, respectively. The dashed black lines correspond to the 10th epoch, the point at which CO occurs. The yellow line represents the number, prediction confidence variation and logits variation of abnormal adversarial examples under the AAER method.

Figure 4: Ablation study. The red and green lines are the natural and robust test accuracy, respectively. Left panel: effect of different values of λ1, with λ2 fixed at 5.0. Right panel: effect of different values of λ2, with λ1 fixed at 8.0.

Figure 5: Effect of different values of β. The red and green lines are the natural and robust test accuracy, respectively. This experiment is based on AAER_PC with 16/255 noise magnitude.

CIFAR10/100: Accuracy of different methods and different noise magnitudes using PreActResNet-18 under the L∞ threat model. The left and right panels show the CIFAR10 and CIFAR100 results, respectively. The top number is the natural accuracy and the bottom number is the PGD-50-10 accuracy. The results are averaged over 3 random seeds and reported with the standard deviation.

We also conduct experiments on the CIFAR100 dataset. Note that CIFAR100 is more challenging than CIFAR10, as it has ten times more classes and ten times fewer training images per class. As shown by the results in Table 1, the proposed methods are still able to prevent CO and improve robust accuracy. This verifies that AAER can reliably prevent CO and generalizes across datasets.

CIFAR10 training time on a single NVIDIA Tesla V100 GPU using PreActResNet-18. The results are averaged over 30 epochs.

CIFAR10: Accuracy of different methods and different noise magnitudes using PreActResNet-18 under the L∞ threat model. The top number is the natural accuracy and the bottom number is the PGD-50-10 accuracy. The results are averaged over 3 random seeds and reported with the standard deviation.
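PGD-50-10 denotes a 50-step PGD attack with 10 random restarts. As a rough illustration of the projected iterative update with restarts (toy 1-D setting, hypothetical names, finite-difference gradient instead of backpropagation; not the evaluation code used in the paper):

```python
import math
import random

def loss(model, x, y):
    # binary logistic loss with labels y in {+1, -1}
    return math.log(1.0 + math.exp(-y * model(x)))

def pgd(model, x, y, eps, alpha, steps, restarts, h=1e-5):
    best = x
    for _ in range(restarts):
        x_adv = x + random.uniform(-eps, eps)          # random start in the eps-ball
        for _ in range(steps):
            # finite-difference estimate of dL/dx at the current iterate
            g = (loss(model, x_adv + h, y) - loss(model, x_adv - h, y)) / (2 * h)
            x_adv += alpha * (1.0 if g >= 0 else -1.0)  # signed ascent step
            x_adv = min(x + eps, max(x - eps, x_adv))   # project back into the ball
        if loss(model, x_adv, y) > loss(model, best, y):
            best = x_adv                                # keep the strongest restart
    return best
```

Each restart resamples the starting point inside the eps-ball, and the attack reports the perturbation achieving the highest loss, which is why multi-restart PGD is a stricter test than single-start PGD.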

CIFAR100: Accuracy of different methods and different noise magnitudes using PreActResNet-18 under the L∞ threat model. The top number is the natural accuracy and the bottom number is the PGD-50-10 accuracy. The results are averaged over 3 random seeds and reported with the standard deviation.

CIFAR10: Accuracy of different methods with 8/255 noise magnitude using WideResNet-34 under the L∞ threat model. The results are averaged over 3 random seeds and reported with the standard deviation.

CIFAR100: Accuracy of different methods with 8/255 noise magnitude using WideResNet-34 under the L∞ threat model. The results are averaged over 3 random seeds and reported with the standard deviation.

AutoAttack (Croce & Hein, 2020) is regarded as the most reliable robustness evaluation to date. It is an ensemble of complementary attacks, consisting of three white-box attacks (APGD-CE, APGD-DLR and FAB) and a black-box attack (Square Attack). We report the results on CIFAR10/100 in Table 7 and Table 8.

CIFAR10: Accuracy of different methods and different noise magnitudes using PreActResNet-18 under the L∞ threat model. The numbers are AutoAttack accuracy; the natural accuracy is the same as in Table 1. The results are averaged over 3 random seeds and reported with the standard deviation.

CIFAR100: Accuracy of different methods and different noise magnitudes using PreActResNet-18 under the L∞ threat model. The numbers are AutoAttack accuracy; the natural accuracy is the same as in Table 1. The results are averaged over 3 random seeds and reported with the standard deviation.

CIFAR10: The best and universal settings for different noise magnitudes. The last panel shows the universal λ setting; the other panels show the best λ settings. The top number is λ1 and the bottom number is λ2.

CIFAR100: The best and universal settings for different noise magnitudes. The last panel shows the universal λ setting; the other panels show the best λ settings. The top number is λ1 and the bottom number is λ2.

SVHN: The best setting for different noise magnitudes. The top number is λ1 while the bottom number is λ2.

SVHN: Accuracy of different methods and different noise magnitudes using PreActResNet-18 under the L∞ threat model. The top number is the natural accuracy and the bottom number is the PGD-50-10 accuracy. The results are averaged over 3 random seeds and reported with the standard deviation.

Tiny-ImageNet: Accuracy of different methods with 8/255 noise magnitude using PreActResNet-18 under the L∞ threat model. The results are averaged over 3 random seeds and reported with the standard deviation.

