ELIMINATING CATASTROPHIC OVERFITTING VIA ABNORMAL ADVERSARIAL EXAMPLES REGULARIZATION

Abstract

Single-step adversarial training (SSAT) has been shown to defend against iterative-step adversarial attacks while achieving both efficiency and robustness. However, SSAT suffers from catastrophic overfitting (CO) under strong adversaries: the classifier's decision boundaries become highly distorted, and robust accuracy against iterative-step adversarial attacks suddenly drops from its peak to nearly 0% within a few epochs. In this work, we find that some adversarial examples generated on a network trained with SSAT exhibit anomalous behaviour: although the training data are produced by the inner maximization process, the loss of some adversarial examples decreases instead. We call these abnormal adversarial examples. Worse, optimizing the network on these abnormal adversarial examples further accelerates the distortion of its decision boundaries, and correspondingly, the number of abnormal adversarial examples increases sharply as CO sets in. These observations motivate us to eliminate CO by hindering the generation of abnormal adversarial examples. Specifically, we design a novel method, Abnormal Adversarial Examples Regularization (AAER), which explicitly regularizes the number and the output variation of abnormal adversarial examples to hinder the model from generating them. Extensive experiments demonstrate that our method eliminates CO and further boosts adversarial robustness under strong adversaries.

1. INTRODUCTION

In recent years, Deep Neural Networks (DNNs) have performed impressively in various fields, such as autonomous driving (Litman, 2017), face recognition (Sharif et al., 2016) and medical imaging diagnosis (Buch et al., 2018). However, DNNs have been found to be vulnerable to adversarial examples (Szegedy et al., 2013). Although these adversarial examples are imperceptible to the human eye, they can lead to completely different predictions by DNNs. To this end, many adversarial defense methods have been proposed, such as verification and provable defenses (Katz et al., 2017), preprocessing techniques (Guo et al., 2017), detection algorithms (Metzen et al., 2017) and adversarial training (AT) (Goodfellow et al., 2014). Among them, AT is considered one of the most effective methods against adversarial attacks (Athalye et al., 2018). However, standard iterative-step AT significantly increases computational overhead due to multiple forward and backward propagation steps. Therefore, some works attempt to improve vanilla single-step adversarial training (SSAT) to defend against iterative-step adversarial attacks while maintaining efficiency and robustness. Unfortunately, a serious problem, catastrophic overfitting (CO), occurs with stronger adversaries (Wong et al., 2020). This strange phenomenon means that the robust accuracy of the model against iterative-step adversarial attacks suddenly drops from its peak to nearly zero within a few epochs, as shown in Figure 1.

This intriguing phenomenon has been widely investigated and has led to many works aiming to resolve CO. Recently, Kim et al. (2021) pointed out that networks in which CO occurs generally exhibit highly distorted decision boundaries. However, the interaction between distorted decision boundaries and CO remains unclear. In this work, we delve into the dynamic interplay between CO and decision boundary distortion. Specifically, we find that some adversarial examples generated on a network with distorted decision boundaries exhibit anomalous behavior: although all training samples are generated by the inner maximization process, the loss of some adversarial examples decreases instead. We refer to these training samples as abnormal adversarial examples. To make matters worse, the decision boundary distortion is further exacerbated by optimizing the classifier directly on these abnormal adversarial examples, and the number of abnormal adversarial examples sharply increases as a result, creating a vicious circle between the number of abnormal adversarial examples and the distortion of decision boundaries. All these atypical findings raise a question:

Can CO be prevented by hindering the generation of abnormal adversarial examples?
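To make the observation behind this question concrete, the following is a minimal sketch (in PyTorch) of how abnormal adversarial examples can be identified: an adversarial example is flagged as abnormal when its loss is lower than that of the corresponding clean example, even though it was produced by the inner maximization step. The names `model` and `eps`, and the use of a single FGSM step, are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def abnormal_mask(model, x, y, eps=8 / 255):
    """Return a boolean mask marking abnormal adversarial examples in a batch."""
    x = x.clone().detach().requires_grad_(True)
    clean_loss = F.cross_entropy(model(x), y, reduction="none")
    grad = torch.autograd.grad(clean_loss.sum(), x)[0]

    # Single-step (FGSM) inner maximization: ascend the loss surface.
    x_adv = (x + eps * grad.sign()).clamp(0.0, 1.0).detach()
    adv_loss = F.cross_entropy(model(x_adv), y, reduction="none")

    # Abnormal: the loss decreases despite the loss-ascent perturbation.
    return adv_loss < clean_loss
```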

To answer the above question, we design a novel method, Abnormal Adversarial Examples Regularization (AAER), which incorporates a regularizer that prevents CO by suppressing the generation of abnormal adversarial examples. Specifically, AAER consists of two key components: (i) the number and (ii) the output variation of abnormal adversarial examples. The first component (i) counts the abnormal samples by splitting the training batch into normal and abnormal adversarial examples according to the anomalous loss-decrease behavior. The second component (ii) covers prediction-confidence and logit variation, measuring the difference between the two groups with cross-entropy and Euclidean distance, respectively. AAER then explicitly regularizes the number and output variation of abnormal adversarial examples through these two components, hindering the model from generating them (an illustrative sketch is given after the contribution list below). Extensive experiments show that our method eliminates CO and further improves adversarial robustness. Notably, our method does not require any extra example-generation or backward-propagation passes, which keeps its computational overhead negligible. Our major contributions are summarized as follows:

• We find that some training samples exhibit anomalous loss variation during the inner maximization process. Moreover, the number of such abnormal adversarial examples increases sharply with CO, and optimizing the classifier directly on them further exacerbates the distortion of its decision boundaries.

• Based on this observation, we propose a novel method, Abnormal Adversarial Examples Regularization (AAER), which explicitly regularizes the number of abnormal adversarial examples and their anomalous output variation to hinder their generation. Extensive experiments demonstrate that our method prevents CO and automatically adapts to different noise magnitudes without hyperparameter tuning.

• We evaluate the effectiveness of our method across different adversarial budgets, adversarial attacks, datasets and network architectures, showing that it consistently achieves state-of-the-art robust accuracy in SSAT and obtains robustness comparable to standard iterative-step AT with only negligible computational overhead.
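The sketch below illustrates how the two components described above might be assembled into a single penalty term. Only the three ingredients come from the text (the count of abnormal examples, confidence variation measured with cross-entropy, and logit variation measured with Euclidean distance); the exact functional forms and the weights lam_num, lam_conf and lam_logit are hypothetical choices, not the paper's definitive formulation.

```python
import torch
import torch.nn.functional as F

def aaer_penalty(logits_clean, logits_adv, y, abnormal,
                 lam_num=1.0, lam_conf=1.0, lam_logit=1.0):
    """Penalty discouraging the model from producing abnormal adversarial examples.

    abnormal: boolean mask from the loss-decrease criterion (see the earlier sketch).
    """
    if abnormal.sum() == 0:
        return logits_adv.new_zeros(())

    # (i) Fraction of abnormal adversarial examples in the batch.
    num_term = abnormal.float().mean()

    # (ii-a) Confidence variation of abnormal examples, measured via cross-entropy:
    # penalize the amount by which the loss decreased under the perturbation.
    ce_clean = F.cross_entropy(logits_clean[abnormal], y[abnormal], reduction="none")
    ce_adv = F.cross_entropy(logits_adv[abnormal], y[abnormal], reduction="none")
    conf_term = (ce_clean - ce_adv).clamp(min=0.0).mean()

    # (ii-b) Logit variation of abnormal examples, measured with Euclidean distance.
    logit_term = (logits_adv[abnormal] - logits_clean[abnormal]).norm(dim=1).mean()

    return lam_num * num_term + lam_conf * conf_term + lam_logit * logit_term
```

In this hypothetical setup, the penalty would be added to the ordinary SSAT training loss, so that gradients push the model away from parameter regions that produce many abnormal adversarial examples, without any additional example-generation or backward pass beyond those already used by SSAT.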



Figure 1: Robust test accuracy of the model under different noise magnitudes. The red and green lines show robustness against FGSM and PGD-7-1 adversarial attacks, respectively. The dashed and solid lines correspond to noise magnitudes of 8/255 and 16/255, respectively. The dashed black line marks the 10th epoch, the point at which CO occurs.

