ELIMINATING CATASTROPHIC OVERFITTING VIA ABNORMAL ADVERSARIAL EXAMPLES REGULARIZATION

Abstract

Single-step adversarial training (SSAT) has been shown to defend against iterative-step adversarial attacks, achieving both efficiency and robustness. However, SSAT suffers from catastrophic overfitting (CO) under strong adversaries: the classifier's decision boundaries become highly distorted, and robust accuracy against iterative-step adversarial attacks suddenly drops from its peak to nearly 0% within a few epochs. In this work, we find that some adversarial examples generated on networks trained with SSAT exhibit anomalous behaviour: although the training data is produced by the inner maximization process, the loss of some adversarial examples decreases instead, and we call these abnormal adversarial examples. Furthermore, optimizing the network on these abnormal adversarial examples further accelerates the distortion of the model's decision boundaries, and correspondingly, the number of abnormal adversarial examples increases sharply as CO occurs. These observations motivate us to eliminate CO by hindering the generation of abnormal adversarial examples. Specifically, we design a novel method, Abnormal Adversarial Examples Regularization (AAER), which explicitly regularizes the number of abnormal adversarial examples and the variation of their outputs, thereby hindering the model from generating them. Extensive experiments demonstrate that our method eliminates CO and further boosts adversarial robustness under strong adversaries.

1. INTRODUCTION

In recent years, Deep Neural Networks (DNNs) have performed impressively in various fields, such as autonomous driving (Litman, 2017), face recognition (Sharif et al., 2016) and medical imaging diagnosis (Buch et al., 2018). However, DNNs have been found to be vulnerable to adversarial examples (Szegedy et al., 2013). Although these adversarial examples are imperceptible to the human eye, they can lead to a completely different prediction in DNNs. To this end, many adversarial defense methods have been proposed, such as verification and provable defenses (Katz et al., 2017), preprocessing techniques (Guo et al., 2017), detection algorithms (Metzen et al., 2017) and adversarial training (AT) (Goodfellow et al., 2014). Among them, AT is considered to be one of the most effective methods against adversarial attacks (Athalye et al., 2018). However, standard iterative-step AT significantly increases computational overhead due to multiple steps of forward and backward propagation. Therefore, some works attempt to improve vanilla single-step adversarial training (SSAT) to defend against iterative-step adversarial attacks while maintaining both efficiency and robustness. Unfortunately, a serious problem -- catastrophic overfitting (CO) -- occurs with stronger adversaries (Wong et al., 2020). This strange phenomenon means that the robust accuracy of the model against iterative-step adversarial attacks suddenly drops from its peak to nearly zero within a few epochs, as shown in Figure 1. This intriguing phenomenon has been widely investigated and has led to many works aiming to resolve CO. Recently, Kim et al. (2021) pointed out that networks in which CO occurs are generally accompanied by highly distorted decision boundaries. However, the interaction between distorted decision boundaries and CO remains unclear. In this work, we delve into the dynamic effects between CO and decision boundary distortion.
Specifically, we find that some adversarial examples generated on a network with distorted decision boundaries exhibit anomalous behavior,

