UNDERSTANDING CATASTROPHIC OVERFITTING IN FAST ADVERSARIAL TRAINING FROM A NON-ROBUST FEATURE PERSPECTIVE

Abstract

To make adversarial training (AT) computationally efficient, FGSM AT has attracted significant attention. The fast speed, however, comes at the cost of catastrophic overfitting (CO), whose cause remains unclear. Prior works mainly study the phenomenon of a significant PGD accuracy (Acc) drop to understand CO, while paying less attention to FGSM Acc. We highlight an intriguing CO phenomenon, namely that FGSM Acc is higher than the accuracy on clean samples, and apply the non-robust feature (NRF) perspective to understand it. Our investigation of CO, which extends the existing NRF framework into a fine-grained categorization, suggests that there exists a certain type of NRF whose usefulness is increased after FGSM attack, and that CO in FGSM AT can be seen as a dynamic process of learning such NRFs. Therefore, the key to preventing CO lies in reducing their usefulness under FGSM AT, which sheds new light on the success of a SOTA technique for mitigating CO.

1. INTRODUCTION

Despite impressive performance, deep neural networks (DNNs) (LeCun et al., 2015; He et al., 2016; Huang et al., 2017; Zhang et al., 2019a; 2021) are widely recognized to be vulnerable to adversarial examples (Szegedy et al., 2013; Biggio et al., 2013; Akhtar & Mian, 2018). Without giving a false sense of robustness against adversarial attacks (Carlini & Wagner, 2017; Athalye et al., 2018; Croce & Hein, 2020), adversarial training (AT) (Madry et al., 2018; Zhang et al., 2019c) has become the de facto standard approach for obtaining an adversarially robust model by solving a min-max problem in a two-step manner. Specifically, it first generates adversarial examples by maximizing the loss, then trains the model on the generated adversarial examples by minimizing the loss. PGD-N AT (Madry et al., 2018; Zhang et al., 2019c) is a classical AT method, where N is the number of iteration steps for generating the adversarial samples in the inner maximization. Notably, PGD-N AT is N times slower than its standard-training counterpart on clean samples. A straightforward approach to make AT faster is to set N to 1, i.e., reducing the attack in the inner maximization from multi-step PGD to single-step FGSM (Goodfellow et al., 2015). For simplicity, PGD-based AT and FGSM-based fast AT are termed PGD AT and FGSM AT, respectively. FGSM AT often fails with a sudden robustness drop against PGD attack while maintaining its robustness against FGSM attack, which is called catastrophic overfitting (CO) (Wong et al., 2020). With Standard Acc denoting the accuracy on clean samples, and FGSM Acc and PGD Acc denoting the accuracy under FGSM and PGD attack respectively, we emphasize that a CO model is characterized by two main phenomena:

• Phenomenon 1: The PGD Acc drops to a value close to zero when CO happens (Wong et al., 2020; Andriushchenko & Flammarion, 2020).

• Phenomenon 2: FGSM Acc is higher than Standard Acc for a CO model (Kim et al., 2020; Andriushchenko & Flammarion, 2020).
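The contrast between the single-step FGSM and multi-step PGD inner maximization can be sketched as follows. This is a minimal illustration on a hypothetical numpy logistic model (the models attacked in this paper are deep networks), not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad_x(x, y, w, b):
    # Gradient w.r.t. the input x of the logistic loss
    # l = -log sigmoid(y * (w @ x + b)), with label y in {-1, +1}.
    z = w @ x + b
    return -y * sigmoid(-y * z) * w

def fgsm_attack(x, y, w, b, eps):
    # Single-step FGSM: one signed-gradient ascent step of size eps.
    return x + eps * np.sign(loss_grad_x(x, y, w, b))

def pgd_attack(x, y, w, b, eps, alpha, n_steps):
    # PGD-N: n_steps signed steps of size alpha, each followed by
    # projection back onto the L_inf ball of radius eps around x.
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = x_adv + alpha * np.sign(loss_grad_x(x_adv, y, w, b))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv
```

PGD-N costs N gradient computations per example, which is why setting N = 1 (FGSM) makes AT roughly N times faster, at the risk of CO.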
Multiple works (Wong et al., 2020; Kim et al., 2020; Andriushchenko & Flammarion, 2020) study CO mainly through the lens of Phenomenon 1. To understand Phenomenon 2 in CO from the NRF perspective, we conjecture that there exists a type of NRF whose usefulness is increased under FGSM attack and can thus lead to a higher FGSM Acc than Standard Acc (Phenomenon 2). In other words, if such a type of NRF (NRF2 in the following categorization) exists, Phenomenon 2 can be justified. Considering whether feature usefulness is decreased or increased under FGSM attack, we propose a direction-based NRF categorization where NRF2 (NRF1) leads to an increase (decrease) of classification accuracy under FGSM attack. To prove the existence of NRF2, we follow the procedure for verifying the existence of NRFs in (Ilyas et al., 2019). Moreover, we show that NRF2 can cause a significant PGD Acc drop, which also helps justify Phenomenon 1 in CO. Overall, towards understanding CO in FGSM AT, our contributions are summarized as follows:

• Our work shifts the previous focus on PGD Acc in Phenomenon 1 to FGSM Acc in Phenomenon 2 for understanding CO. Given that NRF is a popular perspective on adversarial vulnerability, we make the first attempt to apply it to explain Phenomenon 2.

• We extend the existing NRF framework under PGD attack (Ilyas et al., 2019) to a more fine-grained NRF categorization based on FGSM attack. We verify the existence of NRF2 and show that its existence well justifies Phenomenon 2 (as well as Phenomenon 1).

• Very recent works show that adding noise to the image input achieves SOTA performance for FGSM AT. However, the mechanism by which such a simple technique prevents CO remains not fully clear; our NRF2 perspective sheds new light on its success.



2. PROBLEM OVERVIEW AND RELATED WORK

2.1 FGSM AT AND EXPERIMENTAL SETUPS

Let D denote a data distribution with (x, y) pairs and f(·, θ) parameterized by θ denote a deep model. For standard training, the model f(·, θ) is trained on D by minimizing E_{(x,y)∼D}[l(f(x, θ), y)], where l indicates a cross-entropy loss for a typical multi-class classification task. Adversarial training
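As a minimal sketch of the two-step min-max AT procedure described in the introduction, again using a hypothetical numpy logistic model standing in for f(·, θ) (the inner maximization here uses single-step FGSM, i.e. FGSM AT):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_example(x, y, w, b, eps):
    # Step 1 (inner maximization, approximate): one signed-gradient
    # step of size eps on the input, for label y in {-1, +1}.
    z = w @ x + b
    grad_x = -y * sigmoid(-y * z) * w       # d loss / d x
    return x + eps * np.sign(grad_x)

def at_epoch(X, Y, w, b, eps, lr):
    # Step 2 (outer minimization): gradient descent on the model
    # parameters, evaluated on the generated adversarial examples.
    for x, y in zip(X, Y):
        x_adv = fgsm_example(x, y, w, b, eps)
        z = w @ x_adv + b
        coeff = -y * sigmoid(-y * z)        # d loss / d z
        w = w - lr * coeff * x_adv
        b = b - lr * coeff
    return w, b
```

Replacing `fgsm_example` with an N-step PGD attack recovers PGD-N AT, at N times the cost per epoch.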

A follow-up work, GradAlign (Andriushchenko & Flammarion, 2020), approaches CO by demonstrating a co-occurrence of local non-linearity and the PGD Acc drop. However, these understandings cannot explain why FGSM Acc is higher than Standard Acc for a CO model (Phenomenon 2).

The NRF perspective (Ilyas et al., 2019) constitutes an essential tool for understanding adversarial vulnerability, to which CO is also directly related. Specifically, the authors of (Ilyas et al., 2019) define the positive correlation between features and true labels as feature usefulness (see Section 3.1 for detailed definitions). The adversarial vulnerability of DNNs is then attributed to the existence of non-robust features (NRFs), which can be made anti-correlated with the true label under an adversary. This understanding of NRFs in (Ilyas et al., 2019) aligns well with the fact that a CO model achieves close-to-zero robustness against PGD attack, and thus motivates us to believe that the NRF perspective might be an auspicious direction for understanding CO in FGSM AT.

The NRF in (Ilyas et al., 2019) is defined with the PGD attack, which we follow in this work; however, we extend their NRF framework by additionally considering the FGSM attack for a fine-grained categorization. Considering the difference in adversarial strength between the FGSM and PGD attacks, GradAlign (Andriushchenko & Flammarion, 2020) explains Phenomenon 1 by demonstrating how well each attack variant (FGSM or PGD) solves the inner maximization problem in AT. We start our investigation by providing an alternative interpretation of this strength difference between the two attack variants within the NRF framework (Ilyas et al., 2019), which we term the strength-based NRF categorization.
Despite aligning well with Phenomenon 1, we find that this strength-based categorization cannot explain Phenomenon 2: the usefulness of these NRFs decreases under FGSM attack, leading to a decrease (rather than the increase seen in Phenomenon 2) of classification accuracy on FGSM adversarial examples relative to clean samples.
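The notion of feature usefulness behind this categorization can be illustrated in toy form. Following (Ilyas et al., 2019), a feature is useful if it correlates positively with the true label, i.e. E[y · f(x)] > 0, and non-robust if a bounded perturbation can make that correlation negative. The sketch below uses illustrative numpy code with made-up numbers, not the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
y = rng.choice([-1, 1], size=n)       # true labels in {-1, +1}

# A weakly label-correlated (non-robust) feature: small signal + noise.
feat = 0.05 * y + rng.normal(0.0, 1.0, size=n)

def usefulness(f, y):
    # Empirical estimate of E[y * f(x)]: positive means "useful".
    return float(np.mean(y * f))

eps = 0.1                             # perturbation budget
feat_adv = feat - eps * y             # worst-case shift against the label

clean_u = usefulness(feat, y)         # positive: weakly useful
adv_u = usefulness(feat_adv, y)       # negative: anti-correlated under attack
```

Because the clean correlation (0.05) is smaller than the budget eps (0.1), the adversary can drive the feature's usefulness negative; such features account for the PGD Acc drop, whereas NRF2 is defined by the opposite behavior (usefulness increasing) under FGSM attack.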

