UNDERSTANDING CATASTROPHIC OVERFITTING IN FAST ADVERSARIAL TRAINING FROM A NON-ROBUST FEATURE PERSPECTIVE

Abstract

To make adversarial training (AT) computationally efficient, FGSM AT has attracted significant attention. Its fast speed, however, comes at the cost of catastrophic overfitting (CO), whose cause remains unclear. Prior works mainly study the significant drop in PGD accuracy (Acc) to understand CO while paying less attention to FGSM Acc. We highlight an intriguing CO phenomenon: FGSM Acc is higher than the accuracy on clean samples, and we attempt to apply the non-robust feature (NRF) framework to understand it. By extending the existing NRF notion into a fine-grained categorization, our investigation of CO suggests that there exists a certain type of NRF whose usefulness increases after an FGSM attack, and that CO in FGSM AT can be seen as a dynamic process of learning such NRFs. Therefore, the key to preventing CO lies in reducing their usefulness under FGSM AT, which sheds new light on understanding the success of a SOTA technique for mitigating CO.

1. INTRODUCTION

Despite their impressive performance, deep neural networks (DNNs) (LeCun et al., 2015; He et al., 2016; Huang et al., 2017; Zhang et al., 2019a; 2021) are widely recognized to be vulnerable to adversarial examples (Szegedy et al., 2013; Biggio et al., 2013; Akhtar & Mian, 2018). Without giving a false sense of robustness against adversarial attacks (Carlini & Wagner, 2017; Athalye et al., 2018; Croce & Hein, 2020), adversarial training (AT) (Madry et al., 2018; Zhang et al., 2019c) has become the de facto standard approach for obtaining an adversarially robust model by solving a min-max problem in a two-step manner. Specifically, it first generates adversarial examples by maximizing the loss, then trains the model on the generated adversarial examples by minimizing the loss. PGD-N AT (Madry et al., 2018; Zhang et al., 2019c) is a classical AT method, where N is the number of iterations used to generate the adversarial samples in the inner maximization. Notably, PGD-N AT is roughly N times slower than standard training on clean samples. A straightforward way to make AT faster is to set N to 1, i.e., to reduce the attack in the inner maximization from multi-step PGD to single-step FGSM (Goodfellow et al., 2015). For simplicity, PGD-based AT and FGSM-based fast AT are termed PGD AT and FGSM AT, respectively. FGSM AT often fails with a sudden drop in robustness against PGD attack while maintaining its robustness against FGSM attack, a failure mode called catastrophic overfitting (CO) (Wong et al., 2020). With Standard Acc denoting the accuracy on clean samples, and FGSM Acc and PGD Acc denoting the accuracy under FGSM and PGD attack, we emphasize that a CO model is characterized by two main phenomena:

• Phenomenon 1: The PGD Acc drops to a value close to zero when CO happens (Wong et al., 2020; Andriushchenko & Flammarion, 2020).

• Phenomenon 2: FGSM Acc is higher than Standard Acc for a CO model (Kim et al., 2020; Andriushchenko & Flammarion, 2020).
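The two-step min-max procedure described above can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's implementation: the function names, the choice of hyperparameters (`eps`, `alpha`, `n_steps`), and the assumption that inputs lie in [0, 1] are ours. The N-step loop in `pgd_attack` is what makes PGD-N AT roughly N times slower than standard training; `fgsm_at_step` collapses that loop into a single maximization step.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, alpha, n_steps):
    # Inner maximization with N iterations: each step takes one
    # forward/backward pass, so PGD-N costs ~N extra passes per batch.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(n_steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        # Project back into the eps-ball around x and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def fgsm_at_step(model, x, y, eps, optimizer):
    # Single-step FGSM replaces the N-step loop above: one maximization
    # step of size eps, then one minimization step on the adversarial batch.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    x_adv = (x + eps * grad.sign()).clamp(0, 1).detach()
    # Outer minimization: update the model on the generated examples.
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(x_adv), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```

Evaluating a model trained this way under both attacks (FGSM Acc vs. PGD Acc) is exactly how the two CO phenomena above are detected in practice.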
Multiple works (Wong et al., 2020; Kim et al., 2020; Andriushchenko & Flammarion, 2020) have focused on understanding CO by explaining the drop in PGD Acc in Phenomenon 1; however, they pay less attention to Phenomenon 2 regarding FGSM Acc. Specifically for Phenomenon 1, FGSM-RS (Wong et al., 2020) attributes it to the lack of perturbation diversity in FGSM AT, which

