CATASTROPHIC OVERFITTING IS A BUG BUT IT IS CAUSED BY FEATURES

Abstract

Adversarial training (AT) is the de facto method to build robust neural networks, but it is computationally expensive. To overcome this, fast single-step attacks can be used, but doing so is prone to catastrophic overfitting (CO). This is when networks gain non-trivial robustness during the first stages of AT, but then reach a breaking point where they become vulnerable in just a few iterations. Although some works have succeeded at preventing CO, the different mechanisms that lead to this failure mode are still poorly understood. In this work, we study the onset of CO in single-step AT methods through controlled modifications of typical datasets of natural images. In particular, we show that CO can be induced by injecting the images with seemingly innocuous features that are very useful for non-robust classification but need to be combined with other features to obtain a robust classifier. This new perspective provides important insights into the mechanisms that lead to CO and improves our understanding of the general dynamics of adversarial training.

1. INTRODUCTION

Deep neural networks are sensitive to imperceptible worst-case perturbations, also known as adversarial perturbations (Szegedy et al., 2014). As a consequence, training neural networks that are robust to such perturbations has been an active area of study in recent years (see Ortiz-Jiménez et al. (2021) for a review). In particular, a prominent line of research, referred to as adversarial training (AT), focuses on online data augmentation with adversarial samples during training. However, it is well known that finding these adversarial samples for deep neural networks is an NP-hard problem (Weng et al., 2018). In practice, this is usually overcome with various methods, referred to as adversarial attacks, that find approximate solutions to this hard problem. The most popular attacks are based on projected gradient descent (PGD) (Madry et al., 2018), a computationally expensive algorithm that requires multiple steps of forward and backward passes through the neural network to approximate the solution. This hinders its use in many large-scale applications, motivating the use of alternative efficient single-step attacks (Goodfellow et al., 2015; Shafahi et al., 2019; Wong et al., 2020). The use of computationally efficient single-step attacks within AT, however, raises concerns regarding its stability: although there is an initial increase in robustness during training, the networks often reach a breaking point beyond which they lose all gained robustness in just a few iterations (Wong et al., 2020). This phenomenon is known as catastrophic overfitting (CO) (Wong et al., 2020; Andriushchenko & Flammarion, 2020).
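To make the single-step versus multi-step distinction concrete, the following is a minimal sketch (not the paper's experimental setup) contrasting an FGSM-style single-step attack with multi-step PGD under an L-infinity constraint. The toy linear model, loss, and all parameter values are illustrative assumptions; in practice the gradient would come from backpropagation through a deep network.

```python
import numpy as np

# Toy linear classifier f(x) = w.x with margin loss L(x) = -y * w.x,
# so the input gradient dL/dx = -y * w is available in closed form.
# This is an illustrative assumption, not the paper's architecture.

def grad_loss(w, x, y):
    # Gradient of the loss with respect to the input x.
    return -y * w

def fgsm(w, x, y, eps):
    # Single-step attack: one move of size eps along the gradient sign.
    return x + eps * np.sign(grad_loss(w, x, y))

def pgd(w, x, y, eps, alpha, steps):
    # Multi-step attack: several smaller sign steps of size alpha,
    # projecting back into the L-inf ball of radius eps each iteration.
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_loss(w, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # L-inf projection
    return x_adv

w = np.array([1.0, -2.0])
x = np.array([0.5, 0.5])
y = 1.0
x_fgsm = fgsm(w, x, y, eps=0.1)
x_pgd = pgd(w, x, y, eps=0.1, alpha=0.04, steps=5)
# For this linear toy model the gradient is constant, so both attacks
# reach the same corner of the eps-ball; for deep networks they differ,
# and PGD costs 'steps' times as many forward/backward passes.
```

The projection step (`np.clip`) is what keeps the PGD iterate inside the allowed perturbation budget; the computational gap between the two attacks is simply the number of gradient evaluations, which is what motivates single-step AT despite its susceptibility to CO.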
Nevertheless, given the clear computational advantage of using single-step attacks during AT, a significant body of work has been dedicated to finding ways to circumvent CO via regularization and data augmentation (Andriushchenko & Flammarion, 2020; Vivek & Babu, 2020; Kim et al., 2021; Park & Lee, 2021; Golgooni et al., 2021; de Jorge et al., 2022). Despite the recent methodological advances on this front, the root cause of CO remains poorly understood. Due to the inherent complexity of this problem, we argue that identifying the causal mechanisms behind CO cannot be done through observations alone and requires active interventions (Ilyas et al., 2019). That is, we need to be able to synthetically induce CO in settings where it would not naturally happen otherwise. In this work, we identify one such type of intervention that allows us to perform abundant experiments to explain multiple aspects of CO. Specifically, the main contributions of our work are: (i) We show that CO can be induced by injecting features that, despite being strongly discriminative (i.e. useful for standard classification), are not sufficient for robust classification (see Fig. 1). (ii) Through

