CATASTROPHIC OVERFITTING IS A BUG BUT IT IS CAUSED BY FEATURES

Abstract

Adversarial training (AT) is the de facto method to build robust neural networks, but it is computationally expensive. To overcome this, fast single-step attacks can be used, but doing so is prone to catastrophic overfitting (CO). This is when networks gain non-trivial robustness during the first stages of AT, but then reach a breaking point where they become vulnerable in just a few iterations. Although some works have succeeded at preventing CO, the different mechanisms that lead to this failure mode are still poorly understood. In this work, we study the onset of CO in single-step AT methods through controlled modifications of typical datasets of natural images. In particular, we show that CO can be induced when injecting the images with seemingly innocuous features that are very useful for non-robust classification but need to be combined with other features to obtain a robust classifier. This new perspective provides important insights into the mechanisms that lead to CO and improves our understanding of the general dynamics of adversarial training.

1. INTRODUCTION

Deep neural networks are sensitive to imperceptible worst-case perturbations, also known as adversarial perturbations (Szegedy et al., 2014). As a consequence, training neural networks that are robust to such perturbations has been an active area of study in recent years (see Ortiz-Jiménez et al. (2021) for a review). In particular, a prominent line of research, referred to as adversarial training (AT), focuses on online data augmentation with adversarial samples during training. However, it is well known that finding these adversarial samples for deep neural networks is an NP-hard problem (Weng et al., 2018). In practice, this is usually overcome with various methods, referred to as adversarial attacks, that find approximate solutions to this hard problem. The most popular attacks are based on projected gradient descent (PGD) (Madry et al., 2018), a computationally expensive algorithm that requires multiple forward and backward passes through the neural network to approximate the solution. This hinders its use in many large-scale applications, motivating the use of alternative, efficient single-step attacks (Goodfellow et al., 2015; Shafahi et al., 2019; Wong et al., 2020). The use of computationally efficient single-step attacks within AT, however, comes with concerns regarding its stability: although there is an initial increase in robustness during training, networks often reach a breaking point beyond which they lose all gained robustness in just a few iterations (Wong et al., 2020). This phenomenon is known as catastrophic overfitting (CO) (Wong et al., 2020; Andriushchenko & Flammarion, 2020).
Nevertheless, given the clear computational advantage of using single-step attacks during AT, a significant body of work has been dedicated to finding ways to circumvent CO via regularization and data augmentation (Andriushchenko & Flammarion, 2020; Vivek & Babu, 2020; Kim et al., 2021; Park & Lee, 2021; Golgooni et al., 2021; de Jorge et al., 2022). Despite the recent methodological advances on this front, the root cause of CO remains poorly understood. Due to the inherent complexity of this problem, we argue that identifying the causal mechanisms behind CO cannot be done through observations alone and requires active interventions (Ilyas et al., 2019). That is, we need to be able to synthetically induce CO in settings where it would not otherwise happen. In this work, we identify one such type of intervention that allows us to perform abundant experiments to explain multiple aspects of CO. Specifically, the main contributions of our work are: (i) we show that CO can be induced by injecting features that, despite being strongly discriminative (i.e., useful for standard classification), are not sufficient for robust classification (see Fig. 1); (ii) through extensive empirical analysis, we discover that CO is connected to the preference of the network to learn different features in a dataset; (iii) building upon these insights, we describe and analyse a causal chain of events that can lead to CO. The main message of our paper is:

Catastrophic overfitting is a learning shortcut used by the network to avoid learning complex robust features while achieving high accuracy using easy non-robust ones.

Our findings improve our understanding of CO by focusing on how data influences AT. Moreover, they also provide insights into the dynamics of AT, in which the interaction between robust and non-robust features plays a key role.

Outline. In Section 2, we give an overview of the related work on CO.
Section 3 presents our main observation: CO can be induced by manipulating the data distribution. In Section 4, we perform an in-depth analysis of this phenomenon to identify the causes of CO. Finally, in Section 5, we use our new perspective to provide new insights into the different ways CO can be prevented.

2. PRELIMINARIES AND RELATED WORK

Let $f_\theta : \mathbb{R}^d \to \mathcal{Y}$ denote a neural network architecture parameterized by a set of weights $\theta \in \mathbb{R}^n$, which maps input samples $x \in \mathbb{R}^d$ to labels $y \in \mathcal{Y} = \{1, \ldots, c\}$. The objective of adversarial training (AT) is to find the network parameters $\theta \in \mathbb{R}^n$ that optimize the following min-max problem:
$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \max_{\|\delta\|_p \leq \epsilon} \mathcal{L}(f_\theta(x + \delta), y) \right],$$
where $\mathcal{D}$ is some data distribution, $\delta \in \mathbb{R}^d$ represents an adversarial perturbation, and $p, \epsilon$ characterize the adversary. This is typically solved by alternately minimizing the outer objective and maximizing the inner one via first-order optimization procedures. The outer minimization is tackled via some standard optimizer, e.g., SGD, while the inner maximization problem is approximated with adversarial attacks like the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) and Projected Gradient Descent (PGD) (Madry et al., 2018). Single-step AT methods are built on top of FGSM. In particular, FGSM solves the linearised version of the inner maximization objective. When $p = \infty$, this leads to:
$$\delta_{\text{FGSM}} = \operatorname*{argmax}_{\|\delta\|_\infty \leq \epsilon} \; \mathcal{L}(f_\theta(x), y) + \delta^\top \nabla_x \mathcal{L}(f_\theta(x), y) = \epsilon \operatorname{sign}\left(\nabla_x \mathcal{L}(f_\theta(x), y)\right).$$
Note that FGSM is very computationally efficient, as it only requires a single forward-backward step. Unfortunately, FGSM-AT generally yields networks that are vulnerable to multi-step attacks such as PGD. In particular, Wong et al. (2020) observed that FGSM-AT presents a characteristic failure mode where the robustness of the model increases during the initial training epochs, but, at a certain point in training, the model loses all its robustness within the span of a few iterations. This is known as catastrophic overfitting (CO). They further observed that augmenting the FGSM attack with random noise seemed to mitigate CO. However, Andriushchenko & Flammarion (2020) showed that this method still leads to CO at larger $\epsilon$. Therefore, they proposed combining FGSM-AT with
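As a concrete illustration of the FGSM step above, the following sketch computes $\epsilon\,\mathrm{sign}(\nabla_x \mathcal{L})$ for a model whose input gradient is available in closed form. The binary logistic model used here is an illustrative assumption for exposition, not the deep networks studied in the paper:

```python
import numpy as np

def fgsm_perturbation(x, y, w, b, eps):
    """Single l_inf FGSM step for a binary logistic model
    f(x) = sigmoid(w.x + b) with binary cross-entropy loss.

    The input gradient of the loss is (sigmoid(w.x + b) - y) * w,
    so the linearised worst-case perturbation is eps * sign(grad).
    """
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))  # model probability
    grad_x = (p - y) * w                           # dL/dx for BCE loss
    return eps * np.sign(grad_x)

# Usage: one FGSM step of size eps = 8/255 on a toy 3-d input.
w = np.array([1.0, -2.0, 0.5])
b = 0.1
x = np.array([0.2, 0.4, -0.1])
delta = fgsm_perturbation(x, 1.0, w, b, 8 / 255)
```

For a multi-step attack such as PGD, this step would be repeated with a smaller step size and a projection back onto the $\epsilon$-ball after each iteration, which is exactly what makes PGD more expensive than FGSM.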



Figure 1: Left: Depiction of our modified dataset that injects simple, discriminative features. Right: Clean and robust performance after FGSM-AT on injected datasets D_β. We vary the strength of the synthetic features β (β = 0 corresponds to the original CIFAR-10) and the robustness budget ϵ (train and test). We observe that for ϵ ∈ {4/255, 6/255} our intervention can induce CO when the synthetic features have strength β slightly larger than ϵ, while training on the original data does not suffer from CO. Results are averaged over 3 seeds and shaded areas report minimum and maximum values.
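A minimal sketch of the kind of intervention depicted in Fig. 1: each image receives an additive, class-dependent pattern scaled by β. The `templates` array and the additive form are illustrative assumptions here; the exact construction of D_β is specified later in the paper.

```python
import numpy as np

def inject_features(images, labels, templates, beta):
    """Build an injected dataset D_beta: add a simple, highly
    discriminative class-dependent pattern to every image.

    images:    (N, d) flattened natural images in [0, 1]
    labels:    (N,) integer class labels in {0, ..., c-1}
    templates: (c, d) fixed per-class patterns (a hypothetical choice;
               any set of easily separable vectors works)
    beta:      injection strength; beta = 0 recovers the original data
    """
    injected = images + beta * templates[labels]
    return np.clip(injected, 0.0, 1.0)  # keep pixels in the valid range
```

With β = 0 the function returns the original images unchanged, matching the caption's note that β = 0 corresponds to the original CIFAR-10.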

