WHY ADVERSARIAL TRAINING CAN HURT ROBUST ACCURACY

Abstract

Machine learning classifiers with high test accuracy often perform poorly under adversarial perturbations. It is commonly believed that adversarial training alleviates this issue. In this paper, we demonstrate that, surprisingly, the opposite can be true for a natural class of perceptible perturbations: even though adversarial training helps when enough data is available, it may in fact hurt robust generalization in the small sample size regime. We first prove this phenomenon for a high-dimensional linear classification setting with noiseless observations. Guided by intuition from the proof, we identify perturbations on standard image datasets for which this behavior persists. Specifically, it occurs for perceptible perturbations that effectively reduce class information, such as object occlusions or corruptions.

1. INTRODUCTION

Today's best-performing classifiers are vulnerable to adversarial attacks (Goodfellow et al., 2015; Szegedy et al., 2014) and exhibit high robust error: for many inputs, their predictions change under adversarial perturbations, even though the true class stays the same. Such content-preserving (Gilmer et al., 2018), consistent (Raghunathan et al., 2020) attacks can be either perceptible or imperceptible. For image datasets, most work to date studies imperceptible attacks that are based on perturbations with limited strength or attack budget. These include bounded ℓ_p-norm perturbations (Goodfellow et al., 2015; Madry et al., 2018; Moosavi-Dezfooli et al., 2016), small transformations using image processing techniques (Ghiasi et al., 2019; Zhao et al., 2020; Laidlaw et al., 2021; Luo et al., 2018) and nearby samples on the data manifold (Lin et al., 2020; Zhou et al., 2020). Even though by definition they do not visibly change the image, imperceptible attacks can often successfully fool a learned classifier. On the other hand, perturbations that occur naturally and are physically realizable are commonly perceptible. Some perceptible perturbations specifically target the object to be recognized: these include occlusions (e.g., stickers placed on traffic signs (Eykholt et al., 2018) or masks of different sizes that cover important features of human faces (Wu et al., 2020)) and corruptions caused by the image capturing process (animals that move faster than the shutter speed, or objects that are not well lit; see Figure 2). Others transform the whole image and are not confined to the object itself, such as rotations, translations or corruptions (Engstrom et al., 2019; Kang et al., 2019). In this paper, we refer to such perceptible attacks as directed attacks. In contrast to other attacks, they effectively reduce useful class information in the input for any model, without necessarily changing the true label; we therefore call them directed and consistent, as defined more formally in Section 2. For example, a stop sign with a small sticker could partially cover the text without losing its semantic meaning. Similarly, a flying bird captured with a long exposure time can induce motion blur in the final image without becoming unrecognizable to the observer.
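As a concrete illustration of a directed attack, the sketch below implements an exhaustive occlusion attack: it tries every placement of a small black patch and keeps the position that most reduces the model's score for the true label. The grayscale input format and the `model_score` callable are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def occlusion_attack(image, model_score, label, patch=2):
    """Exhaustive directed occlusion attack: place a black patch of size
    patch x patch at every location and return the occluded image that
    most reduces the model's score for the true label.

    image: (H, W) grayscale array in [0, 1].
    model_score: callable (image, label) -> score of `label`
    (hypothetical interface for this sketch)."""
    H, W = image.shape
    worst, worst_score = image, model_score(image, label)
    for i in range(H - patch + 1):
        for j in range(W - patch + 1):
            x = image.copy()
            x[i:i + patch, j:j + patch] = 0.0  # black occlusion mask
            s = model_score(x, label)
            if s < worst_score:
                worst, worst_score = x, s
    return worst
```

Because the attack only removes information (it blackens a few pixels) and leaves the object recognizable, it is consistent in the sense used above, yet directed at the class evidence the model relies on.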
In contrast, we show that adversarial training not only increases standard error (Zhang et al., 2019; Tsipras et al., 2019; Stutz et al., 2019; Raghunathan et al., 2020) but, surprisingly, in the low sample size regime it may even increase the robust error compared to standard training! Figure 1 illustrates the main message of our paper on the Waterbirds dataset: although adversarial training with directed attacks outperforms standard training when enough training samples are available, it is inferior when the sample size is small (but still large enough to obtain a small standard test error). Our contributions are as follows:

• We prove that, almost surely, adversarially training a linear classifier on separable data yields a robust error that increases monotonically with the perturbation budget. We further establish high-probability non-asymptotic lower bounds on the robust error gap between adversarial and standard training.

• Our proof provides intuition for why this lower bound on the gap is particularly large for directed attacks in the low sample size regime.

• We observe empirically that this behavior persists for different directed attacks on real-world image datasets: adversarial training for directed attacks hurts robust accuracy when the sample size is small.
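To make the linear setting tangible, the toy sketch below trains a linear classifier with a robust logistic loss under an ℓ∞ attack, for which the inner maximization has a closed form: the worst-case perturbation of budget ε shrinks the margin by ε‖w‖₁. The modeling choices here (logistic loss, plain gradient descent) are illustrative assumptions, not the paper's exact max-margin analysis; setting eps=0 recovers standard training.

```python
import numpy as np

def robust_err(w, X, y, eps):
    # l-inf robust error of a linear classifier: the worst-case
    # perturbation of budget eps shrinks every margin by eps * ||w||_1.
    margins = y * (X @ w) - eps * np.abs(w).sum()
    return float((margins <= 0).mean())

def train(X, y, eps, steps=2000, lr=0.1):
    # Gradient descent on the robust logistic loss
    # log(1 + exp(-(y <x, w> - eps * ||w||_1))); eps=0 is standard training.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        m = y * (X @ w) - eps * np.abs(w).sum()
        p = 1.0 / (1.0 + np.exp(m))  # = -d(loss)/d(margin)
        grad = -(X * (p * y)[:, None]).mean(0) + eps * np.sign(w) * p.mean()
        w -= lr * grad
    return w
```

Sweeping the training budget eps in such a toy setup and comparing robust_err at a fixed test budget is one way to probe, at small sample sizes, the gap between adversarial and standard training that the theory describes.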

2. ROBUST CLASSIFICATION

We first introduce our robust classification setting more formally by defining the notions of adversarial robustness, directed attacks and adversarial training used throughout the paper.

Adversarially robust classifiers. For inputs x ∈ R^d, we consider multi-class classifiers associated with parameterized functions f_θ : R^d → R^K if K > 2 and f_θ : R^d → R if K = 2, where K is the number of labels. For example, f_θ(x) could be a linear model (as in Section 3) or a neural network (as in Section 4). The output label predictions are obtained by h(f_θ(x)) = sign(f_θ(x)) for K = 2 and h(f_θ(x)) = argmax_{k ∈ {1,...,K}} [f_θ(x)]_k for K > 2. For practitioners to trust machine learning models in the wild, it is key to demonstrate that the models are robust: their predictions should not change when the input is subject to consistent perturbations, i.e., small class-preserving perturbations. Mathematically speaking, for the underlying joint data distribution P, the model should have a small ε_te-robust error, defined as

Err(θ; ε_te) := E_{(x,y)~P} max_{x' ∈ T(x; ε_te)} ℓ(f_θ(x'), y),

where ℓ(f_θ(x'), y) is 0 if the class determined by h(f_θ(x')) equals y and 1 otherwise. Further, T(x; ε_te) denotes a perturbation set around x of a certain transformation type with size ε_te.
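For a finite perturbation set T(x; ε_te) and an empirical sample, the robust error above can be estimated directly by taking the worst case over each input's perturbation set. A minimal sketch, with `model` and `T` as hypothetical placeholders for h(f_θ(·)) and the perturbation set:

```python
import numpy as np

def empirical_robust_error(model, T, X, y):
    """Empirical version of Err(theta; eps_te): average over the sample
    of the 0-1 loss maximised over each input's perturbation set.

    model(x) returns a predicted label; T(x) yields a finite list of
    perturbed copies of x, including x itself (both are placeholders)."""
    errs = []
    for x, label in zip(X, y):
        # 0-1 loss, maximised over the perturbation set T(x)
        errs.append(max(0 if model(xp) == label else 1 for xp in T(x)))
    return float(np.mean(errs))
```

Note that including x itself in T(x) makes the robust error an upper bound on the standard error, matching the convention that the standard error is the ε_te = 0 case.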



Figure 1: On the Waterbirds dataset attacked by the adversarial illumination attack, adversarial training (yellow) yields higher robust error than standard training (blue) when the sample size is small, even though it helps for large sample sizes and in a setting where the standard error of standard training is small (see App. D for details).

Figure 2: Examples of directed attacks on CIFAR10 and the Waterbirds dataset. In Figure 2a, we corrupt the image with a black mask of size 2 × 2; in Figures 2c and 2d, we change the lighting conditions (darkening) and apply motion blur to the bird in the image, respectively. All perturbations reduce the information about the class in the images: they are the result of directed attacks. (e) Directed attacks are a subset of perceptible attacks.

In the literature so far, it is widely acknowledged that adversarial training with the same perturbation type and budget as used at test time often achieves significantly lower robust error than standard training (Madry et al., 2018; Zhang et al., 2019; Bai et al., 2021).
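Adversarial training replaces each training sample by its worst-case perturbation before taking a gradient step. The sketch below shows one such epoch for a linear classifier under an ℓ∞ attack, where the inner maximization is solved in closed form (x_adv = x − ε·y·sign(w)); this is a toy stand-in, not the networks and directed attacks used in the experiments.

```python
import numpy as np

def adv_train_epoch(w, X, y, eps, lr=0.1):
    """One epoch of adversarial training for a linear classifier with
    logistic loss under an l-inf attack of budget eps (toy sketch)."""
    for x, yi in zip(X, y):
        # Inner maximization in closed form for a linear model: push
        # each coordinate against the label by eps along sign(w).
        x_adv = x - eps * yi * np.sign(w)
        m = yi * (x_adv @ w)
        p = 1.0 / (1.0 + np.exp(m))  # logistic loss gradient factor
        w = w + lr * p * yi * x_adv  # SGD step on log(1 + exp(-m))
    return w
```

With eps set to the test-time budget, this is the standard "train on the attack you will be evaluated on" recipe that the cited works report as effective; the paper's point is that this recipe can backfire for directed attacks at small sample sizes.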

