WHY ADVERSARIAL TRAINING CAN HURT ROBUST ACCURACY

Abstract

Machine learning classifiers with high test accuracy often perform poorly under adversarial perturbations. It is commonly believed that adversarial training alleviates this issue. In this paper, we demonstrate that, surprisingly, the opposite can be true for a natural class of perceptible perturbations: even though adversarial training helps when enough data is available, it may in fact hurt robust generalization in the small sample size regime. We first prove this phenomenon for a high-dimensional linear classification setting with noiseless observations. Using intuitive insights from the proof, we then find perturbations on standard image datasets for which this behavior persists. Specifically, it occurs for perceptible perturbations that effectively reduce class information, such as object occlusions or corruptions.

1. INTRODUCTION

In contrast to imperceptible attacks, perturbations that occur naturally and are physically realizable are commonly perceptible. Some perceptible perturbations specifically target the object to be recognized: these include occlusions (e.g. stickers placed on traffic signs (Eykholt et al., 2018) or masks of different sizes that cover important features of human faces (Wu et al., 2020)) or corruptions caused by the image-capturing process (animals that move faster than the shutter speed or objects that are not well-lit, see Figure 2). Others transform the whole image and are not confined to the object itself, such as rotations, translations or corruptions (Engstrom et al., 2019; Kang et al., 2019). In this paper, we refer to such perceptible attacks as directed attacks. In contrast to other attacks, they effectively reduce useful class information in the input for any model, without necessarily changing the true label: we say that they are directed and consistent, as formally defined in Section 2. For example, a stop sign with a small sticker could be partially covered without losing its semantic meaning. Similarly, a flying bird captured with a long exposure time can induce motion blur in the final image without becoming unrecognizable to the observer.
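As a concrete illustration, an occlusion of the kind described above can be modeled as overwriting a small patch of the input with uninformative pixel values. The following is a minimal sketch (not code from the paper; the function name and parameters are our own illustrative choices) of such a directed perturbation on a toy image array:

```python
import numpy as np

def occlude(image, top, left, size, fill=0.0):
    """Return a copy of `image` with a size-by-size square patch
    overwritten by `fill`, simulating e.g. a sticker or a mask that
    removes class information from the occluded region."""
    out = image.copy()
    out[top:top + size, left:left + size] = fill
    return out

# Toy 32x32 all-ones "image": the occlusion zeroes out an 8x8 patch,
# destroying the information in those 64 pixels for any downstream model.
img = np.ones((32, 32))
occ = occlude(img, top=4, left=4, size=8)
print(occ.sum())  # 32*32 - 8*8 = 960.0
```

Unlike a norm-bounded perturbation, such an attack is clearly visible, yet it is consistent: the rest of the image still determines the true label.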



Figure 1: On the Waterbirds dataset attacked by the adversarial illumination attack, adversarial training (yellow) yields higher robust error than standard training (blue) when the sample size is small, even though it helps for large sample sizes and in a setting where the standard error of standard training is small (see App. D for details).

Today's best-performing classifiers are vulnerable to adversarial attacks (Goodfellow et al., 2015; Szegedy et al., 2014) and exhibit high robust error: for many inputs, their predictions change under adversarial perturbations, even though the true class stays the same. Such content-preserving (Gilmer et al., 2018), consistent (Raghunathan et al., 2020) attacks can be either perceptible or imperceptible. For image datasets, most work to date studies imperceptible attacks that are based on perturbations with limited strength or attack budget. These include bounded ℓp-norm perturbations (Goodfellow et al., 2015; Madry et al., 2018; Moosavi-Dezfooli et al., 2016), small transformations using image processing techniques (Ghiasi et al., 2019; Zhao et al., 2020; Laidlaw et al., 2021; Luo et al., 2018) or nearby samples on the data manifold (Lin et al., 2020; Zhou et al., 2020). Even though by definition they do not visibly change the image, imperceptible attacks can often successfully fool a learned classifier.
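For a linear classifier, the worst-case ℓ∞-bounded perturbation mentioned above has a closed form: each coordinate is shifted by the budget ε in the direction that decreases the prediction margin. The sketch below (our own illustrative example, not the paper's experimental setup) makes this concrete:

```python
import numpy as np

def linf_attack(x, w, b, y, eps):
    """Worst-case l_inf perturbation of budget eps against the linear
    classifier sign(w.x + b): shift every coordinate by eps in the
    direction that most decreases the margin y * (w.x + b)."""
    delta = -eps * y * np.sign(w)
    return x + delta

w = np.array([1.0, -2.0, 0.5])
b = 0.0
x = np.array([1.0, -1.0, 1.0])
y = 1                                  # clean margin: y*(w.x + b) = 3.5
x_adv = linf_attack(x, w, b, y, eps=0.5)
# The margin drops by exactly eps * ||w||_1 = 0.5 * 3.5 = 1.75.
print(y * (w @ x_adv + b))             # 1.75
```

With a small ε the change to x is visually negligible, yet the margin reduction is enough to flip predictions of inputs that lie close to the decision boundary.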

