TESTING ROBUSTNESS AGAINST UNFORESEEN ADVERSARIES

Abstract

Most existing adversarial defenses only measure robustness to Lp adversarial attacks. Yet adversaries are unlikely to restrict themselves to small Lp perturbations, and they are unlikely to remain fixed: adversaries adapt and evolve their attacks, so adversarial defenses must be robust to a broad range of unforeseen attacks. We address this discrepancy between research and reality by proposing a new evaluation framework called ImageNet-UA. Our framework enables the research community to test ImageNet model robustness against attacks not encountered during training. To create ImageNet-UA's diverse attack suite, we introduce four novel adversarial attacks. We also demonstrate that, in comparison to ImageNet-UA, prevailing L∞ robustness assessments give a narrow account of adversarial robustness. By evaluating current defenses with ImageNet-UA, we find they provide little robustness to unforeseen attacks. We hope the greater variety and realism of ImageNet-UA enables development of more robust defenses which can generalize beyond attacks seen during training.

1. INTRODUCTION

Neural networks perform well on many datasets (He et al., 2016) yet can be consistently fooled by minor adversarial distortions (Goodfellow et al., 2014). The research community has responded by quantifying robustness to such attacks and developing adversarial defenses (Madry et al., 2017), but these defenses and metrics have two key limitations. First, the vast majority of existing defenses exclusively defend against and quantify robustness to Lp-constrained attacks (Madry et al., 2017; Cohen et al., 2019; Raff et al., 2019; Xie et al., 2018). Though real-world adversaries are not Lp-constrained (Gilmer et al., 2018) and can attack with diverse distortions (Brown et al., 2017; Sharif et al., 2019), the literature largely ignores this and evaluates against the Lp adversaries already seen during training (Madry et al., 2017; Xie et al., 2018), resulting in optimistic robustness assessments. The attacks outside the Lp threat model that have been proposed (Song et al., 2018; Qiu et al., 2019; Engstrom et al., 2017; Evtimov et al., 2017; Sharif et al., 2016) are not intended for general defense evaluation and suffer from narrow dataset applicability, difficulty of optimization, or fragility of auxiliary generative models. Second, existing defenses assume that attacks are known in advance (Goodfellow, 2019) and use knowledge of their explicit form during training (Madry et al., 2017). In practice, adversaries can deploy unforeseen attacks not known to the defense creator. For example, online advertisers use attacks such as perturbed pixels in ads to defeat ad blockers trained only on the previous generation of ads in an ever-escalating arms race (Tramèr et al., 2018). However, current evaluation setups implicitly assume that attacks encountered at test-time are the same as those seen at train-time, which is unrealistic.
The reality that future attacks are unlike those encountered during training is akin to a train-test distribution mismatch, a problem studied outside of adversarial robustness (Recht et al., 2019; Hendrycks & Dietterich, 2019) but now brought to the adversarial setting. The present work addresses these limitations by proposing ImageNet-UA, an evaluation framework that measures robustness against unforeseen attacks. ImageNet-UA assesses a defense, which may have been created with knowledge of the commonly used L∞ or L2 attacks, with six diverse attacks (four of which are novel) distinct from L∞ and L2. We intend these attacks to be used at test-time only and not during training. Performing well on ImageNet-UA thus demonstrates generalization to a diverse set of distortions not seen during defense creation.

Figure 1: Adversarially distorted chow chow dog images created with old attacks (L∞, L2, L1, Elastic) and our new attacks (JPEG, Fog, Snow, Gabor). The JPEG, Fog, Snow, and Gabor adversarial attacks are visually distinct from previous attacks, produce distortions which do not obey a small Lp norm constraint, and serve as unforeseen attacks for the ImageNet-UA attack suite.

While ImageNet-UA does not provide an exhaustive guarantee over all conceivable attacks, it evaluates over a diverse unforeseen test distribution similar to those used successfully in other studies of distributional shift (Rajpurkar et al., 2018; Hendrycks & Dietterich, 2019; Recht et al., 2019). ImageNet-UA works for ImageNet models and can be easily used with our code, available at https://github.com/anon-submission-2020/anon-submission-2020. Designing ImageNet-UA requires new attacks that are strong and varied, since real-world attacks are diverse in structure. To meet this challenge, we contribute four novel, diverse, and easily optimized adversarial attacks.
Our new attacks produce distortions with occlusions, spatial similarity, and simulated weather, all of which are absent from previous attacks. Performing well on ImageNet-UA thus demonstrates that a defense generalizes to a diverse set of distortions distinct from the commonly used L∞ or L2.

With ImageNet-UA, we show weaknesses in existing evaluation practices and defenses through a study of 8 attacks against 48 models adversarially trained on ImageNet-100, a 100-class subset of ImageNet. While most adversarial robustness evaluations use only L∞ attacks, ImageNet-UA reveals that models with high L∞ attack robustness can remain susceptible to other attacks. Thus, L∞ evaluations are a narrow measure of robustness, even though much of the literature treats this evaluation as comprehensive (Madry et al., 2017; Qian & Wegman, 2019; Schott et al., 2019; Zhang et al., 2019). We address this deficiency by using the novel attacks in ImageNet-UA to evaluate robustness to a more diverse set of unforeseen attacks. Our results demonstrate that L∞ adversarial training, the current state-of-the-art defense, generalizes poorly to unforeseen adversaries and is not easily improved by training against more attacks. This adds to the evidence that achieving robustness against a few train-time attacks is insufficient to impart robustness to unforeseen test-time attacks (Jacobsen et al., 2019; Jordan et al., 2019; Tramèr & Boneh, 2019).

In summary, we propose the framework ImageNet-UA to measure robustness to a diverse set of attacks, made possible by our four new adversarial attacks. Since existing defenses scale poorly to multiple attacks (Jordan et al., 2019; Tramèr & Boneh, 2019), finding defense techniques which generalize to unforeseen attacks is crucial for creating robust models. We suggest ImageNet-UA as a way to measure progress toward this goal.
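The evaluation protocol described above, holding a defense fixed and scoring it against a suite of attacks it was not trained on, can be sketched as follows. This is a toy illustration only, not the ImageNet-UA implementation: the linear `predict` model and the `attack_suite` entries (additive noise, random occlusion, a global shift) are stand-ins we assume for demonstration, not the paper's actual JPEG, Fog, Snow, or Gabor attacks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for a trained classifier: a fixed linear model on 5-dim inputs.
W = rng.normal(size=(5, 3))

def predict(x):
    return (x @ W).argmax(axis=1)

x = rng.normal(size=(300, 5))
y = predict(x)  # label with the clean model itself, so clean accuracy is 1.0

# Held-out distortion suite: toy stand-ins for unforeseen attacks.
# Each maps clean inputs to distorted inputs of the same shape.
attack_suite = {
    "noise":   lambda x: x + 0.5 * rng.normal(size=x.shape),           # additive noise
    "occlude": lambda x: np.where(rng.random(x.shape) < 0.3, 0.0, x),  # random masking
    "shift":   lambda x: x + 1.0,                                      # global shift
}

def accuracy(x_adv, y):
    return float((predict(x_adv) == y).mean())

per_attack = {name: accuracy(atk(x), y) for name, atk in attack_suite.items()}
# A defense is only as robust as its weakest unforeseen case.
worst_case = min(per_attack.values())
```

Reporting the worst case over the suite, rather than accuracy against a single familiar attack, is what distinguishes this style of evaluation from the usual L∞-only measurement.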

2. RELATED WORK

Adversarial robustness is notoriously difficult to correctly evaluate (Papernot et al., 2017; Athalye et al., 2018a) . To that end, Carlini et al. (2019a) provide extensive guidance for sound adversarial robustness evaluation. By measuring attack success rates across several distortion sizes and using a broader threat model with diverse differentiable attacks, ImageNet-UA has several of their recommendations built-in, while greatly expanding the set of attacks over previous work on evaluation.
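The recommendation to measure attack success across several distortion sizes, rather than at a single budget, can be sketched as follows. This is a hedged toy example, not code from ImageNet-UA or Carlini et al.: a linear softmax model with an analytic gradient and a one-step L∞ (FGSM-style) attack stand in for a real network and a real attack suite.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear softmax classifier (stand-in for a trained network).
W = rng.normal(size=(10, 2))

def logits(x):
    return x @ W

def fgsm(x, y, eps):
    # One-step L-infinity attack: step along the sign of the cross-entropy
    # gradient w.r.t. the input; for a linear model this is (p - onehot(y)) @ W.T.
    z = logits(x)
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = (p - np.eye(2)[y]) @ W.T
    return x + eps * np.sign(grad)

def accuracy(x, y):
    return float((logits(x).argmax(axis=1) == y).mean())

x = rng.normal(size=(200, 10))
y = logits(x).argmax(axis=1)  # labels from the clean model: clean accuracy 1.0

# Report accuracy at several distortion sizes rather than a single epsilon.
curve = {eps: accuracy(fgsm(x, y, eps), y) for eps in (0.0, 0.05, 0.2, 1.0)}
```

Plotting such accuracy-versus-distortion curves for each attack gives a far fuller picture than one number at one epsilon, which is the spirit of the evaluation guidance cited above.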

