TESTING ROBUSTNESS AGAINST UNFORESEEN ADVERSARIES

Abstract

Most existing adversarial defenses only measure robustness to L_p adversarial attacks. Not only are adversaries unlikely to restrict themselves to small L_p perturbations; they are also unlikely to remain fixed. Adversaries adapt and evolve their attacks, so adversarial defenses must be robust to a broad range of unforeseen attacks. We address this discrepancy between research and reality by proposing a new evaluation framework called ImageNet-UA. Our framework enables the research community to test ImageNet model robustness against attacks not encountered during training. To create ImageNet-UA's diverse attack suite, we introduce a total of four novel adversarial attacks. We also demonstrate that, in comparison to ImageNet-UA, prevailing L_∞ robustness assessments give a narrow account of adversarial robustness. By evaluating current defenses with ImageNet-UA, we find they provide little robustness to unforeseen attacks. We hope the greater variety and realism of ImageNet-UA enables development of more robust defenses which can generalize beyond attacks seen during training.

1. INTRODUCTION

Neural networks perform well on many datasets (He et al., 2016) yet can be consistently fooled by minor adversarial distortions (Goodfellow et al., 2014). The research community has responded by quantifying and developing adversarial defenses against such attacks (Madry et al., 2017), but these defenses and metrics have two key limitations.

First, the vast majority of existing defenses exclusively defend against and quantify robustness to L_p-constrained attacks (Madry et al., 2017; Cohen et al., 2019; Raff et al., 2019; Xie et al., 2018). Though real-world adversaries are not L_p-constrained (Gilmer et al., 2018) and can attack with diverse distortions (Brown et al., 2017; Sharif et al., 2019), the literature largely ignores this and evaluates against the L_p adversaries already seen during training (Madry et al., 2017; Xie et al., 2018), resulting in optimistic robustness assessments. The attacks outside the L_p threat model that have been proposed (Song et al., 2018; Qiu et al., 2019; Engstrom et al., 2017; Evtimov et al., 2017; Sharif et al., 2016) are not intended for general defense evaluation and suffer from narrow dataset applicability, difficulty of optimization, or fragility of auxiliary generative models.

Second, existing defenses assume that attacks are known in advance (Goodfellow, 2019) and use knowledge of their explicit form during training (Madry et al., 2017). In practice, adversaries can deploy unforeseen attacks not known to the defense creator. For example, online advertisers use attacks such as perturbed pixels in ads to defeat ad blockers trained only on the previous generation of ads in an ever-escalating arms race (Tramèr et al., 2018). However, current evaluation setups implicitly assume that attacks encountered at test-time are the same as those seen at train-time, which is unrealistic.
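To make the L_p threat model concrete, the sketch below shows projected gradient descent (PGD) under an L_∞ constraint, the attack family most defenses in the literature train against. It attacks a toy linear classifier with an analytic logistic-loss gradient rather than a neural network; all names and parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def pgd_linf(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """L_inf-constrained PGD against a linear binary classifier with
    logistic loss (a toy stand-in for a network).

    Repeatedly takes a signed gradient-ascent step on the loss, then
    projects back into the L_inf ball of radius eps around the input x.
    """
    x_adv = x.copy()
    for _ in range(steps):
        z = x_adv @ w + b                          # logit
        p = 1.0 / (1.0 + np.exp(-z))               # sigmoid probability
        grad = (p - y) * w                         # d(logistic loss)/dx
        x_adv = x_adv + alpha * np.sign(grad)      # ascent on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project to L_inf ball
    return x_adv
```

For a correctly classified point, a few iterations typically flip the prediction while keeping every coordinate within eps of the original input, which is exactly the kind of small, norm-bounded distortion that ImageNet-UA argues is too narrow a test of robustness.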
The reality that future attacks are unlike those encountered during training is akin to a train-test distribution mismatch, a problem studied outside of adversarial robustness (Recht et al., 2019; Hendrycks & Dietterich, 2019) but now brought to the adversarial setting. The present work addresses these limitations by proposing an evaluation framework, ImageNet-UA, to measure robustness against unforeseen attacks. ImageNet-UA assesses a defense, which may have been created with knowledge of the commonly used L_∞ or L_2 attacks, with six diverse attacks (four of which are novel) distinct from L_∞ or L_2. We intend these attacks to be used at test-time only and not during training. Performing well on ImageNet-UA thus demonstrates generalization to a diverse set of distortions not seen during defense creation. While ImageNet-UA

