TOWARDS ROBUSTNESS AGAINST UNSUSPICIOUS ADVERSARIAL EXAMPLES

Abstract

Despite the remarkable success of deep neural networks, significant concerns have emerged about their robustness to adversarial perturbations of inputs. While most attacks aim to ensure that these perturbations are imperceptible, physical perturbation attacks typically aim to be unsuspicious, even if perceptible. However, there is no universal notion of what it means for adversarial examples to be unsuspicious. We propose an approach for modeling suspiciousness by leveraging cognitive salience. Specifically, we split an image into foreground (salient region) and background (the rest), and allow significantly larger adversarial perturbations in the background, while ensuring that the cognitive salience of the background remains low. We describe how to compute the resulting dual-perturbation attacks on classifiers. We then experimentally demonstrate that our attacks indeed do not significantly change the perceptual salience of the background, but are highly effective against classifiers robust to conventional attacks. Furthermore, we show that adversarial training with dual-perturbation attacks yields classifiers that are more robust to these attacks than state-of-the-art robust learning approaches, and comparable in terms of robustness to conventional attacks.

1. INTRODUCTION

The observation by Szegedy et al. (2014) that state-of-the-art deep neural networks, despite exceptional performance in image classification, are fragile in the face of small adversarial perturbations of inputs has received a great deal of attention. A series of approaches for designing adversarial examples followed (Szegedy et al., 2014; Goodfellow et al., 2015; Carlini & Wagner, 2017), along with methods for defending against them (Papernot et al., 2016b; Madry et al., 2018), and then new attacks that defeat prior defenses, and so on. Attacks can be roughly classified along three dimensions: 1) introducing small ℓp-norm-bounded perturbations, with the goal of these being imperceptible to humans (Madry et al., 2018), 2) using non-ℓp-based constraints that capture perceptibility (often called semantic perturbations) (Bhattad et al., 2020), and 3) modifying physical objects, such as stop signs (Eykholt et al., 2018), in a way that does not arouse suspicion. One of the most common motivations for the study of adversarial examples is safety and security, such as the potential for attackers to compromise the safety of autonomous vehicles that rely on computer vision (Eykholt et al., 2018). However, while imperceptibility is certainly sufficient for perturbations to be unsuspicious, it is far from necessary, as physical attacks demonstrate. On the other hand, while there are numerous formal definitions that capture whether noise is perceptible (Moosavi-Dezfooli et al., 2016; Carlini & Wagner, 2017), what makes adversarial examples suspicious has remained largely informal and subjective. We propose a simple formalization of an important aspect of what makes adversarial perturbations unsuspicious. Specifically, we make a distinction between image foreground and background, allowing significantly more noise in the background than in the foreground.
This idea stems from the notion of cognitive salience (Borji et al., 2015; Kmmerer et al., 2017; He & Pugeault, 2018) , whereby an image can be partitioned into the two respective regions to reflect how much attention a human viewer pays to the different parts of the captured scene. In effect, we posit that perturbations in the foreground, when visible, will arouse significantly more suspicion (by being cognitively more salient) than perturbations made in the background.


Our first contribution is a formal model of such dual-perturbation attacks, which generalizes ℓp-norm-bounded attack models (see, e.g., Figure 1) but explicitly aims to ensure that the adversarial perturbation does not make the background highly salient. Second, we propose an algorithm for finding adversarial examples in this model, an adaptation of the PGD attack (Madry et al., 2018). Third, we present a method for defending against dual-perturbation attacks based on the adversarial training framework (Madry et al., 2018). Finally, we present an extensive experimental study demonstrating that (a) the proposed attacks are significantly stronger than PGD, successfully defeating all state-of-the-art defenses, (b) defenses using our attack model significantly outperform state-of-the-art alternatives, with relatively small performance degradation on non-adversarial instances, and (c) our defenses are comparable to, or better than, alternatives even against traditional attacks such as PGD.

2. RELATED WORK

A large body of work studies the generation of adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015; Papernot et al., 2016a; Moosavi-Dezfooli et al., 2016; Carlini & Wagner, 2017). These approaches commonly generate adversarial perturbations within a bounded ℓp norm so that the perturbations are imperceptible. A related thread has considered the problem of generating adversarial examples that are semantically imperceptible without being small in norm (Brown et al., 2018; Bhattad et al., 2020), for example, through small perturbations to the color scheme. However, none of these account for the perceptual distinction between the foreground and background of images. Numerous approaches have been proposed for defending neural networks against adversarial examples (Papernot et al., 2016b; Carlini & Wagner, 2017; Madry et al., 2018; Cohen et al., 2019; Raghunathan et al., 2018).
Predominantly, these use ℓp-bounded perturbations as the threat model, and while some account for semantic perturbations (e.g., Mohapatra et al. (2020)), none consider the perceptually important difference in suspiciousness between foreground and background.

2.1. ADVERSARIAL EXAMPLES AND ATTACKS

The problem of generating adversarial examples is commonly modeled as follows. We are given a learned model hθ(·), parameterized by θ, which maps an input x to a k-dimensional prediction, where k is the number of classes being predicted. The final predicted class is yp = arg max_i hθ(x)_i, where hθ(x)_i is the ith element of hθ(x). Now, consider an input x along with a correct label y. The problem of identifying an adversarial example for x can be captured by

    max_δ L(hθ(x + δ), y)   subject to   ‖δ‖_p ≤ ε,

where L is the loss used to train the classifier (e.g., cross-entropy) and ε bounds the magnitude of the perturbation.

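As an illustration, this optimization is commonly approximated with projected gradient descent (PGD). Below is a minimal NumPy sketch on a toy linear softmax classifier; the model, step size, and budget are illustrative assumptions, not the setup used in our experiments:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(W, x, y):
    return -np.log(softmax(W @ x)[y])

def pgd_attack(W, x, y, eps=0.1, alpha=0.02, steps=10):
    """Untargeted l_inf PGD against a linear softmax classifier."""
    x_adv = x.copy()
    for _ in range(steps):
        p = softmax(W @ x_adv)
        p[y] -= 1.0                      # dL/dlogits for cross-entropy
        grad = W.T @ p                   # dL/dx
        x_adv = x_adv + alpha * np.sign(grad)     # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay in valid pixel range
    return x_adv
```

Each iteration takes a signed-gradient ascent step on the loss and projects the result back into the ℓ∞ ball of radius ε around the original input.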
Figure 1: An illustration of dual-perturbation attacks. Adversarial examples are generated with large ℓ∞ perturbations on the background (εB = 20/255) and small ℓ∞ perturbations on the foreground (εF = 4/255). A parameter λ is used to control background salience explicitly; a larger λ results in a less salient background under the same magnitude of perturbation.
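A dual-perturbation iteration differs from standard PGD mainly in its projection step, which enforces separate foreground and background budgets. The following NumPy sketch illustrates that projection; the mask and budget values are illustrative assumptions, and the λ-weighted salience term is omitted:

```python
import numpy as np

def dual_projection(x0, x_adv, fg_mask, eps_f=4/255, eps_b=20/255):
    """Project x_adv into the dual-perturbation feasible set: an l_inf ball
    of radius eps_f on the foreground and eps_b on the background of x0."""
    eps = np.where(fg_mask, eps_f, eps_b)        # per-pixel budget from the mask
    x_adv = np.clip(x_adv, x0 - eps, x0 + eps)   # enforce the two budgets
    return np.clip(x_adv, 0.0, 1.0)              # keep valid pixel intensities
```

A full attack step would first take a signed-gradient ascent step on the classifier loss (optionally penalized by λ times a background-salience score) and then apply this projection.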

Two recent approaches, by Vaishnavi et al. (2019) and Brama & Grinshpoun (2020), have the strongest conceptual connection to our work. Both are defense-focused, either eliminating (Vaishnavi et al., 2019) or blurring (Brama & Grinshpoun, 2020) the background region for robustness. However, both assume that an image can be reliably segmented at prediction time, leaving them vulnerable to attacks on image segmentation (Arnab et al., 2018). Xiao et al. (2020) propose to disentangle foreground and background signals in images, but the unsuspiciousness of their attacks is not ensured.

