TOWARDS ROBUSTNESS AGAINST UNSUSPICIOUS ADVERSARIAL EXAMPLES

Abstract

Despite the remarkable success of deep neural networks, significant concerns have emerged about their robustness to adversarial perturbations of inputs. While most attacks aim to make perturbations imperceptible, physical perturbation attacks typically aim to be unsuspicious, even if perceptible. However, there is no universal notion of what it means for adversarial examples to be unsuspicious. We propose an approach for modeling suspiciousness by leveraging cognitive salience. Specifically, we split an image into foreground (salient region) and background (the rest), and allow significantly larger adversarial perturbations in the background, while ensuring that the cognitive salience of the background remains low. We describe how to compute the resulting dual-perturbation attacks on classifiers. We then experimentally demonstrate that our attacks indeed do not significantly change the perceptual salience of the background, but are highly effective against classifiers robust to conventional attacks. Furthermore, we show that adversarial training with dual-perturbation attacks yields classifiers that are more robust to these attacks than state-of-the-art robust learning approaches, and comparable in terms of robustness to conventional attacks.

1. INTRODUCTION

The observation by Szegedy et al. (2014) that state-of-the-art deep neural networks, despite their exceptional performance in image classification, are fragile in the face of small adversarial perturbations of inputs has received a great deal of attention. A series of approaches for designing adversarial examples followed (Szegedy et al., 2014; Goodfellow et al., 2015; Carlini & Wagner, 2017), along with methods for defending against them (Papernot et al., 2016b; Madry et al., 2018), and then new attacks that defeat prior defenses, and so on. Attacks can be roughly classified along three dimensions: 1) introducing small l_p-norm-bounded perturbations, with the goal of these being imperceptible to humans (Madry et al., 2018), 2) using non-l_p-based constraints that capture perceptibility (often called semantic perturbations) (Bhattad et al., 2020), and 3) modifying physical objects, such as stop signs (Eykholt et al., 2018), in a way that does not arouse suspicion. One of the most common motivations for the study of adversarial examples is safety and security, such as the potential for attackers to compromise the safety of autonomous vehicles that rely on computer vision (Eykholt et al., 2018). However, while imperceptibility is certainly sufficient for perturbations to be unsuspicious, it is far from necessary, as physical attacks demonstrate. On the other hand, while there are numerous formal definitions that capture whether noise is perceptible (Moosavi-Dezfooli et al., 2016; Carlini & Wagner, 2017), what makes adversarial examples suspicious has remained largely informal and subjective. We propose a simple formalization of an important aspect of what makes adversarial perturbations unsuspicious. Specifically, we make a distinction between image foreground and background, allowing significantly more noise in the background than in the foreground.
This idea stems from the notion of cognitive salience (Borji et al., 2015; Kümmerer et al., 2017; He & Pugeault, 2018), whereby an image can be partitioned into the two respective regions to reflect how much attention a human viewer pays to the different parts of the captured scene. In effect, we posit that perturbations in the foreground, when visible, will arouse significantly more suspicion (by being cognitively more salient) than perturbations made in the background.
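To make the foreground/background distinction concrete, the sketch below shows one plausible way to realize a dual-perturbation attack: a standard PGD-style ascent step whose perturbation is projected onto two different l-infinity budgets, a tight one on the salient foreground and a looser one on the background. This is an illustrative NumPy sketch, not the paper's exact formulation; the mask, budget values, step size, and the constant stand-in gradient are all assumptions.

```python
import numpy as np

def project_dual(delta, fg_mask, eps_fg, eps_bg):
    """Clip a perturbation to per-region L_inf budgets: a tight budget
    eps_fg on the salient foreground, a looser eps_bg on the background."""
    eps = np.where(fg_mask, eps_fg, eps_bg)  # per-pixel budget
    return np.clip(delta, -eps, eps)

def dual_pgd_step(x, delta, grad, fg_mask,
                  eps_fg=2/255, eps_bg=20/255, alpha=1/255):
    """One signed-gradient ascent step, then projection onto the dual
    L_inf ball and onto the valid pixel range [0, 1]."""
    delta = delta + alpha * np.sign(grad)
    delta = project_dual(delta, fg_mask, eps_fg, eps_bg)
    return np.clip(x + delta, 0.0, 1.0) - x

# Toy example: a 4x4 grayscale image whose left half is "foreground".
rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=(4, 4))
fg_mask = np.zeros((4, 4), dtype=bool)
fg_mask[:, :2] = True

delta = np.zeros_like(x)
for _ in range(50):                  # enough steps for budgets to saturate
    grad = np.ones_like(x)           # stand-in for a real loss gradient
    delta = dual_pgd_step(x, delta, grad, fg_mask)
```

After enough iterations the background perturbation saturates at its (much larger) budget while the foreground perturbation stays an order of magnitude smaller, which is exactly the asymmetry the suspiciousness model calls for; in the actual attack the mask would come from a salience model and the gradient from the classifier's loss.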

