NON-ROBUST FEATURES THROUGH THE LENS OF UNIVERSAL PERTURBATIONS

Abstract

Recent work ties adversarial examples to the existence of non-robust features: features that are susceptible to small perturbations and believed to be unintelligible to humans, yet still useful for prediction. We study universal adversarial perturbations and demonstrate that the above picture is more nuanced. Specifically, even though universal perturbations, like standard adversarial perturbations, do leverage non-robust features, these features tend to be fundamentally different from the "standard" ones and, in particular, non-trivially human-aligned. Namely, universal perturbations have more human-aligned locality and spatial invariance properties. However, we also show that these human-aligned non-robust features carry much less predictive signal than general non-robust features. Our findings thus take a step towards improving our understanding of these previously unintelligible features.

1. INTRODUCTION

Modern deep neural networks perform extremely well across many prediction tasks, but they largely remain vulnerable to adversarial examples (Szegedy et al., 2014). Models' brittleness to these small, imperceptible perturbations highlights one alarming way in which models deviate from humans. Recent work gives evidence that this deviation is due to the presence of useful non-robust features in our datasets (Ilyas et al., 2019). These are brittle features that are sensitive to small perturbations, too small to be noticeable to humans, yet capture enough predictive signal to generalize well on the underlying classification task. When models rely on non-robust features, they become vulnerable to adversarial examples, as even small perturbations can flip the features' signal.

While prior work gives evidence of non-robust features in natural datasets, we lack a more fine-grained understanding of their properties. In general, we do not understand well how models make decisions, so it is unclear how much we can understand about these features that are believed to be imperceptible. A number of works suggest that these features may exploit certain properties of the dataset that are misaligned with human perception (Ilyas et al., 2019), such as high-frequency information (Yin et al., 2019), but much remains unknown.

In this work, we illustrate how we can isolate more human-aligned non-robust features by imposing additional constraints on adversarial perturbations. In particular, we revisit universal adversarial perturbations (Moosavi-Dezfooli et al., 2017a), i.e., adversarial perturbations that generalize across many inputs. Prior works have observed that these perturbations appear to be semantic (Hayes & Danezis, 2019; Khrulkov & Oseledets, 2018; Liu et al., 2019). We demonstrate that universal perturbations possess additional human-aligned properties that distinguish them from standard adversarial perturbations, and analyze the non-robust features leveraged by these perturbations.
Concretely, our findings are:

Universal perturbations have more human-aligned properties. We show that universal adversarial perturbations have additional human-aligned properties that distinguish them from standard adversarial perturbations (e.g., Figure 1). In particular, (1) the most semantically identifiable local patches inside universal perturbations also contain the most signal; and (2) universal perturbations are approximately spatially invariant, in that they remain effective after translations.

Non-robust features can be semantically meaningful. We show that universal perturbations primarily rely on non-robust features rather than robust ones. Specifically, we compare the sensitivity of natural and (adversarially) robust models to rescalings of these perturbations to demonstrate that universal perturbations likely rely on non-robust features. Together with our first finding, this shows that some non-robust features can be human-aligned.

Universal perturbations contain less non-robust signal. We find that the non-robust features leveraged by universal perturbations have less predictive signal than those leveraged by standard adversarial perturbations, despite being more human-aligned. We measure both (1) generalizability to the original test set and (2) transferability of perturbations across independent models, following the methodology of Ilyas et al. (2019). Under these metrics, universal perturbations consistently obtain non-trivial but substantially worse performance than standard adversarial perturbations.
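To make the spatial-invariance claim in finding (2) concrete, the evaluation it implies can be sketched as follows. This is a minimal, hypothetical numpy sketch, not the paper's actual evaluation code; the helper names `translate` and `attack_success_rate` and the toy `predict` function are our own illustrative choices. A perturbation is circularly shifted and its attack success rate is re-measured:

```python
import numpy as np

def translate(delta, dy, dx):
    """Circularly shift an (H, W, C) perturbation by (dy, dx) pixels."""
    return np.roll(np.roll(delta, dy, axis=0), dx, axis=1)

def attack_success_rate(predict, X, delta, target):
    """Fraction of perturbed inputs (clipped to valid pixel range)
    that the classifier assigns to the target class."""
    preds = predict(np.clip(X + delta, 0.0, 1.0))
    return float(np.mean(preds == target))

# Toy stand-in classifier: predicts class 1 when the mean pixel is bright.
predict = lambda batch: (batch.mean(axis=(1, 2, 3)) > 0.5).astype(int)

X = np.zeros((4, 8, 8, 3))            # four "images"
delta = np.full((8, 8, 3), 0.9)       # a (trivially effective) perturbation

asr_original = attack_success_rate(predict, X, delta, target=1)
asr_shifted = attack_success_rate(predict, X, translate(delta, 2, 3), target=1)
```

Finding (2) states that, for real universal perturbations and real models, `asr_shifted` stays close to `asr_original` across translations.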

2. PRELIMINARIES

We consider a standard classification task: given input-label samples (x, y) ∈ X × Y from a data distribution D, the goal is to learn a classifier C : X → Y that generalizes to new data.

Non-robust vs. robust features. Following Ilyas et al. (2019), we introduce the following terminology. A useful feature for classification is a function that is (positively) correlated with the correct label in expectation. A feature is robustly useful if, even under adversarial perturbations (within a specified set of valid perturbations ∆), the feature is still useful. Finally, a useful, non-robust feature is a feature that is useful but not robustly useful. These features are useful for classification in the standard setting, but can hurt accuracy in the adversarial setting (since their correlation with the label can be reversed). For conciseness, throughout this paper we will refer to such features simply as non-robust features.

Universal perturbations. A universal adversarial perturbation (or just universal perturbation for short), as introduced in Moosavi-Dezfooli et al. (2017a), is a perturbation δ that causes the classifier C to predict the wrong label on a large fraction of inputs from D. We focus on targeted universal perturbations that fool C into predicting a specific (usually incorrect) target label t. Thus, a (targeted) universal perturbation δ ∈ ∆ satisfies Pr_{(x,y)∼D}[C(x + δ) = t] = ρ, where ρ is the attack success rate (ASR) of the universal perturbation. The only technical difference between universal perturbations and (standard) adversarial perturbations is the use of a single perturbation vector δ that is applied to all inputs. Thus, one can think of universality as a constraint for the perturbation δ to be input-independent. Universal perturbations can leverage non-robust features in data, as we show here; we refer to these simply as universal non-robust features.

ℓ_p perturbations. We study the case where ∆ is the set of ℓ_p-bounded perturbations, i.e. ∆ = {δ ∈ R^d : ‖δ‖_p ≤ ε} for p = 2, ∞. This is the most widely studied setting for research on adversarial examples and has proven to be an effective benchmark (Carlini et al., 2019). Additionally, ℓ_p-robustness appears to be aligned to a certain degree with the human visual system (Tsipras et al., 2019).
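Projected attacks such as PGD keep δ inside ∆ by projecting onto the ℓ_p ball after each step. A minimal numpy sketch of this projection for the two cases studied here (the function name `project` is our own, not from the paper):

```python
import numpy as np

def project(delta, eps, p):
    """Project delta onto the l_p ball of radius eps, for p in {2, inf}."""
    if p == 2:
        # Rescale onto the sphere only if delta lies outside the ball.
        norm = np.linalg.norm(delta)
        if norm > eps:
            delta = delta * (eps / norm)
        return delta
    elif p == np.inf:
        # The l_inf projection is an elementwise clamp to [-eps, eps].
        return np.clip(delta, -eps, eps)
    raise ValueError("only p = 2 and p = inf are supported")
```

For ℓ_2 the projection rescales the whole vector; for ℓ_∞ it clips each coordinate independently, which is why ℓ_∞-bounded perturbations tend to saturate the budget everywhere.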

2.1. COMPUTING UNIVERSAL PERTURBATIONS

We compute universal perturbations by applying projected gradient descent (PGD) to the following optimization problem: min_{δ∈∆} E_{(x,y)∼D}[L(f(x + δ), t)], where L is the standard cross-entropy loss for classification, f denotes the model's logits, and t is the target label. While many different algorithms have been developed for
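A rough sketch of this optimization loop, under stated assumptions: the real implementation would differentiate a deep network in a framework such as PyTorch, whereas here `grad_fn(x, delta)` is an abstract callable returning the gradient of the mean loss L(f(x + δ), t) with respect to δ (the target t and model f are assumed baked into it), and all names are our illustrative choices:

```python
import numpy as np

def universal_pgd(grad_fn, data_iter, eps, step_size, steps, shape, p=np.inf):
    """Targeted universal perturbation via PGD: one shared delta for all inputs."""
    delta = np.zeros(shape)
    for _ in range(steps):
        for x in data_iter():
            g = grad_fn(x, delta)
            if p == np.inf:
                # Signed step, then clamp each coordinate to [-eps, eps].
                delta = delta - step_size * np.sign(g)
                delta = np.clip(delta, -eps, eps)
            else:  # p == 2
                # Normalized step, then rescale back into the l2 ball if needed.
                delta = delta - step_size * g / (np.linalg.norm(g) + 1e-12)
                norm = np.linalg.norm(delta)
                if norm > eps:
                    delta = delta * (eps / norm)
    return delta
```

Because a single δ is updated across every batch, the loop can only retain directions whose loss gradients agree across many inputs, which is exactly the input-independence constraint discussed in Section 2.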



Figure 1: ℓ_2 adversarial perturbations (ε = 6.0) for two target classes, bird and feline, on ImageNet-M10: (a,d) standard adversarial perturbations on a single image, (b,e) universal adversarial perturbations, and (c,f) zooming in on the most semantic patch.

