NON-ROBUST FEATURES THROUGH THE LENS OF UNIVERSAL PERTURBATIONS

Abstract

Recent work ties adversarial examples to the existence of non-robust features: features that are susceptible to small perturbations and believed to be unintelligible to humans, yet still useful for prediction. We study universal adversarial perturbations and demonstrate that the above picture is more nuanced. Specifically, even though universal perturbations, like standard adversarial perturbations, do leverage non-robust features, these features tend to be fundamentally different from the "standard" ones and, in particular, non-trivially human-aligned: universal perturbations exhibit more human-aligned locality and spatial invariance properties. However, we also show that these human-aligned non-robust features carry much less predictive signal than general non-robust features. Our findings thus take a step towards improving our understanding of these previously unintelligible features.

1. INTRODUCTION

Modern deep neural networks perform extremely well across many prediction tasks, but they largely remain vulnerable to adversarial examples (Szegedy et al., 2014). Models' brittleness to these small, imperceptible perturbations highlights one alarming way in which models deviate from humans. Recent work attributes this deviation to the presence of useful non-robust features in our datasets (Ilyas et al., 2019). These are brittle features that are sensitive to perturbations too small to be noticeable to humans, yet capture enough predictive signal to generalize well on the underlying classification task. When models rely on non-robust features, they become vulnerable to adversarial examples, as even small perturbations can flip the features' signal.

While prior work gives evidence of non-robust features in natural datasets, we lack a more fine-grained understanding of their properties. In general, we do not understand well how models make decisions, so it is unclear how much we can understand about features that are believed to be imperceptible. A number of works suggest that these features may exploit certain properties of the dataset that are misaligned with human perception (Ilyas et al., 2019), such as high-frequency information (Yin et al., 2019), but much remains unknown.

In this work, we illustrate how we can isolate more human-aligned non-robust features by imposing additional constraints on adversarial perturbations. In particular, we revisit universal adversarial perturbations (Moosavi-Dezfooli et al., 2017a), i.e., adversarial perturbations that generalize across many inputs. Prior works have observed that these perturbations appear to be semantic (Hayes & Danezis, 2019; Khrulkov & Oseledets, 2018; Liu et al., 2019). We demonstrate that universal perturbations do leverage non-robust features, but ones that are non-trivially human-aligned.
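For concreteness, the "useful but non-robust" notion can be made formal. The sketch below paraphrases the definitions of Ilyas et al. (2019) for binary classification with labels y in {-1, +1}; the constants rho, gamma and the perturbation set Delta follow their notation.

    % A feature f : X -> R is rho-useful (rho > 0) if it correlates
    % with the label in expectation over the data distribution D:
    \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\, y \cdot f(x) \,\big] \;\ge\; \rho .

    % It is gamma-robustly useful if it remains useful under worst-case
    % perturbations from an allowed set Delta (e.g., an l2 ball of radius eps):
    \mathbb{E}_{(x,y) \sim \mathcal{D}}\Big[\, \inf_{\delta \in \Delta}\; y \cdot f(x + \delta) \,\Big] \;\ge\; \gamma .

A useful non-robust feature is then rho-useful for some rho > 0 but not gamma-robustly useful for any gamma >= 0; an adversarial perturbation flips exactly the signal of such features.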
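To make the universal attack setting concrete, the following is a minimal PyTorch sketch of computing a targeted, l2-bounded universal perturbation: a single delta is optimized over many inputs rather than per image. The function name, image size, and hyperparameters are illustrative assumptions, not the exact procedure used in this paper (clipping to the valid pixel range is also omitted for brevity).

    import torch

    def universal_perturbation(model, loader, target_class, eps=6.0,
                               step_size=0.5, epochs=5, device="cuda"):
        """Optimize one l2-bounded perturbation shared across many inputs,
        pushing the model's prediction toward `target_class` on all of them."""
        model.eval()
        # A single perturbation, broadcast over the batch dimension.
        delta = torch.zeros(1, 3, 224, 224, device=device, requires_grad=True)
        loss_fn = torch.nn.CrossEntropyLoss()

        for _ in range(epochs):
            for x, _ in loader:
                x = x.to(device)
                target = torch.full((x.size(0),), target_class,
                                    dtype=torch.long, device=device)
                loss = loss_fn(model(x + delta), target)
                (grad,) = torch.autograd.grad(loss, delta)
                with torch.no_grad():
                    # Normalized-gradient descent step toward the target class.
                    delta -= step_size * grad / (grad.norm() + 1e-12)
                    # Project back onto the l2 ball of radius eps.
                    norm = delta.norm()
                    if norm > eps:
                        delta.mul_(eps / norm)
        return delta.detach()

Because the same delta must fool the model on every input, it cannot exploit image-specific signal; this is what makes universal perturbations a natural lens for isolating the more input-independent, human-aligned non-robust features studied here.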



Figure 1: l2 adversarial perturbations (eps = 6.0) for two target classes, bird and feline, on ImageNet-M10: (a,d) standard adversarial perturbations on a single image, (b,e) universal adversarial perturbations, and (c,f) zooming in on the most semantic patch.

