ROBUST UNIVERSAL ADVERSARIAL PERTURBATIONS

Abstract

Universal Adversarial Perturbations (UAPs) are imperceptible, image-agnostic vectors that cause deep neural networks (DNNs) to misclassify inputs from a data distribution with high probability. In practical attack scenarios, adversarial perturbations may undergo transformations such as changes in pixel intensity, rotation, etc., while being added to DNN inputs. Existing methods do not create UAPs robust to these real-world transformations, thereby limiting their applicability in attack scenarios. In this work, we introduce and formulate robust UAPs. We build an iterative algorithm that uses probabilistic robustness bounds and transformations generated by composing arbitrary sub-differentiable transformation functions to construct such robust UAPs. We perform an extensive evaluation on the popular CIFAR-10 and ILSVRC 2012 datasets, measuring our UAPs' robustness under a wide range of common, real-world transformations such as rotation, contrast change, etc. Our results show that our method can generate UAPs up to 23% more robust than existing state-of-the-art baselines.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved impressive results in many application domains such as natural language processing (Abdel-Hamid et al., 2014; Brown et al., 2020), medicine (Esteva et al., 2017; 2019), and computer vision (Simonyan & Zisserman, 2014; Szegedy et al., 2016). Despite their performance, they can be fragile in the face of adversarial perturbations: small, imperceptible changes added to a correctly classified input that make a DNN misclassify. While there is a large body of work on generating adversarial perturbations (Szegedy et al., 2013; Goodfellow et al., 2014; Moosavi-Dezfooli et al., 2016; Madry et al., 2017; Carlini & Wagner, 2017; Xiao et al., 2018a; Dong et al., 2018; Croce & Hein, 2019; Wang et al., 2019; Zheng et al., 2019; Andriushchenko et al., 2019; Tramèr et al., 2020), the threat model considered by these works cannot be realized in practical scenarios, because it rests on unrealistic assumptions about the power of the attacker: the attacker knows the DNN input in advance, generates input-specific perturbations in real time, and combines the perturbation exactly with the input before it is processed by the DNN.

Practically feasible adversarial perturbations. In this work, we consider a more practical adversary to reveal real-world vulnerabilities of state-of-the-art DNNs. We assume that the attacker (i) does not know the DNN inputs in advance, (ii) can only transmit additive adversarial perturbations, and (iii) transmits perturbations that are susceptible to modification by real-world effects. Examples of attacks in our threat model include adding stickers to cameras to fool image classifiers (Li et al., 2019b) or transmitting perturbations over the air to deceive audio classifiers (Li et al., 2019a). Note that this threat model is distinct from directly generating adversarial examples (Athalye et al., 2018), which requires access to the original input.
The first two requirements of our threat model can be fulfilled by generating Universal Adversarial Perturbations (UAPs) (Moosavi-Dezfooli et al., 2017): the attacker trains a single adversarial perturbation that has a high probability of being adversarial on all inputs in the training distribution. However, as our experimental results show, the generated UAPs need to be combined with the DNN inputs precisely; otherwise, they fail to remain adversarial. In practice, changes to UAPs are likely due to real-world effects. For example, stickers applied to a camera can undergo changes in contrast due to weather conditions, and a transmitted audio perturbation can change due to noise in the transmission channel. This non-robustness reduces the effectiveness of practical attacks created with existing methods (Moosavi-Dezfooli et al., 2017; Shafahi et al., 2020; Li et al., 2019b;a).

This work: Robust UAPs. To overcome the above limitation, we propose the concept of robust UAPs: perturbations that have a high probability of remaining adversarial on inputs from the training distribution even after a set of real-world transformations is applied to them. Generating robust UAPs is more challenging than the standard UAP optimization problem (Moosavi-Dezfooli et al., 2017), as we seek perturbations that are adversarial for a set of inputs as well as under transformations applied to the perturbations themselves. To address this challenge, we make the following main contributions:

• We introduce robust UAPs and formulate their generation as an optimization problem.

• We design a new method for constructing robust UAPs. Our method is general and works for any transformation generated by composing arbitrary sub-differentiable transformation functions. We provide an algorithm for computing provable probabilistic bounds on the robustness of our UAPs against many practical transformations.
• We perform an extensive evaluation of the effectiveness of our method, RobustUAP, on state-of-the-art models for the popular CIFAR-10 (Krizhevsky et al., 2009) and ILSVRC 2012 (Deng et al., 2009) datasets. We compare the robustness of our UAPs under compositions of challenging real-world transformations, such as rotation, contrast change, etc. We show that on both datasets, the UAPs generated by RobustUAP are significantly more robust, achieving up to 23% higher robustness than the UAPs generated by the baselines.

Our work is complementary to the development of real-world attacks (Li et al., 2019a;b) in various domains, which require modeling how universal perturbations change during transmission. RobustUAP can improve the effectiveness of such attacks by constructing perturbations that are more robust to domain-specific, real-world transformations than is possible with existing algorithms (Moosavi-Dezfooli et al., 2017; Shafahi et al., 2020; Li et al., 2019a;b).
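The core idea of training a perturbation over both a data distribution and a transformation distribution can be sketched in a few lines. The following is a minimal, illustrative numpy sketch, not the paper's exact algorithm (which computes probabilistic robustness bounds): at each step it samples a sub-differentiable transformation of the current universal perturbation u, takes an ascent step on the loss averaged over a batch of inputs, and projects u back onto an L∞ ball. The names `robust_uap` and `model_grad` are hypothetical.

```python
import numpy as np

def robust_uap(model_grad, images, eps, transforms, steps=100, lr=0.1, rng=None):
    """Illustrative robust-UAP loop (assumed interface, not the paper's method).

    model_grad(x) -- gradient of the classification loss w.r.t. the input x
    transforms    -- list of callables t(u) modeling real-world changes to u
    eps           -- L_inf budget for the universal perturbation
    """
    rng = rng or np.random.default_rng(0)
    u = np.zeros_like(images[0])
    for _ in range(steps):
        t = transforms[rng.integers(len(transforms))]       # sample a transformation of u
        g = np.mean([model_grad(x + t(u)) for x in images], axis=0)
        u = u + lr * np.sign(g)                             # FGSM-style ascent step
        u = np.clip(u, -eps, eps)                           # project onto the L_inf ball
    return u
```

In a real attack, `model_grad` would come from backpropagation through the target DNN and `transforms` would model domain-specific effects (rotation, contrast change, channel noise); sampling transformations inside the loop is what pushes u toward remaining adversarial after transformation.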

2. BACKGROUND

In this section, we provide the necessary background definitions and notation for our work.

Adversarial Examples and Perturbations. An adversarial example is a misclassified data point that is close (in some norm) to a correctly classified data point (Goodfellow et al., 2014; Madry et al., 2017; Carlini & Wagner, 2017). Let µ be the input data distribution over R^d, x ∼ µ an input point with corresponding true label y, and f : R^d → R^{d′} our target classifier. For ease of notation, we write f_k(x) for the k-th element of f(x) and allow f(x) = arg max_k f_k(x) to directly refer to the classification label. We use v to denote image-specific perturbations and u to denote universal adversarial perturbations; v_r and u_r refer to their robust variants, defined in Sec. 3. We now formally define an adversarial example. In this paper, we consider examples x′ generated as x′ = x + v, where v is an adversarial perturbation.
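To make the notation concrete, here is a tiny numpy sketch of a classifier f : R^d → R^{d′} with the label convention f(x) = arg max_k f_k(x). The linear map W is a hypothetical stand-in for a DNN (d = 4 input features, d′ = 3 classes), chosen only for illustration.

```python
import numpy as np

# Hypothetical stand-in for a classifier f : R^4 -> R^3; in the paper f is a DNN.
W = np.array([[ 1.0, 0.0, -1.0,  0.5],
              [ 0.0, 2.0,  0.0, -0.5],
              [-1.0, 0.0,  1.0,  0.0]])

def f(x):
    """Score vector f(x); f_k(x) is its k-th component."""
    return W @ x

def label(x):
    """Classification label: arg max_k f_k(x)."""
    return int(np.argmax(f(x)))
```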



Figure 1: Robust UAPs (left) cause a classifier to misclassify on most of the data distribution even after transformations are applied to them. Standard UAPs (right) are not robust to transformations and have a low probability of remaining adversarial after transformation.

Definition 2.1. Given a correctly classified point x, a distance function d(·, ·) : R^d × R^d → R, and a bound ϵ ∈ R, x′ is an adversarial example iff d(x′, x) < ϵ and f(x′) ≠ y.
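Definition 2.1 translates directly into a predicate. The following is a minimal sketch assuming numpy, with L∞ distance as the default d(·, ·); the name `is_adversarial` and the toy classifier in the usage note are illustrative, not part of the paper.

```python
import numpy as np

def is_adversarial(f, x, x_adv, y, eps, d=lambda a, b: np.max(np.abs(a - b))):
    """Check Definition 2.1: x_adv is adversarial for (x, y) iff it lies within
    eps of x under the distance d (L_inf by default) and f assigns it a label
    other than the true label y."""
    return bool(d(x_adv, x) < eps and np.argmax(f(x_adv)) != y)
```

For instance, with a toy one-dimensional classifier f(x) = (x_0, -x_0), the correctly classified point x = (0.1) with y = 0, and budget ϵ = 0.2, the point x′ = (-0.05) satisfies both conditions and is adversarial, while x′ = (0.15) keeps the label and is not.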

