FINE-GRAINED SYNTHESIS OF UNRESTRICTED ADVERSARIAL EXAMPLES

Anonymous

Abstract

We propose a novel approach for generating unrestricted adversarial examples by manipulating fine-grained aspects of image generation. Unlike existing unrestricted attacks, which typically hand-craft geometric transformations, we learn stylistic and stochastic modifications by leveraging state-of-the-art generative models. This allows us to manipulate an image in a controlled, fine-grained manner without being bounded by a norm threshold. Our approach can be used for targeted and non-targeted unrestricted attacks on classification, semantic segmentation and object detection models. Our attacks can bypass certified defenses, yet our adversarial images look indistinguishable from natural images, as verified by human evaluation. Moreover, we demonstrate that adversarial training with our examples improves the model's performance on clean images without requiring any modifications to its architecture. We perform experiments on the high-resolution LSUN, CelebA-HQ and COCO-Stuff datasets to validate the efficacy of our proposed approach.

1. INTRODUCTION

Adversarial examples, inputs resembling real samples but maliciously crafted to mislead machine learning models, have been studied extensively in the last few years. Most of the existing papers, however, focus on norm-constrained attacks and defenses, in which the adversarial input lies in an ε-neighborhood of a real sample under the L_p distance metric (commonly with p = 0, 2, ∞). For small ε, the adversarial input is quasi-indistinguishable from the natural sample. Being norm-constrained is sufficient for an adversarial image to fool the human visual system, but it is not necessary. Moreover, defenses tailored to norm-constrained attacks can fail on other subtle input modifications. This has led to a recent surge of interest in unrestricted adversarial attacks, in which the adversary is not bounded by a norm threshold. These methods typically hand-craft transformations to capture visual similarity. Spatial transformations [Engstrom et al. (2017); Xiao et al. (2018); Alaifari et al. (2018)], viewpoint or pose changes [Alcorn et al. (2018)] and inserted small patches [Brown et al. (2017)], among other methods, have been proposed for unrestricted adversarial attacks.

In this paper, we focus on fine-grained manipulation of images for unrestricted adversarial attacks. We build upon state-of-the-art generative models which disentangle factors of variation in images. We create fine- and coarse-grained adversarial changes by manipulating various latent variables at different resolutions. The loss of the target network is used to guide the generation process. The pre-trained generative model constrains the search space for our adversarial examples to realistic images, thereby revealing the target model's vulnerability in the natural image space. We verify that we do not deviate from the space of realistic images with a user study as well as a t-SNE plot comparing distributions of real and adversarial images (see Fig. 7 in the appendix).
As a result, we observe that including these examples in training enhances the model's accuracy on clean images. Our contributions can be summarized as follows:

• We present the first method for fine-grained generation of high-resolution unrestricted adversarial examples, in which the attacker controls which aspects of the image to manipulate, resulting in a diverse set of realistic, on-the-manifold adversarial examples.

• We demonstrate that our proposed attack can break certified defenses against norm-bounded perturbations.

2. RELATED WORK

2.1. NORM-CONSTRAINED ADVERSARIAL EXAMPLES

Most of the existing works on adversarial attacks and defenses focus on norm-constrained adversarial examples: for a given classifier F : R^n → {1, . . . , K} and an image x ∈ R^n, the adversarial image x′ ∈ R^n is created such that ‖x′ − x‖_p < ε and F(x′) ≠ F(x). Common values for p are 0, 2 and ∞, and ε is chosen small enough that the perturbation is imperceptible. Various algorithms have been proposed for creating x′ from x. Optimization-based methods solve a surrogate optimization problem based on the classifier's loss and the perturbation norm. In their pioneering paper on adversarial examples, Szegedy et al. (2013) use box-constrained L-BFGS [Fletcher (2013)] to minimize the surrogate loss function. Carlini & Wagner (2017) propose stronger optimization-based attacks for the L_0, L_2 and L_∞ norms using better objective functions and the Adam optimizer. Gradient-based methods use the gradient of the classifier's loss with respect to the input image. The Fast Gradient Sign Method (FGSM) [Goodfellow et al. (2014)] uses a first-order approximation of the function for faster generation and is optimized for the L_∞ norm. Projected Gradient Descent (PGD) [Madry et al. (2017)] is an iterative variant of FGSM which provides a strong first-order attack by using multiple steps of gradient ascent and projecting perturbed images onto an ε-ball centered at the input. Other variants of FGSM are proposed by Dong et al. (2018) and Kurakin et al. (2016). Several methods have also been proposed for defending against adversarial attacks.

2.2. UNRESTRICTED ADVERSARIAL EXAMPLES

Song et al. (2018) search in the latent (z) space of AC-GAN [Odena et al. (2017)] to find generated images that can fool a target classifier but yield correct predictions on AC-GAN's auxiliary classifier. They constrain the search region of z so that it is close to a randomly sampled noise vector, and show results on the MNIST, SVHN and CelebA datasets. Requiring two classifiers to have inconsistent predictions degrades the model's sample quality.
As we show in the appendix, training with these adversarial examples hurts the model's performance on clean images. Moreover, this approach has no control over the generation process, since small changes in the z space can lead to large changes in generated images and even create unrealistic samples. Our method, on the other hand, manipulates high-resolution real or synthesized images in a fine-grained manner owing to the interpretable disentangled latent space. It also generates samples which improve the model's accuracy on clean images in both classification and segmentation tasks. To further illustrate the difference between our approach and Song et al. (2018), we plot t-SNE embeddings of real images from CelebA-HQ as well as adversarial examples from our method and Song et al.'s approach in the appendix, and show that our adversarial images stay closer to the manifold of real images.
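For concreteness, the norm-constrained PGD attack described in Section 2.1 can be sketched against a toy softmax linear classifier. This is an illustrative sketch only, not the attack proposed in this paper; the weights W, b and the linear model are stand-ins for a real network.

```python
import numpy as np

def pgd_attack(x, y, W, b, eps=0.1, alpha=0.02, steps=10):
    """L_inf PGD against a softmax linear classifier (illustrative sketch).

    x: input vector, y: true label index, W/b: classifier parameters.
    Each step ascends the sign of the cross-entropy gradient, then
    projects the perturbed input back into the eps-ball around x.
    """
    x_adv = x.copy()
    for _ in range(steps):
        logits = W @ x_adv + b
        p = np.exp(logits - logits.max())
        p /= p.sum()
        # gradient of cross-entropy wrt the input is W^T (p - onehot(y))
        p[y] -= 1.0
        grad = W.T @ p
        x_adv = x_adv + alpha * np.sign(grad)     # gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project to the eps-ball
    return x_adv
```

The projection step is what keeps the attack norm-constrained; the unrestricted attacks discussed next drop exactly this constraint.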

3. APPROACH

Most of the existing works on unrestricted adversarial attacks rely on geometric transformations and deformations which are oblivious to latent factors of variation. In this paper, we leverage disentangled latent representations of images for unrestricted adversarial attacks. We build upon state-of-the-art generative models and consider various target tasks: classification, semantic segmentation and object detection.

3.1. CLASSIFICATION

Style-GAN [Karras et al. (2018)] is a state-of-the-art generative model which disentangles high-level attributes and stochastic variations in an unsupervised manner. Stylistic variations are represented by style variables, and stochastic details are captured by noise variables. Changing the noise affects only low-level details, leaving the overall composition and high-level aspects intact. This allows us to manipulate the noise variables such that variations are barely noticeable by the human eye. The style variables affect higher-level aspects of image generation. For instance, when the model is trained on bedrooms, style variables from the top layers control the camera viewpoint, middle layers select the particular furniture, and bottom layers deal with colors and details of materials. This allows us to manipulate images in a controlled manner, providing an avenue for fine-grained unrestricted attacks. Formally, we can represent Style-GAN with a mapping function f and a synthesis network g. The mapping function is an 8-layer MLP which takes a latent code z and produces an intermediate latent vector w = f(z). This vector is then specialized by learned affine transformations A to style variables y, which control adaptive instance normalization operations after each convolutional layer of the synthesis network g. Noise inputs are single-channel images consisting of uncorrelated Gaussian noise that are fed to each layer of the synthesis network. Learned per-feature scaling factors B are used to generate noise variables η, which are added to the outputs of convolutional layers. The synthesis network takes style y and noise η as input, and generates an image x = g(y, η). We pass the generated image to a pre-trained classifier F. We seek to slightly modify x so that F can no longer classify it correctly. We achieve this by perturbing the style and noise tensors.
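The flow from w to style and noise variables described above can be sketched as follows. This is a heavily simplified numpy sketch: real Style-GAN layers also include convolutions, upsampling and learned constants, all omitted here.

```python
import numpy as np

def adain(feat, y_scale, y_bias, eps=1e-5):
    """Adaptive instance normalization: normalize each feature map to
    zero mean and unit variance, then scale and shift it with the
    style variables y = (y_scale, y_bias)."""
    mu = feat.mean(axis=(1, 2), keepdims=True)
    sd = feat.std(axis=(1, 2), keepdims=True)
    return y_scale[:, None, None] * (feat - mu) / (sd + eps) + y_bias[:, None, None]

def synthesis_layer(feat, w, A, B, noise):
    """One StyleGAN-like synthesis layer (convolutions omitted).

    A: learned affine map specializing w to style variables y.
    B: learned per-feature noise scaling factors.
    noise: single-channel image of uncorrelated Gaussian noise.
    """
    c = feat.shape[0]
    style = A @ w                                  # y = A(w)
    y_scale, y_bias = style[:c], style[c:]
    feat = feat + B[:, None, None] * noise[None]   # add scaled noise
    return adain(feat, y_scale, y_bias)
```

In the attacks below, the quantities perturbed are exactly the style output A @ w (the y tensors) and the scaled-noise inputs (the η tensors), not the image pixels.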
We initialize adversarial style and noise variables as y_adv = y and η_adv = η, and iteratively update them in order to fool the classifier. The classifier's loss determines the update rule, which in turn depends on the type of attack. As is common in the literature, we consider two types of attacks: non-targeted and targeted. To generate non-targeted adversarial examples, we need to change the model's original prediction. Starting from the initial values, we can iteratively perform gradient ascent in the style and noise spaces of the generator to find values that maximize the classifier's loss. Alternatively, as proposed by Kurakin et al. (2016), we can use the least-likely predicted class ll_x = arg min(F(x)) as our target; we found this approach more effective in practice. At time step t, the update rule for the style and noise variables is:

y_adv^(t+1) = y_adv^(t) − ε · sign(∇_{y_adv^(t)} J(F(g(y_adv^(t), η_adv^(t))), ll_x))   (1)

η_adv^(t+1) = η_adv^(t) − δ · sign(∇_{η_adv^(t)} J(F(g(y_adv^(t), η_adv^(t))), ll_x))   (2)

in which J(·, ·) is the classifier's loss function, F(·) gives the probability distribution over classes, x = g(y, η), and ε, δ ∈ R are step sizes. We use (ε, δ) = (0.004, 0.2) and (0.004, 0.1) for LSUN and CelebA-HQ respectively. We perform multiple steps of gradient descent (usually 2 to 10) until the classifier is fooled. Generating targeted adversarial examples is more challenging, as we need to change the prediction to a specific class T. In this case, we perform gradient descent to minimize the classifier's loss with respect to the target:

y_adv^(t+1) = y_adv^(t) − ε · sign(∇_{y_adv^(t)} J(F(g(y_adv^(t), η_adv^(t))), T))   (3)

η_adv^(t+1) = η_adv^(t) − δ · sign(∇_{η_adv^(t)} J(F(g(y_adv^(t), η_adv^(t))), T))   (4)

We use (ε, δ) = (0.005, 0.2) and (0.004, 0.1) in the experiments on LSUN and CelebA-HQ respectively. In practice, 3 to 15 updates suffice to fool the classifier. Note that we only control deviation from the initial latent variables, and do not impose any norm constraint on the generated images.
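The least-likely-class updates of equations 1 and 2 can be sketched as a single loop over the latent variables. In this sketch, g, F and loss_grad are hypothetical stand-ins for the synthesis network, the classifier's probability output, and the gradient of J with respect to (y, η); the real quantities come from automatic differentiation through the pre-trained networks.

```python
import numpy as np

def least_likely_attack(y0, eta0, g, F, loss_grad,
                        eps=0.004, delta=0.2, max_iter=15):
    """Iteratively move style (y) and noise (eta) variables toward the
    classifier's least-likely class, following equations 1-2.

    g(y, eta) -> image; F(image) -> class probabilities;
    loss_grad(y, eta, target) -> gradients of J wrt (y, eta).
    """
    ll = int(np.argmin(F(g(y0, eta0))))        # least-likely class ll_x
    y_adv, eta_adv = y0.copy(), eta0.copy()
    for _ in range(max_iter):
        if int(np.argmax(F(g(y_adv, eta_adv)))) == ll:
            break                              # stop once the model is fooled
        gy, geta = loss_grad(y_adv, eta_adv, ll)
        y_adv -= eps * np.sign(gy)             # descend toward target class
        eta_adv -= delta * np.sign(geta)
    return y_adv, eta_adv
```

Note there is no projection step: only the number of iterations bounds the deviation from the initial latent variables, mirroring the unrestricted setting.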

3.1.1. INPUT-CONDITIONED GENERATION

Generation can also be conditioned on real input images by embedding them into the latent space of Style-GAN. We first synthesize images similar to a given input image I by optimizing the values of y and η such that g(y, η) is close to I. More specifically, we minimize the perceptual distance [Johnson et al. (2016)] between g(y, η) and I. We can then proceed as in equations 1-4 to perturb these tensors and generate the adversarial image. Realism of the synthesized images depends on the inference properties of the generative model. In practice, generated images resemble the input images with high fidelity, especially for CelebA-HQ images.
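The embedding step above can be sketched as plain gradient descent on a reconstruction distance. In this sketch, squared error stands in for the perceptual loss of Johnson et al. (2016), and grad_fn is a hypothetical stand-in for the gradients obtained by differentiating through the generator.

```python
import numpy as np

def embed_image(I, grad_fn, y0, eta0, lr=0.01, steps=200):
    """Optimize (y, eta) so that g(y, eta) reconstructs the input I.

    grad_fn(y, eta, I) returns gradients of the reconstruction distance
    with respect to (y, eta). The paper minimizes a perceptual distance;
    this sketch works with any differentiable distance.
    """
    y, eta = y0.copy(), eta0.copy()
    for _ in range(steps):
        gy, geta = grad_fn(y, eta, I)
        y -= lr * gy        # descend in the style space
        eta -= lr * geta    # descend in the noise space
    return y, eta
```

The recovered (y, η) then serve as the initial values y_adv, η_adv for the attack updates.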

3.2. SEMANTIC SEGMENTATION AND OBJECT DETECTION

We also consider the task of semantic segmentation, leveraging the generative model proposed by Park et al. (2019). The model is conditioned on input semantic layouts and uses SPatially-Adaptive (DE)normalization (SPADE) modules to better preserve semantic information than common normalization layers. The layout is first projected onto an embedding space and then convolved to produce the modulation parameters γ and β. We adversarially modify these parameters with the goal of fooling a segmentation model. We consider non-targeted attacks using per-pixel predictions, and compute the gradient of the loss function with respect to the modulation parameters with an update rule similar to equations 1 and 2. Figure 2 illustrates the architecture. Note that manipulating variables at smaller resolutions leads to coarser changes. We consider a similar architecture for the object detection task, except that we pass the generated image to the detection model and try to increase its loss. Results for this task are shown in the appendix.
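The SPADE modulation and its non-targeted perturbation can be sketched as follows. This is an illustrative sketch: in the actual model γ and β are produced by convolutions over the embedded layout, and the gradients come from backpropagating the segmentation loss.

```python
import numpy as np

def spade_modulate(feat, gamma, beta, eps=1e-5):
    """SPADE-style modulation: normalize the activations per channel,
    then scale and shift with spatially-varying gamma and beta."""
    mu = feat.mean(axis=(1, 2), keepdims=True)
    sd = feat.std(axis=(1, 2), keepdims=True)
    return gamma * (feat - mu) / (sd + eps) + beta

def perturb_modulation(gamma, beta, grad_gamma, grad_beta, step=0.001):
    """One non-targeted update of the modulation parameters: ascend the
    sign of the segmentation-loss gradient, mirroring equations 1-2."""
    return (gamma + step * np.sign(grad_gamma),
            beta + step * np.sign(grad_beta))
```

Because γ and β are defined per layer and per spatial location, choosing which layers to perturb controls how coarse or fine the resulting image changes are.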

4. RESULTS AND DISCUSSION

We provide qualitative and quantitative results using experiments on LSUN [Yu et al. (2015)] and CelebA-HQ [Karras et al. (2017)]. LSUN contains 10 scene categories and 20 object categories. We use all the scene classes as well as two object classes: cars and cats. We consider this dataset since it is used in Style-GAN and is well suited for a classification task. For the scene categories, a 10-way classifier is trained based on Inception-v3 [Szegedy et al. (2016)], which achieves an accuracy of 87.7% on LSUN's test set. The two object classes also appear in ImageNet [Deng et al. (2009)], a richer dataset containing 1000 categories. Therefore, for experiments on cars and cats we use an Inception-v3 model trained on ImageNet. This allows us to explore a broader set of categories in our attacks, and is particularly helpful for targeted adversarial examples. CelebA-HQ consists of 30,000 face images at 1024 × 1024 resolution. We consider the gender classification task, and use the classifier provided by Karras et al. (2018). This is a binary task for which targeted and non-targeted attacks are similar. In order to synthesize a variety of adversarial examples, we use different random seeds in Style-GAN to obtain various values for z, w, y and η. Style-based adversarial examples are generated by initializing y_adv with the value of y, and iteratively updating it as in equation 1 (or 3) until the resulting image g(y_adv, η) fools the classifier F. Noise-based adversarial examples are created similarly using η_adv and the update rule in equation 2 (or 4). While using different step sizes makes a fair comparison difficult, we generally found it easier to fool the model by manipulating the noise variables. We can also combine the effects of style and noise by simultaneously updating y_adv and η_adv in each iteration, and feeding g(y_adv, η_adv) to the classifier. In this case, the effect of style usually dominates since it creates coarser changes.
Figure 4 depicts adversarial examples on CelebA-HQ gender classification. Males are classified as females and vice versa. As we observe, various facial features are altered by the model, yet the identity is preserved. Similar to LSUN images, noise-based changes are more subtle than style-based ones, and we observe a spectrum of high-level, mid-level and low-level changes. Figure 5 illustrates adversarial examples conditioned on real input images using the procedure described in Section 3.1.1. Synthesized images resemble the inputs with high fidelity, and set the initial values in our optimization process. In some cases, we can notice how the model alters masculine or feminine features. For instance, women's faces become more masculine in columns 2 and 4, and men's beards are removed in column 3 of Figure 4 and column 1 of Figure 5. We also show results on semantic segmentation in Figure 6, in which we consider non-targeted attacks on DeepLab-v2 [Chen et al. (2017)] with a generator trained on the COCO-Stuff dataset [Caesar et al. (2018)]. We iteratively modify modulation parameters at all layers, using a step size of 0.001, to maximize the segmentation loss with respect to the given label map. As we observe, subtle modifications to images lead to large drops in accuracy. Unlike perturbation-based attacks, L_p distances between original and adversarial images are large, yet they are visually similar. Moreover, we do not observe high-frequency perturbations in the generated images. The model learns to modify the initial input without leaving the manifold of realistic images. Additional examples and higher-resolution images are provided in the appendix.

4.1 ADVERSARIAL TRAINING

To ensure that the model maximally benefits from these additional samples, we need to avoid unrealistic examples which do not resemble natural images. Therefore, we only include samples that fool the model in fewer than a specific number of iterations. We use a threshold of 10 as the maximum number of iterations, and demonstrate results on classification and semantic segmentation. We use the first 10 generated examples for each starting image in the segmentation task.
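The filtering rule used to select training samples can be sketched in a few lines. The (image, iterations-to-fool) pair format is hypothetical, chosen only for illustration.

```python
def filter_adversarial(examples, max_iters=10):
    """Keep only adversarial examples that fooled the target model within
    max_iters updates; samples needing more iterations tend to drift
    further from the natural image manifold and are discarded.

    examples: list of (image, n_iters_to_fool) pairs (hypothetical format).
    """
    return [img for img, n in examples if n <= max_iters]
```

The retained examples are then mixed into the training set like ordinary labeled images.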
Table 1 shows accuracy of the strengthened and original classifiers on clean and adversarial test images. For the segmentation task we report the average accuracy of adversarial images at iteration 10. Similar to norm-constrained perturbations, adversarial training is an effective defense against our unrestricted attacks. Note that the model's accuracy on clean test images improves after adversarial training. This is in contrast to training with norm-bounded adversarial inputs, which hurts the classifier's performance on clean images; the difference arises because, unlike perturbation-based inputs, our generated images live on the manifold of realistic images as constrained by the generative model. We also evaluate our attack against the certified defense of Cohen et al. (2019), which uses randomized smoothing with Gaussian noise to guarantee a certain top-1 accuracy for perturbations with L_2 norm less than a specific threshold. We demonstrate that our unrestricted attacks can break this certified defense on ImageNet. We use 400 noise-based and 400 style-based adversarial images from the object categories of LSUN, and group all relevant ImageNet classes as the ground truth. Our adversarial examples are evaluated against a randomized smoothing classifier based on ResNet-50 using Gaussian noise with a standard deviation of 0.5. Table 2 shows accuracy of the model on clean and adversarial images. As we observe, the accuracy drops on adversarial inputs, and the certified defense is not effective against our attack. Note that we stop updating adversarial images as soon as the model is fooled; continuing for more iterations afterwards would yield even stronger attacks.
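For reference, the prediction rule of a randomized-smoothing classifier can be sketched as a majority vote over noisy copies of the input. This is an illustrative sketch; the actual certified defense additionally computes a statistical confidence bound and may abstain from predicting.

```python
import numpy as np

def smoothed_predict(base_classifier, x, sigma=0.5, n=100, num_classes=1000, seed=0):
    """Majority-vote prediction of a smoothed classifier: classify n
    Gaussian-noised copies of x and return the most frequent class.

    base_classifier: maps an image array to a class index (stand-in
    for the ResNet-50 base model used in the evaluation).
    """
    rng = np.random.default_rng(seed)
    votes = np.zeros(num_classes, dtype=int)
    for _ in range(n):
        noisy = x + sigma * rng.standard_normal(x.shape)
        votes[base_classifier(noisy)] += 1
    return int(np.argmax(votes))
```

The certificate only covers L_2 perturbations below a threshold, which is why on-manifold changes with large L_2 distance fall outside its guarantee.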

5. CONCLUSION AND FUTURE WORK

The area of unrestricted adversarial examples is relatively under-explored. Not being bounded by a norm threshold has its own pros and cons: it allows a diverse set of attack mechanisms, but a fair comparison of the relative strength of these attacks is challenging, and it is unclear how to even define provable defenses. While several papers have attempted to interpret norm-constrained attacks in terms of decision boundaries, there has been less effort in understanding the underlying reasons for models' vulnerability to unrestricted attacks. We believe these are promising directions for future research. We also plan to further explore the transferability of our approach for black-box attacks.

A.4 OBJECT DETECTION RESULTS

Figure 8 illustrates results on the object detection task using the RetinaNet target model [Lin et al. (2017)]. We observe that small changes in the images lead to incorrect bounding boxes and predictions by the model. In the segmentation results shown in Figure 11, we simultaneously modify both the γ and β parameters of the SPADE module (Figure 2). We can also consider the impact of modifying each parameter separately; Figure 9 illustrates the results. As we observe, changing γ and β modifies fine details of the images which are barely perceptible, yet they lead to large changes in the predictions of the segmentation model.

A.6 ADVERSARIAL CHANGES TO SINGLE IMAGES

Figure 10 illustrates how images vary as we manipulate specific layers of the network. We observe that each set of layers creates different adversarial changes. For instance, layers 12 to 18 mainly change low-level color details. 



Figure 1: Classification architecture. Style (y) and noise (η) variables are used to generate images g(y, η) which are fed to the classifier F . Adversarial style and noise tensors are initialized with y and η and iteratively updated using gradients of the loss function J.

Figure 2: Semantic segmentation architecture. Adversarial parameters γ adv and β adv are initialized with γ and β, and iteratively updated to fool the segmentation model G.


Figure 3 illustrates generated adversarial examples on LSUN. The original image g(y, η), noise-based image g(y, η_adv) and style-based image g(y_adv, η) are shown. Adversarial images look almost indistinguishable from natural images. Manipulating the noise variable results in subtle, imperceptible changes. Varying the style leads to coarser changes such as different colorization, pose changes, and even removing or inserting objects in the scene. We can also control the granularity of changes by selecting specific layers of the model. Manipulating top layers, corresponding to coarse spatial resolutions, results in high-level changes. Lower layers, on the other hand, modify finer details. In the first two columns of Figure 3, we only modify the top 6 layers (out of 18) to generate adversarial images. The middle two columns change layers 7 to 12, and the last column uses the bottom 6 layers.

Figure 3: Unrestricted adversarial examples on LSUN for a) non-targeted and b) targeted attacks. Predicted classes are shown under each image.

Figure 4: Unrestricted adversarial examples on CelebA-HQ gender classification. From top to bottom: original, noise-based and style-based adversarial images. Males are classified as females and vice versa.

Figure 5: Input-conditioned adversarial examples on CelebA-HQ gender classification. From top to bottom: input, generated and style-based images. Males are classified as females and vice versa.

Figure 6: Unrestricted adversarial examples for semantic segmentation. Generated images, corresponding predictions and their accuracy (ratio of correctly predicted pixels) are shown for different number of iterations.

Figure 8: Unrestricted adversarial examples for object detection. Generated images and their corresponding predictions are shown for different number of iterations.

Figure 9: Impact of separately modifying γ and β parameters on segmentation results. Modified images at different iterations and corresponding predictions are shown. In the first two rows only the γ values are changed and in the last two rows only the β values are modified.

Figure 10: Impact of manipulating different layers of the network on generated adversarial images.

Figure 11: Unrestricted adversarial examples for semantic segmentation. Generated images, corresponding predictions and their accuracy (ratio of correctly predicted pixels) are shown for different number of iterations.

Figure 12: Unrestricted adversarial examples on CelebA-HQ gender classification. From top to bottom: Original, noise-based and style-based adversarial images. Males are classified as females and vice versa.

Figure 13: Unrestricted adversarial examples on LSUN for a) non-targeted and b) targeted attacks. From top to bottom: original, noise-based and style-based images.

Figure 14: High resolution versions of adversarial images. From left to right: original, noise-based and style-based images.

Figure 14: (cont.) High resolution versions of adversarial examples. From left to right: original, noise-based and style-based images.


Table 1: Accuracy of adversarially trained and original models on clean and adversarial test images.

4.2 USER STUDY

Norm-constrained attacks provide visual realism by L_p proximity to a real input. To verify that our unrestricted adversarial examples are realistic and correctly classified by an oracle, we perform human evaluation using Amazon Mechanical Turk. In the first experiment, each adversarial image is assigned to three workers, and their majority vote is considered as the label. The user interface for each worker contains nine images, and shows the possible labels to choose from. We use 2400 noise-based and 2400 style-based adversarial images from the LSUN dataset, containing 200 samples from each class (10 scene classes and 2 object classes). The results indicate that 99.2% of workers' majority votes match the ground-truth labels. This number is 98.7% for style-based adversarial examples and 99.7% for noise-based ones. As we observe in Figure 3, noise-based examples do not deviate much from the original image, resulting in easier prediction by a human observer. On the other hand, style-based images show coarser changes, which in a few cases result in unrecognizable images or false predictions by the workers. We use a similar setup in the second experiment, but for classifying real versus fake (generated) images. We also include 2400 real images as well as 2400 unperturbed images generated by Style-GAN. 74.7% of unperturbed images are labeled by workers as real. This number is 74.3% for noise-based adversarial examples and 70.8% for style-based ones, indicating less than a 4% drop compared with unperturbed images generated by Style-GAN.

Table 2: Accuracy of a certified classifier equipped with randomized smoothing on our adversarial images.

Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

Table 5: Average number of iterations (mean ± std) required to fool the classifier.

A APPENDIX

A.1 COMPARISON WITH SONG ET AL. (2018)

We show that adversarial training with examples generated by Song et al. (2018) hurts the classifier's performance on clean images. Table 3 demonstrates the results. We use the same classifier architectures as Song et al. (2018) and consider their basic attack. We observe that the test accuracy on clean images drops by 1.3%, 1.4% and 1.1% on MNIST, SVHN and CelebA respectively. As we show in Table 1, training with our examples improves the accuracy, demonstrating the difference between our approach and that of Song et al. (2018).

A.3 NUMBER OF ITERATIONS

To ensure that the iterative process always converges in a reasonable number of steps, we measure the number of updates required to fool the classifier on 1000 randomly selected images. Results are shown in Table 5. Note that for targeted attacks, we first randomly sample a target class different from the ground-truth label for each image.

