FINE-GRAINED SYNTHESIS OF UNRESTRICTED ADVERSARIAL EXAMPLES

Anonymous

Abstract

We propose a novel approach for generating unrestricted adversarial examples by manipulating fine-grained aspects of image generation. Unlike existing unrestricted attacks that typically hand-craft geometric transformations, we learn stylistic and stochastic modifications by leveraging state-of-the-art generative models. This allows us to manipulate an image in a controlled, fine-grained manner without being bounded by a norm threshold. Our approach can be used for targeted and non-targeted unrestricted attacks on classification, semantic segmentation and object detection models. Our attacks can bypass certified defenses, yet our adversarial images look indistinguishable from natural images, as verified by human evaluation. Moreover, we demonstrate that adversarial training with our examples improves the model's performance on clean images without requiring any modifications to the architecture. We perform experiments on the high-resolution LSUN, CelebA-HQ and COCO-Stuff datasets to validate the efficacy of our proposed approach.

1. INTRODUCTION

Adversarial examples, inputs resembling real samples but maliciously crafted to mislead machine learning models, have been studied extensively in the last few years. Most of the existing papers, however, focus on norm-constrained attacks and defenses, in which the adversarial input lies in an ε-neighborhood of a real sample under the L_p distance metric (commonly with p = 0, 2, ∞). For small ε, the adversarial input is quasi-indistinguishable from the natural sample. Being norm-constrained is sufficient for an adversarial image to fool the human visual system, but it is not necessary. Moreover, defenses tailored for norm-constrained attacks can fail on other subtle input modifications. This has led to a recent surge of interest in unrestricted adversarial attacks, in which the adversary is not bounded by a norm threshold. These methods typically hand-craft transformations to capture visual similarity. Spatial transformations [Engstrom et al. (2017); Xiao et al. (2018); Alaifari et al. (2018)], viewpoint or pose changes [Alcorn et al. (2018)] and insertion of small patches [Brown et al. (2017)], among other methods, have been proposed for unrestricted adversarial attacks.

In this paper, we focus on fine-grained manipulation of images for unrestricted adversarial attacks. We build upon state-of-the-art generative models which disentangle factors of variation in images. We create fine- and coarse-grained adversarial changes by manipulating various latent variables at different resolutions, using the loss of the target network to guide the generation process. The pre-trained generative model constrains the search space for our adversarial examples to realistic images, thereby revealing the target model's vulnerability in the natural image space. We verify that we do not deviate from the space of realistic images with a user study as well as a t-SNE plot comparing distributions of real and adversarial images (see Fig. 7 in the appendix).
As a result, we observe that including these examples in training the model enhances its accuracy on clean images. Our contributions can be summarized as follows:

• We present the first method for fine-grained generation of high-resolution unrestricted adversarial examples, in which the attacker controls which aspects of the image to manipulate, resulting in a diverse set of realistic, on-the-manifold adversarial examples.



• We demonstrate that adversarial training with our examples improves performance of the model on clean images.

• We demonstrate that our proposed attack can break certified defenses on norm-bounded perturbations.

2. RELATED WORK

Most of the existing works on adversarial attacks and defenses focus on norm-constrained adversarial examples: for a given classifier F : R^n → {1, . . . , K} and an image x ∈ R^n, the adversarial image x′ ∈ R^n is created such that ‖x − x′‖_p < ε and F(x) ≠ F(x′). Common values for p are 0, 2 and ∞, and ε is chosen small enough that the perturbation is imperceptible. Various algorithms have been proposed for creating x′ from x. Optimization-based methods solve a surrogate optimization problem based on the classifier's loss and the perturbation norm. In their pioneering paper on adversarial examples, Szegedy et al. (2013) use box-constrained L-BFGS [Fletcher (2013)] to minimize the surrogate loss function. Carlini & Wagner (2017) propose stronger optimization-based attacks for the L_0, L_2 and L_∞ norms using better objective functions and the Adam optimizer. Gradient-based methods use the gradient of the classifier's loss with respect to the input image. The Fast Gradient Sign Method (FGSM) [Goodfellow et al. (2014)] uses a first-order approximation of the function for faster generation and is optimized for the L_∞ norm. Projected Gradient Descent (PGD) [Madry et al. (2017)] is an iterative variant of FGSM which provides a strong first-order attack by using multiple steps of gradient ascent and projecting perturbed images onto an ε-ball centered at the input. Other variants of FGSM are proposed by Dong et al. (2018) and Kurakin et al. (2016).

For an image to be adversarial, it needs to be visually indistinguishable from real images. One way to achieve this is by applying subtle geometric transformations to the input image. Spatially transformed adversarial examples are introduced by Xiao et al. (2018), in which a flow field is learned to displace pixels of the image.
Similarly, Alaifari et al. (2018) iteratively apply small deformations to the input in order to obtain the adversarial image. Engstrom et al. (2017) show that simple translations and rotations are enough to fool deep neural networks. Alcorn et al. (2018) manipulate the pose of an object to fool deep neural networks: they estimate parameters of a 3D renderer that cause the target model to misbehave in response to the rendered image. Another approach for evading the norm constraint is to insert new objects into the image. Adversarial Patch [Brown et al. (2017)] creates an adversarial image by completely replacing part of an image with a synthetic patch, which is image-agnostic and robust to transformations. The existence of on-the-manifold adversarial examples is also shown by Gilmer et al. (2018), who consider the task of classifying between two concentric n-dimensional spheres. Stutz et al. (2019) demonstrate that both robust and accurate models are possible by using on-the-manifold adversarial examples. A challenge for creating unrestricted adversarial examples and defending against them is introduced by Brown et al. (2018), using the simple task of classifying between birds and bicycles. The recent work by Gowal et al. (2020) shows that adversarial training with examples generated by StyleGAN can improve performance of the model on clean images. They consider the classification task on low-resolution datasets such as ColorMNIST and CelebA, and only use fine changes in their adversarial training. Our approach is effective on high-resolution datasets such as CelebA-HQ and LSUN, uses a range of low-level to high-level changes for adversarial training, and encompasses several tasks including classification, segmentation and detection. In addition, we demonstrate that our adversarial examples can break certified defenses on norm-constrained perturbations and are realistic, as verified by human evaluation. Song
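The norm-constrained PGD baseline discussed above can be sketched in a few lines. The stand-in "model" here is a toy logistic regression with an analytic input gradient, so the example runs without a deep-learning framework; all names, dimensions and the choice of ε are hypothetical, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in target: logistic regression on 32-d inputs.
dim = 32
w = rng.normal(size=dim)

def sigmoid(t):
    # Clip the logit to avoid overflow in exp for saturated values.
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30.0, 30.0)))

def pgd_linf(x, y, eps=0.1, alpha=0.02, steps=20):
    """L_inf PGD: repeated signed-gradient ascent on the loss, projecting
    the perturbed input back into the eps-ball around the clean input x."""
    x_adv = x.copy()
    for _ in range(steps):
        p = sigmoid(w @ x_adv)
        grad = (p - y) * w                 # d(cross-entropy)/dx for this model
        x_adv = x_adv + alpha * np.sign(grad)      # FGSM-style step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the eps-ball
    return x_adv

x = rng.normal(size=dim)
y = 1.0 if sigmoid(w @ x) > 0.5 else 0.0   # model's clean prediction
x_adv = pgd_linf(x, y)
```

The projection step is what distinguishes PGD from repeated FGSM: after every signed-gradient step, each coordinate is clipped back into [x − ε, x + ε], so the final perturbation satisfies the L_∞ bound by construction while the loss increases monotonically for this linear model.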

