AT-GAN: AN ADVERSARIAL GENERATIVE MODEL FOR NON-CONSTRAINED ADVERSARIAL EXAMPLES

Anonymous

Abstract

With the rapid development of adversarial machine learning, numerous adversarial attack methods have been proposed. Typical attacks are based on a search in the neighborhood of the input image to generate a perturbed adversarial example. Since 2017, generative models have been adopted for adversarial attacks, and most of them generate adversarial perturbations from an input noise or an input image; the output of these methods is therefore restricted by the input. A recent work targets "unrestricted adversarial examples" using a generative model, but the method is based on a search in the neighborhood of the input noise, so the output is in fact still constrained by the input. In this work, we propose AT-GAN (Adversarial Transfer on Generative Adversarial Net) to train an adversarial generative model that can directly produce adversarial examples. Different from previous works, we aim to learn the distribution of adversarial examples so as to generate semantically meaningful adversaries. AT-GAN achieves this goal by first learning a generative model for real data, followed by transfer learning to obtain the desired generative model. Once trained and transferred, AT-GAN can directly and quickly generate adversarial examples for any input noise, which we denote as non-constrained adversarial examples. Extensive experiments and visualizations show that AT-GAN can efficiently generate diverse adversarial examples that are realistic to human perception, and yields higher attack success rates against adversarially trained models.

1. INTRODUCTION

In recent years, Deep Neural Networks (DNNs) have been found vulnerable to adversarial examples (Szegedy et al., 2014), which are well-crafted samples with tiny perturbations that are imperceptible to humans but can fool the learning models. Although deep learning has empowered many successful applications, some of them are safety-critical, for example self-driving cars (Eykholt et al., 2018; Cao et al., 2019), raising serious concerns in academia and industry. Numerous works have been developed on adversarial attacks (Goodfellow et al., 2015; Carlini & Wagner, 2017; Madry et al., 2018), adversarial defenses (Goodfellow et al., 2015; Kurakin et al., 2017; Song et al., 2019) and the properties of adversarial examples (He et al., 2018; Shamir et al., 2019).

For adversarial attacks, most studies focus on perturbation-based adversarial examples constrained by input images, which is also the generally accepted conception of adversarial examples. Generative models have also been adopted recently to generate adversarial perturbations from an input noise (Reddy Mopuri et al., 2018; Omid et al., 2018) or from a given image (Xiao et al., 2018; Bai et al., 2020); such perturbations are added to the original image to craft adversarial examples. Song et al. (2018) propose to search for a noise in the neighborhood of the input noise of a Generative Adversarial Net (GAN) (Goodfellow et al., 2014) such that the corresponding output is an adversarial example, which they denote as an unrestricted adversarial example since their method involves no original image. However, their output is still constrained by the input noise, and the search is time-consuming. In this work, we propose an adversarial generative model called AT-GAN (Adversarial Transfer on Generative Adversarial Net), which aims to learn the distribution of adversarial examples.
Unlike previous works that constrain the adversaries to the neighborhood of an input image or input noise, including the prominent work of Song et al. (2018) that searches over the neighborhood of the input noise of a pre-trained GAN to find a noise whose output image is misclassified by the target classifier, AT-GAN is an adversarial generative model that can produce semantically meaningful adversarial examples directly from any input noise; we call such examples non-constrained adversarial examples. Specifically, we first develop a normal GAN to learn the distribution of benign data so that it can produce plausible images that the classifier and a human oracle classify in the same way. We then transfer the pre-trained GAN into an adversarial GAN, called AT-GAN, that can fool the target classifier while its outputs are still well recognized by the human oracle. AT-GAN is a conditional GAN that has learned to estimate the distribution of adversarial examples for the target classifier, so it can generate adversarial examples directly from any random noise, leading to high diversity and efficiency.

We implement AT-GAN by adopting AC-GAN (Odena et al., 2017) and WGAN-GP (Gulrajani et al., 2017) in the pre-training stage, and then perform transfer learning for adversary generation. We develop AT-GAN on three benchmark datasets, namely MNIST, Fashion-MNIST and CelebA, and apply typical defense methods to compare AT-GAN with existing search-based attacks. Empirical results show that the non-constrained adversarial examples generated by AT-GAN yield higher attack success rates, and that state-of-the-art adversarially trained models exhibit little robustness against AT-GAN, indicating the high diversity of our adversaries. In addition, AT-GAN, as a generation-based adversarial attack, is more efficient than search-based adversarial attacks. Note that any conditional GAN that can craft realistic examples could be used to implement AT-GAN.
As a further demonstration, we adopt StyleGAN2-ada (Karras et al., 2020a) and develop AT-GAN on the CIFAR-10 benchmark dataset using a wide ResNet w32-10 (Zagoruyko & Komodakis, 2016) as the target classifier. Empirical results show that AT-GAN can produce plausible adversarial images and yields higher attack success rates on the adversarially trained models.

2. PRELIMINARIES

In this section, we provide definitions of several types of adversarial examples and adversarial attacks, and give a brief overview of adversarial attacks using GANs. Other related works on typical adversarial attacks and defenses (Goodfellow et al., 2015; Madry et al., 2018; Tramèr et al., 2018), as well as some typical GANs (Goodfellow et al., 2014; Radford et al., 2016; Odena et al., 2017; Arjovsky et al., 2017; Gulrajani et al., 2017), are introduced in Appendix A.

2.1. DEFINITIONS ON ADVERSARIES

Let X be the set of all digital images under consideration for a learning task, Y be the output label space, and p_z be an arbitrary probability distribution (e.g. a Gaussian distribution) on R^m, where m is the dimension of the latent space. A deep learning classifier f : X → Y takes an image x ∈ X and predicts its label f(x). Suppose p_x and p_adv are the distributions of benign images and adversarial examples, respectively. Assume we have an oracle classifier o : X → Y that always predicts the correct label for any image x ∈ X. We define several types of adversarial examples as follows.

For perturbation-based adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2016), tiny perturbations are added to the input images; these perturbations are imperceptible to humans but can cause the target classifier to make wrong predictions.

Definition 1. Perturbation-based Adversarial Examples. Given a subset (trainset or testset) of images T ⊂ X and a small constant ε > 0, the perturbation-based adversarial examples can be defined as:

A_p = {x_adv ∈ X | ∃ x ∈ T, ‖x − x_adv‖_p < ε ∧ f(x_adv) ≠ o(x_adv) ∧ f(x) = o(x)}.
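To make the conditions in Definition 1 concrete, the following toy sketch checks them for a candidate x_adv. The classifier f, the oracle o, and the one-dimensional "images" here are illustrative stand-ins and not part of AT-GAN; in practice the oracle is approximated by human judgment.

```python
import numpy as np

def is_perturbation_adversarial(x, x_adv, f, o, eps, p=np.inf):
    """Check the conditions of Definition 1 on a candidate x_adv."""
    close = np.linalg.norm((x - x_adv).ravel(), ord=p) < eps  # ||x - x_adv||_p < eps
    correct_on_x = f(x) == o(x)    # f agrees with the oracle on the clean image
    fooled = f(x_adv) != o(x_adv)  # f disagrees with the oracle on the perturbed image
    return close and correct_on_x and fooled

# Toy 1-D "images": f thresholds the mean at 0.5, the oracle at 0.6.
f = lambda x: int(x.mean() > 0.5)
o = lambda x: int(x.mean() > 0.6)
x, x_adv = np.array([0.40]), np.array([0.55])

print(is_perturbation_adversarial(x, x_adv, f, o, eps=0.2))  # True: the shift flips f but not o
print(is_perturbation_adversarial(x, x_adv, f, o, eps=0.1))  # False: perturbation exceeds the budget
```

The three boolean terms correspond one-to-one to the conjuncts in the set-builder definition: the p-norm constraint, correctness on the clean input, and the mismatch between f and o on the perturbed input.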

Song et al. (2018) define a new type of adversarial examples, called unrestricted adversarial examples, that is not tied to any subset (trainset or testset) of images: an adversarial perturbation is added to the input noise of a mapping, such as a GAN, so that the output for the perturbed noise is adversarial to the target classifier.

Definition 2. Unrestricted Adversarial Examples. Given a mapping G from z ∼ p_z to G(z, y) ∼ p_θ, where p_θ is an approximated distribution of p_x, and a small constant ε > 0, the unrestricted adversarial examples can be defined as:

A_u = {G(z*, y_s) ∈ X | ∃ z ∼ p_z, z* ∼ p_z, ‖z − z*‖_p < ε ∧ f(G(z*, y_s)) ≠ o(G(z*, y_s)) = y_s ∧ f(G(z, y_s)) = o(G(z, y_s)) = y_s},

where y_s is the source label.
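Analogously, the conditions of Definition 2 can be sketched with a toy generator. The mapping G, the classifier f, and the oracle o below are illustrative scalar stand-ins chosen only to make the checks runnable; they are not the networks used by Song et al. (2018) or AT-GAN.

```python
import numpy as np

def is_unrestricted_adversarial(z, z_star, G, f, o, y_s, eps, p=np.inf):
    """Check the conditions of Definition 2 for a perturbed latent z_star."""
    close = np.linalg.norm((z - z_star).ravel(), ord=p) < eps  # ||z - z*||_p < eps
    benign_ok = f(G(z, y_s)) == o(G(z, y_s)) == y_s            # clean output is classified as y_s by both
    adv = f(G(z_star, y_s)) != o(G(z_star, y_s)) == y_s        # perturbed output fools f but not the oracle
    return close and benign_ok and adv

# Toy setup: scalar latents; G maps (z, y) to a 1-D "image" near the label value.
G = lambda z, y: np.array([y + 0.1 * z.item()])
f = lambda x: int(x.mean() > 0.05)  # target classifier with a tight decision boundary
o = lambda x: int(x.mean() > 0.5)   # oracle with a wider decision region
z, z_star, y_s = np.array([0.2]), np.array([0.8]), 0

print(is_unrestricted_adversarial(z, z_star, G, f, o, y_s, eps=1.0))  # True
```

Note that both the clean latent z and the perturbed latent z* must yield outputs the oracle assigns to the source label y_s; only the target classifier's prediction changes, which is what distinguishes this from a mere generation failure.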

