AT-GAN: AN ADVERSARIAL GENERATIVE MODEL FOR NON-CONSTRAINED ADVERSARIAL EXAMPLES

Anonymous

Abstract

With the rapid development of adversarial machine learning, numerous adversarial attack methods have been proposed. Typical attacks are based on a search in the neighborhood of the input image to generate a perturbed adversarial example. Since 2017, generative models have been adopted for adversarial attacks, and most of them focus on generating adversarial perturbations from input noise or an input image, so their output is restricted by the input. A recent work targets "unrestricted adversarial examples" using a generative model, but their method is based on a search in the neighborhood of the input noise, so their output is in fact still constrained by the input. In this work, we propose AT-GAN (Adversarial Transfer on Generative Adversarial Net) to train an adversarial generative model that can directly produce adversarial examples. Different from previous works, we aim to learn the distribution of adversarial examples so as to generate semantically meaningful adversaries. AT-GAN achieves this goal by first learning a generative model for real data, followed by transfer learning to obtain the desired generative model. Once trained and transferred, AT-GAN can directly and quickly generate adversarial examples for any input noise, which we denote as non-constrained adversarial examples. Extensive experiments and visualizations show that AT-GAN can efficiently generate diverse adversarial examples that are realistic to human perception, and yields higher attack success rates against adversarially trained models.

1. INTRODUCTION

In recent years, Deep Neural Networks (DNNs) have been found vulnerable to adversarial examples (Szegedy et al., 2014), which are well-crafted samples with tiny perturbations that are imperceptible to humans but can fool the learning models. Despite the great success of deep learning empowered applications, many of them are safety-critical, for example under the scenario of self-driving cars (Eykholt et al., 2018; Cao et al., 2019), raising serious concerns in academia and industry. Numerous works have been developed on adversarial attacks (Goodfellow et al., 2015; Carlini & Wagner, 2017; Madry et al., 2018), adversarial defenses (Goodfellow et al., 2015; Kurakin et al., 2017; Song et al., 2019) and exploring the properties of adversarial examples (He et al., 2018; Shamir et al., 2019). For adversarial attacks, most studies focus on perturbation-based adversarial examples constrained by input images, which is also the generally accepted conception of adversarial examples. Generative models have also been adopted recently to generate adversarial perturbations from an input noise (Reddy Mopuri et al., 2018; Omid et al., 2018) or from a given image (Xiao et al., 2018; Bai et al., 2020), and such perturbations are added to the original image to craft adversarial examples. Song et al. (2018) propose to search for a noise in the neighborhood of the input noise of a Generative Adversarial Net (GAN) (Goodfellow et al., 2014) such that the output is an adversarial example, which they denote as an unrestricted adversarial example since there is no original image in their method. However, their output is still constrained by the input noise, and the search is time-consuming. In this work, we propose an adversarial generative model called AT-GAN (Adversarial Transfer on Generative Adversarial Net), which aims to learn the distribution of adversarial examples.
Unlike previous works that constrain the adversaries in the neighborhood of input image or input noise, including the prominent work of Song et al. (2018) that searches over the neighborhood of the input noise of a pre-trained GAN in order to find a noise whose output image is misclassified by the target classifier, AT-GAN is an adversarial generative model that could produce semantically meaningful adversarial examples directly from any input noise, and we call such examples the non-constrained adversarial examples. Specifically, we first develop a normal GAN to learn the distribution of benign data so that it can produce plausible images that the classifier and a human oracle will classify in the same way. Then we transfer the pre-trained GAN into an adversarial GAN called AT-GAN that can fool the target classifier while being still well recognized by the human oracle. AT-GAN is a conditional GAN that has learned to estimate the distribution of adversarial examples for the target classifier, so AT-GAN can directly generate adversarial examples from any random noise, leading to high diversity and efficiency. We implement AT-GAN by adopting AC-GAN (Odena et al., 2017) and WGAN-GP (Gulrajani et al., 2017) in the pre-training stage, then do transfer learning for the adversary generation. Here we develop AT-GAN on three benchmark datasets, namely MNIST, Fashion-MNIST and CelebA, and apply typical defense methods to compare AT-GAN with existing search-based attacks. Empirical results show that the non-constrained adversarial examples generated by AT-GAN yield higher attack success rates, and state-of-the-art adversarially trained models exhibit little robustness against AT-GAN, indicating the high diversity of our adversaries. In addition, AT-GAN, as a generation-based adversarial attack, is more efficient than the search-based adversarial attacks. Note that all conditional GANs that can craft realistic examples could be used for the implementation of AT-GAN. 
For another demonstration, we adopt StyleGAN2-ada (Karras et al., 2020a) and develop AT-GAN on the CIFAR-10 benchmark dataset, using the wide ResNet w32-10 (Zagoruyko & Komodakis, 2016) as the target classifier. Empirical results show that AT-GAN can produce plausible adversarial images and yields higher attack success rates on the adversarially trained models.

2. PRELIMINARIES

In this section, we provide definitions on several types of adversarial examples and adversarial attacks, and give a brief overview of adversarial attacks using GAN. Other related works on typical adversarial attacks and defenses (Goodfellow et al., 2015; Madry et al., 2018; Tramèr et al., 2018) , as well as some typical GANs (Goodfellow et al., 2014; Radford et al., 2016; Odena et al., 2017; Arjovsky et al., 2017; Gulrajani et al., 2017) are introduced in Appendix A.

2.1. DEFINITIONS ON ADVERSARIES

Let X be the set of all digital images under consideration for a learning task, Y be the output label space, and p_z be an arbitrary probability distribution (e.g. a Gaussian distribution) over R^m, where m is the dimension of the noise input. A deep learning classifier f : X → Y takes an image x ∈ X and predicts its label f(x). Suppose p_x and p_adv are the distributions of benign images and adversarial examples, respectively. Assuming an oracle classifier o : X → Y that always predicts the correct label for any image x ∈ X, we define several types of adversarial examples as follows.

For perturbation-based adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2016), tiny perturbations are added to the input images, which are imperceptible to humans but can cause the target classifier to make wrong predictions.

Definition 1. Perturbation-based Adversarial Examples. Given a subset (trainset or testset) of images T ⊂ X and a small constant ε > 0, the perturbation-based adversarial examples can be defined as:

A_p = {x_adv ∈ X | ∃x ∈ T, ‖x − x_adv‖_p < ε ∧ f(x_adv) ≠ o(x_adv) = f(x) = o(x)}.

Song et al. (2018) define a new type of adversarial examples called unrestricted adversarial examples, which is not related to the subset (trainset or testset) of images: they add an adversarial perturbation to the input noise of a mapping, such as a GAN, so that the output for the perturbed noise is an adversary to the target classifier.

Definition 2. Unrestricted Adversarial Examples. Given a mapping G from z ∼ p_z to G(z, y) ∼ p_θ, where p_θ is an approximated distribution of p_x, and a small constant ε > 0, the unrestricted adversarial examples can be defined as:

A_u = {G(z*, y_s) ∈ X | ∃z ∼ p_z, z* ∼ p_z, ‖z − z*‖_p < ε ∧ f(G(z*, y_s)) ≠ o(G(z*, y_s)) = f(G(z, y_s)) = o(G(z, y_s)) = y_s},

where y_s is the source label.
In this work, we train a conditional GAN to learn the distribution of adversarial examples and output the corresponding adversary directly from any input noise. To clarify the difference from Song et al. (2018), we call our generated adversaries non-constrained adversarial examples.

Definition 3. Non-constrained Adversarial Examples. If there is a mapping G* from z ∼ p_z to G*(z, y) ∼ q_θ, where q_θ is an approximated distribution of p_adv, the non-constrained adversarial examples can be defined as:

A_n = {G*(z, y_s) ∈ X | f(G*(z, y_s)) ≠ o(G*(z, y_s)) = y_s},

where y_s is the source label. Here we need to find a mapping G*, e.g. a generative model, such that for z ∼ p_z, G*(z, y) is an image in X and the output distribution is an approximated distribution of p_adv, for example in the sense of the Kullback-Leibler divergence (Kullback & Leibler, 1951): KL(q_θ ‖ p_adv) < ε for a small constant ε.

In summary, perturbation-based adversarial examples are based on perturbing an image x ∈ X, while unrestricted adversarial examples (Song et al., 2018) perturb an input noise z ∼ p_z for an existing mapping G. Most perturbation-based adversarial attacks, as well as Song et al. (2018), fall into the category of search-based adversarial attacks.

Definition 4. Search-based Adversarial Attack. Given an input vector v ∈ V (either a benign image x or a random vector z), the search-based adversarial attack searches for a vector v′ with ‖v − v′‖_p < ε such that v′ leads to an adversarial example for the target classifier.

In contrast, non-constrained adversarial examples are more general: we need to learn a mapping G* such that for any input noise sampled from the distribution p_z, the output is an adversarial image. Such a mapping is called an adversarial generative model, and our method falls into the category of generation-based adversarial attacks.

Definition 5. Generation-based Adversarial Attack. Given an input vector v ∈ V (either a benign image x or a random vector z), the generation-based adversarial attack generates an adversarial perturbation or an adversarial example directly from v, usually adopting generative models.
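To make the definitions above concrete, the membership test of Definition 1 can be sketched in a few lines of plain Python. The 1-D "images", the threshold classifiers standing in for f and the oracle o, and the value of ε are all toy assumptions chosen for illustration, not part of the paper's setup:

```python
def linf(a, b):
    # l_inf distance between two flattened images
    return max(abs(p - q) for p, q in zip(a, b))

def is_perturbation_adv(x, x_adv, f, o, eps):
    # Definition 1: x_adv lies within eps of some benign x, the oracle still
    # assigns both the same label as f(x), but the classifier flips on x_adv.
    return (linf(x, x_adv) < eps
            and o(x_adv) == o(x) == f(x)
            and f(x_adv) != o(x_adv))

# Toy 1-D "images": the oracle thresholds at 0.5, the classifier at 0.45,
# so inputs in (0.45, 0.5) are misclassified relative to the oracle.
oracle = lambda img: int(img[0] > 0.5)
clf    = lambda img: int(img[0] > 0.45)

print(is_perturbation_adv([0.44], [0.47], clf, oracle, eps=0.1))  # True
```

A search-based attack (Definition 4) would look for such an `x_adv` near a given `x`, while a generation-based attack (Definition 5) would output it directly from `v`.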

2.2. GENERATIVE MODELS FOR ADVERSARIAL ATTACK

Generative models have been adopted for adversarial attacks in recent works (Baluja & Fischer, 2017). Reddy Mopuri et al. (2018) propose a Network for Adversary Generation (NAG) that models the distribution of adversarial perturbations for a target classifier, so that NAG can craft adversarial perturbations from any given random noise, which are then added to a natural image to fool the target classifier. Omid et al. (2018) propose to generate universal or image-dependent adversarial perturbations from any given random noise using a U-Net (Ronneberger et al., 2015) or a ResNet generator (He et al., 2016). Xiao et al. (2018) propose AdvGAN, which takes an original image as input and generates an adversarial perturbation for it to craft an adversarial example. Bai et al. (2020) further propose AI-GAN, which adopts projected gradient descent (PGD) (Madry et al., 2018) in the training stage to train a GAN that generates a targeted adversarial perturbation for a given input image and target class. The above attack methods all fall into the generation-based adversarial attack, and their crafted examples fall into the perturbation-based adversarial examples. Another recent work, PS-GAN (Liu et al., 2019), pre-processes an input seed patch (a small image) into an adversarial patch that is added to a natural image to craft an adversarial example, using an attention model to locate the attack area on the natural image. Different from the above methods that generate adversarial perturbations or patches, Song et al. (2018) propose to search for a noise z* around the input noise z of AC-GAN (Odena et al., 2017) such that the corresponding output of AC-GAN is an adversarial example for the target classifier. Their method falls into the search-based adversarial attack, and their crafted examples fall into the unrestricted adversarial examples, as there is no original image in their method.
AT-GAN falls into the generation-based adversarial attack, and the crafted examples fall into the non-constrained adversarial examples. To clearly position our work, we highlight the differences from the most related works as follows.

NAG, AdvGAN and AI-GAN vs. AT-GAN. NAG (Reddy Mopuri et al., 2018), AdvGAN (Xiao et al., 2018) and AI-GAN (Bai et al., 2020) focus on crafting adversarial perturbations by GANs. NAG takes random noise as input and crafts image-agnostic adversarial perturbations. AdvGAN and AI-GAN both take natural images as inputs and generate the corresponding adversarial perturbations.

Song's vs. AT-GAN. Song's method (Song et al., 2018) searches over the neighborhood of the input noise of the pre-trained AC-GAN in order to find a noise whose output image is misclassified by the target classifier. They define such adversaries as unrestricted adversarial examples; however, their adversaries are still constrained by the original input noise. Their method is essentially search-based, while AT-GAN is trained as an adversarial generative model whose output is not constrained to any neighborhood.

3. AT-GAN: AN ADVERSARIAL GENERATIVE MODEL

Here we first introduce the estimation of the distribution of adversarial examples, then propose the AT-GAN framework, a generation-based adversarial attack for crafting non-constrained adversarial examples. We further provide analysis showing that AT-GAN can learn the distribution of adversarial examples.

3.1. ESTIMATING THE ADVERSARIAL DISTRIBUTION

In order to generate non-constrained adversarial examples, we need to estimate the distribution of adversarial examples p_adv(x_adv|y_true), where y_true is the true label. Given a parameterized estimated distribution q_θ(x_adv|y_true), we can define the estimation problem as:

q_θ*(x_adv|y_true) = argmin_{θ∈Ω} KL(q_θ(x_adv|y_true) ‖ p_adv(x_adv|y_true)),   (1)

where θ indicates the trainable parameters and Ω is the parameter space. It is hard to solve equation 1 directly as p_adv(x_adv|y_true) is unknown. Inspired by the perturbation-based adversarial examples, as shown in Figure 1, we postulate that for each adversarial example x_adv, there exist some benign examples x with ‖x − x_adv‖_p < ε. In other words, p_adv(x_adv|y_true) is close to p(x|y_true) to some extent, and we can obtain p(x|y_true) by Bayes' theorem:

p(x|y_true) = p(y_true|x) · p(x) / p(y_true),

where p(y_true|x), p(x) and p(y_true) can be estimated directly from the trainset. Thus, we can approximately solve equation 1 in two stages: 1) fit the distribution p_θ of benign data; 2) transfer p_θ to estimate the distribution q_θ of adversarial examples. Specifically, we propose an adversarial generative model called AT-GAN to learn the distribution of adversarial examples. The overall architecture of AT-GAN is illustrated in Figure 2. Corresponding to the above two stages, we implement AT-GAN by first training a GAN model called AC-WGAN_GP, which combines AC-GAN (Odena et al., 2017) and WGAN_GP (Gulrajani et al., 2017), to obtain a generator G_original that learns p_θ (see Appendix B), and then transferring G_original to attack the target classifier f so as to learn q_θ. We adopt AC-GAN and WGAN-GP for the AT-GAN implementation as they build a powerful generative model on the three evaluated datasets, and Song et al. (2018) also utilize the same combination.
However, AT-GAN is not limited to the above GANs; we also implement AT-GAN using StyleGAN2-ada (Karras et al., 2020a) on a different dataset.
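The Bayes decomposition above can be illustrated numerically: all three factors p(y_true|x), p(x) and p(y_true) can be read off a discrete trainset by counting. The tiny trainset below is a hypothetical stand-in for (image, label) pairs, used only to show the arithmetic:

```python
from collections import Counter

# Hypothetical toy trainset: (image_id, label) pairs standing in for (x, y_true).
trainset = [("a", 0), ("a", 0), ("b", 0), ("b", 1), ("c", 1), ("c", 1)]

n = len(trainset)
cnt_x  = Counter(x for x, _ in trainset)   # counts of each image
cnt_y  = Counter(y for _, y in trainset)   # counts of each label
cnt_xy = Counter(trainset)                 # joint counts

def p_x_given_y(x, y):
    # Bayes: p(x|y) = p(y|x) * p(x) / p(y), each factor read off the trainset.
    p_y_given_x = cnt_xy[(x, y)] / cnt_x[x]
    return p_y_given_x * (cnt_x[x] / n) / (cnt_y[y] / n)

print(p_x_given_y("a", 0))   # "a" accounts for 2 of the 3 label-0 samples
```

In AT-GAN these quantities are of course continuous image densities modeled by the GAN rather than counts, but the decomposition being exploited is the same.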

3.2. TRANSFERRING THE GENERATOR FOR ATTACK

After the original generator G_original is trained, we transfer it to learn the distribution of adversarial examples in order to attack the target model. As illustrated in Figure 2(b), there are three neural networks: the original generator G_original, the attack generator G_attack to be transferred, which is initialized with the weights of G_original, and the classifier f to be attacked. The goal of the second stage can be described as:

G*_attack = argmin_{G_attack} ‖G_original(z, y_s) − G_attack(z, y_s)‖_p   s.t.   f(G_attack(z, y_s)) = y_t ≠ y_s,   (2)

where y_t denotes the target label, ‖·‖_p denotes the ℓ_p norm, and we focus on p = 2 in this work. To optimize equation 2, we construct the loss function from two terms L_1 and L_2, where L_1 aims to assure that f yields the target label y_t, which is fixed for the targeted attack for each category:

L_1 = E_{z∼p_z}[H(f(G_attack(z, y_s)), y_t)].

Here H(·,·) denotes the cross entropy between the two terms and y_s is sampled from Y. L_2 aims to assure that the adversarial generator G_attack generates realistic examples:

L_2 = E_{z∼p_z}[‖G_original(z, y_s) + ρ − G_attack(z, y_s)‖_p].   (3)

Here ρ is a small uniform random noise constrained in both the ℓ_0 and ℓ_∞ norms. We add ρ so that G_attack(z, y_s) is constrained to lie in a neighborhood of G_original(z, y_s) rather than being exactly the same as G_original(z, y_s). The objective function for transferring G_original to G_attack can be formulated as:

L = αL_1 + βL_2,   (4)

where α and β are hyper-parameters to control the training process. Note that in the case that α = 1 and β → ∞, the objective function is similar to that of the perturbation-based attacks (Goodfellow et al., 2015; Tramèr et al., 2018; Madry et al., 2018). For the untargeted attack, we can replace y_t in L_1 with the prediction label of maximum confidence other than y_s, i.e. max_{y≠y_s} f(y|G_attack(z, y_s)).
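A minimal sketch of the combined transfer loss L = αL_1 + βL_2 for a single noise sample is given below. The tiny linear "generators", the 3-class classifier, and all hyper-parameter values are placeholder assumptions for illustration, not the paper's actual architectures:

```python
import math, random

random.seed(0)

def softmax(v):
    m = max(v); e = [math.exp(u - m) for u in v]
    s = sum(e); return [u / s for u in e]

# Hypothetical stand-ins: each "generator" maps (z, y_s) to a flat image,
# and the classifier maps an image to logits over 3 classes.
def g_original(z, y_s): return [0.5 * u + 0.1 * y_s for u in z]
def g_attack(z, y_s):   return [0.5 * u + 0.1 * y_s + 0.02 for u in z]
def classifier(img):    return [sum(img), -sum(img), 0.3]

def at_gan_loss(z, y_s, y_t, alpha=1.0, beta=10.0, rho_bound=0.01):
    x_orig, x_att = g_original(z, y_s), g_attack(z, y_s)
    # L1: cross entropy pushing f(G_attack(z, y_s)) toward the target label y_t
    l1 = -math.log(softmax(classifier(x_att))[y_t])
    # L2: keep G_attack near G_original, up to a small random noise rho
    rho = [random.uniform(-rho_bound, rho_bound) for _ in x_orig]
    l2 = math.sqrt(sum((o + r - a) ** 2 for o, r, a in zip(x_orig, rho, x_att)))
    return alpha * l1 + beta * l2

z = [random.gauss(0, 1) for _ in range(4)]
print(at_gan_loss(z, y_s=0, y_t=2) > 0)   # True: both terms are non-negative
```

In the actual method both terms are expectations over z ∼ p_z and the loss is minimized over the weights of G_attack by gradient descent; here a single sample is evaluated to show how the two terms combine.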

3.3. THEORETICAL ANALYSIS ON AT-GAN

This subsection provides a theoretical analysis of why AT-GAN can generate non-constrained adversarial examples as realistic and diverse as the real data. We prove that, under ideal conditions, AT-GAN can estimate the distribution of adversarial examples, which is close to that of real data. Suppose p_data is the distribution of real data, and p_g and p_a are the distributions learned by the generator of AC-WGAN_GP and by AT-GAN, respectively. In the optimization of equation 4, L_2 aims to constrain the image generated by G_attack to the ε-neighborhood of G_original. We prove that, under the ideal condition that L_2 guarantees G_attack(z, y_s) to be close enough to G_original(z, y_s) for any input noise z, the distribution of AT-GAN almost coincides with the distribution of AC-WGAN_GP. Formally, we state our result for the two distributions as follows.

Theorem 1. Suppose max_{z,y} L_2 < ε. Then KL(p_a ‖ p_g) → 0 as ε → 0.

The proof of Theorem 1 is in Appendix C. Samangouei et al. (2018) prove that the global optimum of WGAN is p_g = p_data, and we show that the optimum of AC-WGAN_GP has the same property. We formalize this property as follows.

Theorem 2. The global minimum of the virtual training of AC-WGAN_GP is achieved if and only if p_g = p_data.

The proof of Theorem 2 is in Appendix C. According to Theorems 1 and 2, under the ideal condition we conclude p_a ≈ p_g = p_data, which indicates that the distribution of non-constrained adversarial examples learned by AT-GAN is very close to that of real data, as discussed in Section 3.1, so that the non-constrained adversarial instances are as realistic and diverse as the real data.
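The intuition behind Theorem 1 can be checked numerically on discrete distributions: perturbing each bin of p by at most ε yields a KL divergence that shrinks toward 0 as ε → 0. The particular distributions below are arbitrary toy choices:

```python
import math

def kl(p, q):
    # KL(P||Q) for discrete distributions given as aligned probability lists
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def perturb(p, eps):
    # Shift eps of mass from the first bin to the second, so |q(x) - p(x)| <= eps
    q = list(p)
    q[0] -= eps
    q[1] += eps
    return q

p = [0.25, 0.25, 0.25, 0.25]
divs = [kl(p, perturb(p, e)) for e in (0.1, 0.01, 0.001)]
print(divs[0] > divs[1] > divs[2])   # True: KL shrinks as eps -> 0
```

This mirrors the statement of Lemma 1 in Appendix C, where the same limit is shown for densities with |q(x) − p(x)| < ε.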

4. EXPERIMENTS

In this section, we provide two implementations of AT-GAN to validate the effectiveness and efficiency of the proposed approach. Empirical experiments demonstrate that AT-GAN yields higher attack success rates against adversarially trained models with higher efficiency. Besides, AT-GAN can learn a distribution of adversarial examples which is close to the real data distribution, and generate realistic and diverse adversarial examples.

4.1. EXPERIMENTAL SETUP

Datasets. We consider four standard datasets: MNIST (LeCun et al., 1989), Fashion-MNIST (Xiao et al., 2017) and CelebA (Liu et al., 2015) for the AT-GAN implementation using AC-GAN (Odena et al., 2017) and WGAN_GP (Gulrajani et al., 2017), and CIFAR-10 (Krizhevsky et al., 2009) for the AT-GAN implementation using StyleGAN2-ada (StyleGAN2 with adaptive discriminator augmentation) (Karras et al., 2020a). MNIST is a dataset of handwritten digits from 0 to 9. Fashion-MNIST is similar to MNIST with 10 categories of fashion clothes. CelebA contains more than 200,000 celebrity faces; we group them into female/male and focus on gender classification as in Song et al. (2018). CIFAR-10 consists of 32 × 32 color images in 10 classes, with 6,000 images per class. For all datasets, we normalize the pixel values into the range [0, 1].

Baselines. We compare AT-GAN with search-based attack methods, including Song's (Song et al., 2018) for unrestricted adversarial examples, as well as FGSM (Goodfellow et al., 2015), PGD (Madry et al., 2018) and R+FGSM (Tramèr et al., 2018) for perturbation-based adversarial examples. Note that although the perturbation-based results are not directly comparable to ours, as they are limited to small perturbations on real images, they provide a good sense of the model robustness.

Models. For MNIST and Fashion-MNIST, we adopt four models used in Tramèr et al. (2018), denoted as Model A to D. For CelebA, we consider three models, i.e. CNN, VGG16 (Simonyan & Zisserman, 2015) and ResNet (He et al., 2016). Details of Model A to D and CNN are described in Table 1; the ResNet is the same as in Song et al. (2018). For CIFAR-10, we adopt the wide ResNet w32-10 (Zagoruyko & Komodakis, 2016). Details about the architectures of AT-GAN are provided in Appendix D.

Defenses. We consider three adversarial training methods for the defense models: adversarial training (Goodfellow et al., 2015), ensemble adversarial training (Tramèr et al., 2018) and iterative adversarial training (Madry et al., 2018).
All experiments are conducted on a single Titan X GPU and the hyper-parameters used for attacks are described in Appendix D.

4.2. EVALUATION RESULTS

For evaluation, we report comparisons on attack success rate and attack efficiency, and visualize some adversarial examples for AT-GAN and the baselines. More evaluation results on transferability, an ablation study, human evaluation, and the attack results on CIFAR-10 are provided in Appendix D.

4.2.1. COMPARISON ON ATTACK SUCCESS RATE

To validate the attack effectiveness, we compare AT-GAN with the baselines under the white-box setting. Since Athalye et al. (2018) show that the currently most effective defense method is adversarial training, we consider adversarially trained models as the defense models. The attack success rates are reported in Table 2. On MNIST, AT-GAN achieves the highest Attack Success Rate (ASR) among the baselines on all defense models. As for normal training, AT-GAN achieves the highest ASR on Model D, and the second highest ASR of over 98% on the other models. On Fashion-MNIST, AT-GAN achieves the highest ASR on average. On CelebA, AT-GAN achieves the highest ASR on almost all the models, with two exceptions under normal training where the results of AT-GAN are close to the highest. In general, AT-GAN achieves an attack success rate above 90% on all the defense models. As AT-GAN aims to estimate the distribution of adversarial examples, adversarial training on some specific attacks exhibits little robustness against AT-GAN, raising a new security issue for the development of more generalized adversarial training models. On MNIST, AT-GAN generates slightly more realistic images than Song's, e.g. "0" and "3". On Fashion-MNIST and CelebA, some adversarial examples generated by Song's method are not as realistic to human perception as those of AT-GAN, for example "t-shirt/top (0)", "sandal (5)" and some facial details. Note that Song's method tends to distort the foreground, which makes the images on MNIST cleaner although some are not realistic, while AT-GAN tends to distort the background. As for the perturbation-based attacks, their adversarial examples are not clear enough, especially on MNIST and Fashion-MNIST, due to the adversarial perturbations. There are also some unnatural samples generated by AT-GAN due to the limitations of GANs, and we hope better generative models can resolve this issue. For the targeted attack, see more examples crafted by AT-GAN in Appendix D.

4.2.2. COMPARISON ON ATTACK EFFICIENCY

In general, AT-GAN can generate realistic and diverse adversarial examples as equation 1 forces the generated non-constrained adversarial examples to be close to the benign examples generated by the original generator.

4.3. VISUALIZATION ON ADVERSARIAL DISTRIBUTION

As discussed in Section 3.3, AT-GAN can learn a distribution of adversarial examples close to the distribution of real image data. To verify this empirically, we randomly choose 5,000 benign images and 5,000 adversarial examples generated by each attack method, and merge these images according to their real labels for MNIST and Fashion-MNIST. Then we use t-SNE (Maaten & Hinton, 2008) on these images to illustrate the distributions in two dimensions. t-SNE models each high-dimensional object in such a way that similar objects are modeled by nearby points and dissimilar objects by distant points with high probability. To further validate that AT-GAN learns a distribution different from that of the original GAN, rather than just adding some constant universal perturbation vector, we illustrate in Appendix E some instances generated by the original generator and AT-GAN for the same input. We find that for different inputs, the original generator outputs different images, and the difference between the instances generated by the original generator and by AT-GAN also varies, indicating that AT-GAN indeed learns a distribution different from that of the original GAN.

APPENDIX

In the appendix, we provide additional related work on gradient-based adversarial attack methods, adversarial training methods and typical generative adversarial nets. Then we describe how to obtain the original generator and provide theoretical analysis, as well as experimental details and additional results. In the end, we visualize the examples generated by original GAN and AT-GAN.

A ADDITIONAL RELATED WORK

A.1 GRADIENT-BASED ATTACKS

Numerous adversarial attacks have been proposed in recent years (Carlini & Wagner, 2017; Liu et al., 2017; Bhagoji et al., 2017; Li et al., 2019). In this part, we introduce three typical adversarial attack methods. The components of all adversarial examples are clipped into [0, 1].

Fast Gradient Sign Method (FGSM). FGSM (Goodfellow et al., 2015) adds a perturbation in the gradient direction of the training loss J with respect to the input x to generate adversarial examples:

x_adv = x + ε · sign(∇_x J(θ, x, y_true)),

where y_true is the true label of the sample x, θ is the model parameter, and ε specifies the ℓ_∞ distortion between x and x_adv.

Projected Gradient Descent (PGD). The PGD adversary (Madry et al., 2018) is a multi-step variant of FGSM, which applies FGSM for k iterations with a step size α:

x_adv_{t+1} = clip(x_adv_t + α · sign(∇_x J(θ, x_adv_t, y_true)), x − ε, x + ε),   x_adv_0 = x,   x_adv = x_adv_k.

Here clip(x′, p, q) forces its input x′ to reside in the range [p, q].

Rand FGSM (R+FGSM). R+FGSM (Tramèr et al., 2018) first applies a small random perturbation with a parameter α (α < ε) to the benign image, then uses FGSM to generate an adversarial example based on the perturbed image:

x_adv = x′ + (ε − α) · sign(∇_{x′} J(θ, x′, y_true)),   where x′ = x + α · sign(N(0, I)).
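The attacks above can be sketched in plain Python against a toy logistic-regression target, whose input-gradient (σ(w·x) − y)·w is available in closed form. The weights and attack budgets below are arbitrary illustrative values, not tied to any model in the paper:

```python
import math

def sigmoid(t): return 1.0 / (1.0 + math.exp(-t))

# Toy target model: logistic regression with fixed (assumed) weights.
w = [2.0, -1.0, 0.5]

def grad_x(x, y_true):
    # Analytic input-gradient of the cross-entropy loss: (sigma(w.x) - y) * w
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y_true) * wi for wi in w]

def sign(v): return [(u > 0) - (u < 0) for u in v]

def fgsm(x, y_true, eps):
    # Single gradient-sign step, clipped into the valid pixel range [0, 1]
    return [min(1.0, max(0.0, xi + eps * s))
            for xi, s in zip(x, sign(grad_x(x, y_true)))]

def pgd(x, y_true, eps, alpha, k):
    x_adv = list(x)
    for _ in range(k):
        x_adv = [xa + alpha * s
                 for xa, s in zip(x_adv, sign(grad_x(x_adv, y_true)))]
        # project back into the eps-ball around x, then into [0, 1]
        x_adv = [min(1.0, max(0.0, min(xi + eps, max(xi - eps, xa))))
                 for xi, xa in zip(x, x_adv)]
    return x_adv

x = [0.2, 0.8, 0.5]
x_adv = pgd(x, y_true=1, eps=0.1, alpha=0.03, k=5)
print(max(abs(a - b) for a, b in zip(x, x_adv)) <= 0.1 + 1e-12)  # True
```

R+FGSM would simply prepend a random sign step of size α to `fgsm` with budget ε − α; it is omitted here for brevity.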

A.2 ADVERSARIAL TRAINING

There are many defense strategies, such as detecting adversarial perturbations (Metzen et al., 2017), obfuscating gradients (Buckman et al., 2018; Guo et al., 2018) and eliminating perturbations (Shen et al., 2017; Liao et al., 2018), among which adversarial training is the most effective (Athalye et al., 2018). We list several adversarial training methods as follows.

Adversarial training. Goodfellow et al. (2015) first introduce the method of adversarial training, where the standard loss function of a neural network f is modified as:

J(θ, x, y_true) = α J_f(θ, x, y_true) + (1 − α) J_f(θ, x_adv, y_true).

Here y_true is the true label of the sample x and θ is the model parameter. The modified objective makes the neural network more robust by penalizing it to account for adversarial samples. During training, the adversarial samples are computed with respect to the current state of the network. Taking FGSM as an example, the loss function could be written as:

J(θ, x, y_true) = α J_f(θ, x, y_true) + (1 − α) J_f(θ, x + ε · sign(∇_x J(θ, x, y_true)), y_true).

Ensemble adversarial training. Tramèr et al. (2018) propose an ensemble adversarial training method, in which the DNN is trained with adversarial examples transferred from a number of fixed pre-trained models.

Iterative adversarial training. Madry et al. (2018) propose to train a DNN with adversarial examples generated by iterative methods such as PGD.
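The modified objective above is a convex combination of the clean and adversarial losses. A minimal numeric sketch, with hypothetical predicted probabilities standing in for the model's outputs on a clean input and its adversarial counterpart:

```python
import math

def bce(p, y):
    # Binary cross-entropy loss for a predicted probability p and label y
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def adv_training_loss(p_clean, p_adv, y_true, alpha=0.5):
    # J = alpha * J_f(x) + (1 - alpha) * J_f(x_adv): the clean term preserves
    # accuracy, the adversarial term penalizes mistakes on x_adv.
    return alpha * bce(p_clean, y_true) + (1 - alpha) * bce(p_adv, y_true)

# Assumed scenario for a label-1 sample: the model is confident on the clean
# input but fooled on its adversarially perturbed counterpart.
print(round(adv_training_loss(0.95, 0.30, 1), 4))
```

When the model also predicts correctly on the adversarial sample, the second term shrinks and the combined loss approaches the clean loss, which is exactly the behavior adversarial training optimizes for.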

A.3 GENERATIVE ADVERSARIAL NET

Generative Adversarial Net (GAN) (Goodfellow et al., 2014) consists of two neural networks, G and D, trained in opposition to each other. The generator G is optimized to estimate the data distribution, while the discriminator D aims to distinguish fake samples from G and real samples from the training data. The objective of D and G can be formalized as a min-max value function V(G, D):

min_G max_D V(G, D) = E_{x∼p_x}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))].

Deep Convolutional Generative Adversarial Net (DCGAN) (Radford et al., 2016) is the convolutional version of GAN, which implements GAN with convolutional networks and stabilizes the training process. Auxiliary Classifier GAN (AC-GAN) (Odena et al., 2017) is another variant that extends GAN with conditions via an extra classifier C. The objective function of AC-GAN can be formalized as:

min_G max_D min_C V(G, D, C) = E_{x∼p_x}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z, y_s)))] + E_{x∼p_x}[log(1 − C(x, y_s))] + E_{z∼p_z}[log(1 − C(G(z, y_s), y_s))].

To make GAN more trainable in practice, Arjovsky et al. (2017) propose Wasserstein GAN (WGAN), which uses the Wasserstein distance so that the loss function has more desirable properties. Gulrajani et al. (2017) introduce WGAN with gradient penalty (WGAN_GP), which outperforms WGAN in practice. Its objective function is formulated as:

min_G max_D V(D, G) = E_{x∼p_x}[D(x)] − E_{z∼p_z}[D(G(z))] − λ E_{x̂∼p_x̂}[(‖∇_x̂ D(x̂)‖_2 − 1)²],

where p_x̂ samples uniformly along straight lines between pairs of points sampled from the data distribution p_x and the generator distribution p_g.
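The gradient-penalty term of WGAN_GP can be illustrated with a critic simple enough to differentiate by hand. The linear critic and the sample points below are assumptions made for this sketch; a real implementation would obtain the gradient via automatic differentiation:

```python
import math, random

random.seed(1)

# Hypothetical linear critic D(x) = w . x: its input-gradient is w at every
# point, so the WGAN_GP penalty has a simple closed form here.
w = [0.8, 0.6, 1.2]

def grad_critic(x_hat):
    return w   # analytic gradient of the linear critic (constant everywhere)

def gradient_penalty(x_real, x_fake, lam=10.0):
    # Sample x_hat uniformly on the line between a real and a fake sample,
    # then penalize the critic's gradient norm for deviating from 1.
    t = random.random()
    x_hat = [t * r + (1 - t) * f for r, f in zip(x_real, x_fake)]
    norm = math.sqrt(sum(g * g for g in grad_critic(x_hat)))
    return lam * (norm - 1.0) ** 2

gp = gradient_penalty([0.9, 0.1, 0.4], [0.2, 0.5, 0.7])
print(gp >= 0.0)   # True: a squared deviation scaled by lambda > 0
```

Because the critic here is linear, the penalty is independent of where x̂ falls on the line; with a nonlinear critic the interpolation point matters, which is why WGAN_GP samples it randomly.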

B TRAINING THE ORIGINAL GENERATOR

Figure 2(a) illustrates the overall architecture of AC-WGAN_GP that we use as the normal GAN. AC-WGAN_GP is the combination of AC-GAN (Odena et al., 2017) and WGAN_GP (Gulrajani et al., 2017), composed of three neural networks: a generator G, a discriminator D and a classifier f. The generator G takes a random noise z and a source label y_s as inputs and generates an image G(z, y_s). It aims to generate an image G(z, y_s) that is indistinguishable to the discriminator D and makes the classifier f output the label y_s. The loss function of G can be formulated as:

L_G = E_{z∼p_z(z)}[H(f(G(z, y_s)), y_s)] − E_{z∼p_z(z)}[D(G(z, y_s))],

where H(a, b) is the cross entropy between a and b. The discriminator D takes the training data x or the generated data G(z, y_s) as input and tries to distinguish them. The loss function of D, with the gradient penalty for samples x̂ ∼ p_x̂, can be formulated as:

L_D = −E_{x∼p_data(x)}[D(x)] + E_{z∼p_z(z)}[D(G(z, y_s))] + λ E_{x̂∼p_x̂(x̂)}[(‖∇_x̂ D(x̂)‖_2 − 1)²].

The classifier f takes the training data x or the generated data G(z, y_s) as input and predicts the corresponding label. Its loss function is:

L_f = E_{x∼p_data(x)}[H(f(x), y_true)] + E_{z∼p_z(z)}[H(f(G(z, y_s)), y_s)].

Different from AC-WGAN_GP, StyleGAN2-ada (Karras et al., 2020a) trains StyleGAN2 (Karras et al., 2020b) with adaptive discriminator augmentation. We obtain the network and weights from Karras et al. (2020a).

C THEORETICAL ANALYSIS OF AT-GAN

In this section, we provide proofs for the theorems in Section 3.3.

Theorem 1. Suppose $\max_{z,y} \mathcal{L}_2 < \epsilon$; then $KL(p_a \| p_g) \to 0$ as $\epsilon \to 0$.

Proof. We first observe that, for a distribution $p(x)$ on a space $\mathcal{X}$, if we construct another distribution $q(x)$ by selecting points $p'(x)$ in the $\epsilon$-neighborhood of $p(x)$ for every $x \in \mathcal{X}$, then, when $p'(x)$ is close enough to $p(x)$, $q(x)$ has almost the same distribution as $p(x)$. Formally, we have the following lemma.

Lemma 1. Given two distributions P and Q with probability density functions $p(x)$ and $q(x)$ on a space $\mathcal{X}$, if there exists a constant $\epsilon$ such that $|q(x) - p(x)| < \epsilon$ for all $x \in \mathcal{X}$, then $KL(P\|Q) \to 0$ as $\epsilon \to 0$.

Proof. Write $q(x) = p(x) + r(x)$, where $|r(x)| < \epsilon$. Then

$$KL(P\|Q) = \int p(x)\log\frac{p(x)}{q(x)}dx = \int p(x)\log p(x)\,dx - \int p(x)\log q(x)\,dx$$
$$= \int (q(x)-r(x))\log p(x)\,dx - \int (q(x)-r(x))\log q(x)\,dx$$
$$= \int q(x)\log p(x)\,dx - \int q(x)\log q(x)\,dx - \int r(x)\log p(x)\,dx + \int r(x)\log q(x)\,dx$$
$$= \int r(x)\log\frac{q(x)}{p(x)}\,dx - KL(Q\|P) \leq \int \epsilon\log\left(1+\frac{\epsilon}{p(x)}\right)dx,$$

where the last step uses $-KL(Q\|P) \leq 0$ and $q(x)/p(x) = 1 + r(x)/p(x) < 1 + \epsilon/p(x)$. Clearly, when $\epsilon \to 0$ we have $\int \epsilon\log(1+\epsilon/p(x))\,dx \to 0$, which means $KL(P\|Q) \to 0$.

Now we get back to Theorem 1. For the two distributions $p_a$ and $p_g$, $\max_{y,z}\mathcal{L}_2 < \epsilon$ indicates that $\forall z \sim p_z$, $|p_a(z,\cdot) - p_g(z,\cdot)| < \epsilon$. According to Lemma 1, we have $KL(p_a\|p_g) \to 0$ as $\epsilon \to 0$. This concludes the proof.

Theorem 2. The global minimum of the virtual training of AC-WGAN_GP is achieved if and only if $p_g = p_{data}$.

Proof. To simplify the analysis, we choose a category y of AC-WGAN_GP and denote by $p_g(x|y)$ and $p_{data}(x|y)$ the distribution that the generator learns and the distribution of the real data, respectively. Then for each category the loss function is equivalent to that of WGAN_GP. We refer to Samangouei et al. (2018) to prove this property.
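Lemma 1 can be sanity-checked numerically. The sketch below uses a hypothetical discrete base distribution and a discrete analogue of the KL integral, and shows the divergence shrinking as the perturbation radius $\epsilon$ shrinks:

```python
import math

def kl(p, q):
    """Discrete KL(P || Q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def perturb(p, eps):
    """Build q with |q_i - p_i| <= eps by alternating +/- eps,
    then renormalize (the alternation already keeps the sum at 1 here)."""
    q = [pi + (eps if i % 2 == 0 else -eps) for i, pi in enumerate(p)]
    s = sum(q)
    return [qi / s for qi in q]

p = [0.1, 0.2, 0.3, 0.4]          # hypothetical base distribution
for eps in (0.05, 0.01, 0.001):   # shrinking neighborhood radius
    print(eps, kl(p, perturb(p, eps)))
```

The printed KL values decrease toward 0 with eps, mirroring the conclusion of the lemma.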
The WGAN_GP min-max loss is given by:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x\sim p_{data}(x)}[D(x)] - \mathbb{E}_{z\sim p_z(z)}[D(G(z))] - \lambda\mathbb{E}_{\hat{x}\sim p_{\hat{x}}(\hat{x})}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$$
$$= \int_x p_{data}(x)D(x)\,dx - \int_z p_z(z)D(G(z))\,dz - \lambda\int_{\hat{x}} p_{\hat{x}}(\hat{x})(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\,d\hat{x}$$
$$= \int_x [p_{data}(x) - p_g(x)]D(x)\,dx - \lambda\int_{\hat{x}} p_{\hat{x}}(\hat{x})(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\,d\hat{x} \quad (5)$$

For a fixed G, the optimal discriminator D that maximizes $V(D, G)$ should be:

$$D^*_G(x) = \begin{cases} 1 & \text{if } p_{data}(x) \geq p_g(x) \\ 0 & \text{otherwise} \end{cases} \quad (6)$$

According to equation 5 and equation 6, and noting that $\nabla_{\hat{x}} D^*_G(\hat{x}) = 0$ almost everywhere so that the penalty term reduces to $\lambda$, we could get:

$$V(D^*_G, G) = \int_x [p_{data}(x) - p_g(x)]D^*_G(x)\,dx - \lambda\int_{\hat{x}} p_{\hat{x}}(\hat{x})(\|\nabla_{\hat{x}} D^*_G(\hat{x})\|_2 - 1)^2\,d\hat{x}$$
$$= \int_{\{x|p_{data}(x)\geq p_g(x)\}} (p_{data}(x) - p_g(x))\,dx - \lambda\int_{\hat{x}} p_{\hat{x}}(\hat{x})\,d\hat{x}$$
$$= \int_{\{x|p_{data}(x)\geq p_g(x)\}} (p_{data}(x) - p_g(x))\,dx - \lambda \quad (7)$$

D.3 ABLATION STUDY

In this subsection, we investigate the impact of using different ρ in the loss function. As ρ can be constrained in both the ℓ_0 and ℓ_∞ norms, we test various bounds for ρ under ℓ_0 and ℓ_∞, respectively, using Model A on the MNIST dataset. We first fix ρ_∞ = 0.5 and try various values for ρ_0, i.e. 0, 100, 200, 300, 400 (the maximum possible value is 784 for a 28×28 input). The attack success rates are in Table 8. We observe that different values of ρ_0 have only a small impact on the attack success rates, and the performances are very close for ρ_0 = 0, 100, 200. Figure 5 further illustrates some generated adversarial examples, among which we can see slight differences. When ρ_0 = 0, AT-GAN tends to change the foreground (body) of the digits. When we increase the value of ρ_0 (100 and 200), AT-GAN is more likely to add tiny noise to the background, and the crafted examples are more realistic to humans (for instance, smoother on digit 4). But if we continue to increase ρ_0 (300 or 400), AT-GAN tends to add more noise and the quality of the generated examples degrades. To obtain a good tradeoff between attack performance and generation quality, we set ρ_0 = 200. We then fix ρ_0 = 200 and test different values for ρ_∞, i.e. 0, 0.1, 0.2, 0.3, 0.4, 0.5 (the maximum possible value is 1). The attack success rates are in Table 9. We observe that different values of ρ_∞ have very little impact on the attack performance. Figure 6 further illustrates some generated adversarial examples, among which we can see that a little more noise is added for bigger ρ_∞, but the differences are very small from ρ_∞ = 0.2 to 0.5. So we simply set ρ_∞ = 0.5 in the experiments, but other values of ρ_∞ (0.2, 0.3, 0.4) also work.

Table 10: The evaluation results on the percentage of realistic images by human evaluation.

The attack success rates are in Table 11. On normally trained models, PGD achieves an attack success rate of 100% while AT-GAN achieves 93.5%. The adversarially trained model, however, exhibits little robustness against AT-GAN, which achieves an attack success rate of 73.0%. In Figure 7, we illustrate some generated adversarial examples on the CIFAR-10 dataset.
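The ℓ_0 and ℓ_∞ budgets on ρ discussed in the ablation above can be illustrated with a small sketch. This is a hypothetical stand-in (the paper enforces the budgets through its loss function, not by explicit projection); it shows one simple way a joint ℓ_0/ℓ_∞ budget can be imposed on a perturbation vector:

```python
def project(rho, l0_budget, linf_budget):
    """Zero all but the l0_budget largest-magnitude entries of rho,
    then clip each surviving entry to [-linf_budget, +linf_budget]."""
    keep = sorted(range(len(rho)), key=lambda i: abs(rho[i]),
                  reverse=True)[:l0_budget]
    out = [0.0] * len(rho)
    for i in keep:
        out[i] = max(-linf_budget, min(linf_budget, rho[i]))
    return out

rho = [0.9, -0.05, 0.6, -0.7, 0.02]  # hypothetical perturbation
print(project(rho, l0_budget=3, linf_budget=0.5))
# keeps the 3 largest-magnitude entries (indices 0, 3, 2),
# each clipped to magnitude <= 0.5
```

With ρ_0 = 0 every entry is zeroed, matching the observation that AT-GAN then changes only the digit body rather than adding background noise.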

E VISUALIZATIONS FOR THE ORIGINAL GAN AND AT-GAN

Here we provide some instances generated by the original GAN and AT-GAN with the same input noise, together with their differences, on MNIST and Fashion-MNIST. The results are depicted in Figures 10 and 11. For different input noises, both the original GAN and AT-GAN output different instances. For each category, given the same input noise, the difference between the original GAN and AT-GAN mainly lies in the main content of the image. For two different input noises, the differences between the original GAN and AT-GAN outputs are not the same as each other, indicating that AT-GAN learns a distribution of adversarial examples distinct from that of the original GAN, rather than simply adding some universal perturbation vector to the original GAN's outputs.



Figure 1: Estimating the distribution of adversarial examples q_θ in two stages: 1) estimate the distribution of benign data p_θ; 2) transfer p_θ to estimate q_θ.

AI-GAN uses adversarial examples generated by PGD for training. In contrast, AT-GAN does not use any natural image as input and generates adversarial examples directly from any random noise. Further, compared with AI-GAN, we do not use any adversarial examples for training.

Figure 2: The architecture of AT-GAN. The first training stage of AT-GAN is the same as AC-WGAN_GP. Once trained, we regard G as the original model G_original and transfer G_original to attack the target classifier, obtaining G_attack. Once the transfer is finished, AT-GAN can generate adversarial examples directly by G_attack.

Adversarial examples for CNN on CelebA.

Figure 3: Adversarial examples generated by various attacks on three datasets (Zoom in for details). The red borders indicate unrealistic adversarial examples generated by Song's method or AT-GAN.

Figure 4: T-SNE visualization for the combination of the test set and adversarial examples generated by various attacks on MNIST (top) and Fashion-MNIST (bottom). For (a), we use the full test set of 10,000 images. For (b) to (f), we use 5,000 images sampled from the test set and 5,000 adversarial examples generated by various attacks. The position of each class is random due to the property of t-SNE.

Attack  | MNIST                            | Fashion-MNIST                  | CelebA                            | Norm
PGD     | ε = 0.3, α = 0.075, epochs = 20  | ε = 0.1, α = 0.01, epochs = 20 | ε = 0.015, α = 0.005, epochs = 20 | ℓ∞
R+FGSM  | ε = 0.3, α = 0.15                | ε = 0.2, α = 0.1               | ε = 0.015, α = 0.003              | ℓ∞
Song's  | λ1 = 100, λ2 = 0, epochs = 200   | λ1 = 100, λ2 = 0, epochs = 200 | λ1 = 100, λ2 = 100, epochs = 200  | N/A
AT-GAN  | α = 2, β = 1, epochs = 100       | α = 2, β = 1, epochs = 100     | α = 3, β = 2, epochs = 200        | N/A

Figure 5: The adversarial examples generated by AT-GAN for various values of ρ_0.

Figure 6: The adversarial examples generated by AT-GAN for various values of ρ_∞.

D.4 HUMAN EVALUATION

To investigate the generating capacity of AT-GAN, we use the same input and randomly pick 100 images for each category of MNIST generated by AT-GAN and the original generator, respectively. We then conduct a human evaluation to determine whether each example is realistic. The evaluation results are in Table 10. We see that adversarial examples in some categories (e.g. 2, 4) are harder to make semantically meaningful than other categories (e.g. 0, 1). On average, however, the generating capability is close to that of the original generator.

D.5 AT-GAN ON CIFAR-10 DATASET

To further demonstrate the flexibility of AT-GAN, we implement AT-GAN on the CIFAR-10 dataset using StyleGAN2-ada (Karras et al., 2020a), a recently proposed conditional GAN. The target classifier is a wide ResNet w32-10 (Zagoruyko & Komodakis, 2016) trained by normal training (Nor.) and iterative adversarial training (Iter.).

Figure 10: The instances generated by the original GAN and AT-GAN with the same input on MNIST. First row: the output of original GAN. Second row: the output of AT-GAN. Third row: The difference between the above two rows.

Figure 11: The instances generated by the original GAN and AT-GAN with the same input on Fashion-MNIST. First row: the output of original GAN. Second row: the output of AT-GAN. Third row: The difference between the above two rows.

Architectures of Model A through D used for MNIST and Fashion-MNIST and CNN used for CelebA. The total number of parameters of each model is provided after the model name.

Attack success rate (ASR, %) of adversarial examples generated by AT-GAN and the baseline attacks against models by normal training and various adversarial training methods. For each model, the highest ASR is highlighted in bold. Notation: Nor. -Normal training, Adv. -Adversarial training, Ens. -Ensemble adversarial training, Iter. -Iterative adversarial training. (a) Comparison of attack success rate on MNIST.

Comparison of the average example generation time, measured by generating 1000 adversarial instances using Model A on MNIST.

In this work, we propose a generation-based adversarial attack method, called AT-GAN (Adversarial Transfer on Generative Adversarial Net), that aims to learn the distribution of adversarial examples for the target classifier. The generated adversaries are "non-constrained" as we perform no search at all in the neighborhood of the input, and once trained, AT-GAN can output adversarial examples directly for any input noise drawn from an arbitrary distribution (e.g. a Gaussian distribution). Extensive experiments and visualizations show that AT-GAN achieves the highest attack success rates against adversarially trained models and can generate diverse and realistic adversarial examples efficiently.

Architecture of WGAN_GP with auxiliary classifier for MNIST and Fashion-MNIST.

Hyper-parameters of different attack methods on MNIST, Fashion-MNIST and CelebA.

Transferability of non-constrained adversarial examples and other search-based adversarial examples on three datasets. For MNIST and Fashion-MNIST, we attack Model C with adversarial examples generated on Model A. For CelebA dataset, we attack VGG16 using adversarial examples generated on CNN. Numbers represent the attack success rate (%).

Attack success rate (ASR, %) of AT-GAN with various values of ρ_0 using Model A on the MNIST dataset.

Attack success rate (ASR, %) of AT-GAN with various values of ρ_∞ using Model A on the MNIST dataset.

Attack success rate (%) of adversarial examples generated by FGSM, PGD and AT-GAN against wide ResNet w32-10 by normal training (Nor.) and iterative adversarial training (Iter.).


Let $X = \{x \,|\, p_{data}(x) \geq p_g(x)\}$. In order to minimize equation 7, we set $p_{data}(x) = p_g(x)$ for any $x \in X$. Then, since both $p_g$ and $p_{data}$ integrate to 1, we get $\int_{X^c} p_g(x)\,dx = \int_{X^c} p_{data}(x)\,dx$. However, this contradicts equation 6, where $p_{data}(x) < p_g(x)$ for $x \in X^c$, unless $\mu(X^c) = 0$, where $\mu$ is the Lebesgue measure. Therefore, for each category we have $p_g(x|y) = p_{data}(x|y)$, which means $p_g(x) = p_{data}(x)$ for AC-WGAN_GP.
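The optimal-discriminator argument above can be sanity-checked on a small discrete grid. The densities below are hypothetical, and the gradient-penalty term is dropped from the comparison (for the indicator discriminator it only contributes the constant $-\lambda$):

```python
# Discrete sanity check of the optimal discriminator
# D*_G(x) = 1 if p_data(x) >= p_g(x) else 0 (equation 6).
p_data = [0.10, 0.40, 0.30, 0.20]   # hypothetical data density on a 4-point grid
p_g    = [0.25, 0.25, 0.25, 0.25]   # hypothetical generator density

def value(D):
    """Discrete analogue of the first term of V(D, G):
    sum over the grid of (p_data - p_g) * D(x)."""
    return sum((pd - pg) * D(i) for i, (pd, pg) in enumerate(zip(p_data, p_g)))

d_star = lambda i: 1.0 if p_data[i] >= p_g[i] else 0.0   # indicator discriminator
d_half = lambda i: 0.5                                    # arbitrary alternative

print(value(d_star), value(d_half))  # D* attains the larger value
```

Here value(d_star) equals the mass of the region where p_data exceeds p_g, exactly the integral appearing in equation 7.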

D ADDITIONAL DETAILS ON EXPERIMENTS

In this section, we provide more details on the experimental setup, report results on transferability, conduct an ablation study on hyper-parameters, investigate the generating capacity by human evaluation, and give details of another implementation of AT-GAN on the CIFAR-10 dataset. In the end, we illustrate some non-constrained adversarial examples generated by AT-GAN on MNIST, Fashion-MNIST and CelebA for the targeted attack.

D.1 MORE EXPERIMENTAL SETUP

We first provide more details on the experimental setup, including the model architectures and attack hyper-parameters.

Model Architectures for AT-GAN. We first describe the neural network architectures used for AT-GAN in the experiments. The abbreviations for components in the networks are described in Table 4. The architecture of AC-WGAN_GP for MNIST and Fashion-MNIST is shown in Table 5, where the generator and discriminator are the same as in Chen et al. (2016), while the architecture of AC-WGAN_GP for CelebA is the same as in Gulrajani et al. (2017) and the architecture of StyleGAN2-ada for CIFAR-10 is the same as in Karras et al. (2020a).

Hyper-parameters for Attacks. The hyper-parameters used in the experiments for each attack method are described in Table 6 for the MNIST, Fashion-MNIST and CelebA datasets. For the CIFAR-10 dataset, we set ε = 0.03 for FGSM; ε = 0.03, α = 0.0075 and epochs = 20 for PGD; and α = 3, β = 2 and epochs = 1,000 for AT-GAN.
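The PGD hyper-parameters (ε, α, epochs) interact as in the following pure-Python 1-D sketch. The quadratic "loss" is a hypothetical stand-in for the classifier loss; the step size α and ℓ∞ radius ε follow the MNIST-style setting ε = 0.3, α = 0.075, epochs = 20:

```python
def grad_loss(x, target=0.0):
    """Gradient of a toy loss L(x) = (x - target)^2 w.r.t. the input x,
    standing in for the classifier's loss gradient."""
    return 2.0 * (x - target)

def pgd(x0, eps=0.3, alpha=0.075, epochs=20):
    """l_inf PGD: take signed-gradient ascent steps of size alpha on the
    loss, projecting back into the eps-ball around x0 after every step."""
    x = x0
    for _ in range(epochs):
        x = x + alpha * (1 if grad_loss(x) > 0 else -1)   # signed gradient step
        x = max(x0 - eps, min(x0 + eps, x))               # project to eps-ball
    return x

x_adv = pgd(1.0)
print(x_adv, abs(x_adv - 1.0) <= 0.3)  # stays within the eps-ball around x0
```

With enough epochs the iterate saturates the ε boundary, which is why ε, not α, controls the final perturbation size.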

D.6 AT-GAN ON TARGET ATTACK

Here we show some non-constrained adversarial examples generated by AT-GAN for the targeted attack. The results are illustrated in Figure 8 for MNIST and Fashion-MNIST, and Figure 9 for CelebA. Instead of adding perturbations to original images, AT-GAN transfers the generative model (GAN) so that the generated adversarial instances are not of the same shape as the initial examples (on the diagonal) generated by the original generator. Note that for CelebA, the targeted adversarial attack is equivalent to the untargeted adversarial attack as it is a binary classification task.

