DISENTANGLING ADVERSARIAL ROBUSTNESS IN DIRECTIONS OF THE DATA MANIFOLD

Abstract

Using generative models (GANs or VAEs) to craft adversarial examples, i.e. generative adversarial examples, has received increasing attention in recent years. Previous studies showed that generative adversarial examples behave differently from regular adversarial examples in many aspects, such as attack rates, perceptibility, and generalization, but the reasons for these differences remain unclear. In this work, we study the theoretical properties of the attacking mechanisms of the two kinds of adversarial examples in a Gaussian mixture model. We prove that adversarial robustness can be disentangled in directions of the data manifold. Specifically, we find that: 1. Regular adversarial examples attack in directions of small variance of the data manifold, while generative adversarial examples attack in directions of large variance. 2. Standard adversarial training increases model robustness by extending the data manifold boundary in directions of small variance, while, on the contrary, adversarial training with generative adversarial examples increases model robustness by extending the data manifold boundary in directions of large variance. In experiments, we demonstrate that these phenomena also exist on real datasets. Finally, we study the robustness trade-off between generative and regular adversarial examples. We show that the conflict between regular and generative adversarial examples is much smaller than the conflict between regular adversarial examples of different norms.

1. INTRODUCTION

In recent years, deep neural networks (DNNs) (Krizhevsky et al. (2012); Hochreiter and Schmidhuber (1997)) have become popular and successful in many machine learning tasks. But DNNs are shown to be vulnerable to adversarial examples (Szegedy et al. (2013); Goodfellow et al. (2014a)): a well-trained model can easily be attacked by adding a small perturbation to the image. An effective way to address this issue is to train the model on training data augmented with adversarial examples, i.e. adversarial training. With the growing success of generative models, researchers have used generative adversarial networks (GAN) (Goodfellow et al. (2014b)) and variational autoencoders (VAE) (Kingma and Welling (2013)) to generate adversarial examples (Xiao et al. (2018); Zhao et al. (2017); Song et al. (2018a); Kos et al. (2018); Song et al. (2018b)) that fool classification models with great success. They found that standard adversarial training cannot defend against these new attacks. Unlike regular adversarial examples, these new adversarial examples are perceptible by humans, but they preserve the semantic information of the original data; a good DNN should be robust to such semantic attacks. Since GANs and VAEs approximate the true data distribution, these adversarial examples stay in the data manifold, and hence they are called on-manifold adversarial examples by Stutz et al. (2019). On the other hand, experimental evidence supports that regular adversarial examples leave the data manifold (Song et al. (2017)); we refer to regular adversarial examples as off-manifold adversarial examples. The concepts of on-manifold and off-manifold adversarial examples are important because they can help us understand the conflict between adversarial robustness and generalization (Stutz et al. (2019); Raghunathan et al. (2019)), which is still an open problem.
In this paper, we study the attacking mechanisms of these two types of examples, as well as the corresponding adversarial training methods. This study, as far as we know, has not been done before. Specifically, we consider a generative attack method that adds a small perturbation in the latent space of a generative model. Since standard adversarial training cannot defend against this attack, we consider training methods that use training data augmented with these on-manifold adversarial examples, which we call latent space adversarial training, and we compare it to standard adversarial training (training with off-manifold adversarial examples).

Contributions: We study the theoretical properties of latent space adversarial training and standard adversarial training in the Gaussian mixture model with a linear generator. We give an excess risk analysis and a saddle point analysis for this model. Based on this case study, we claim that:
• Regular adversarial examples attack in directions of small variance of the data manifold and leave the data manifold.
• Standard adversarial training increases model robustness by amplifying the small variances; hence it extends the boundary of the data manifold in directions of small variance.
• Generative adversarial examples attack in directions of large variance of the data manifold and stay in the data manifold.
• Latent space adversarial training increases model robustness by amplifying the large variances; hence it extends the boundary of the data manifold in directions of large variance.
We provide experiments on MNIST and CIFAR-10 and show that the above phenomena also exist in real datasets.

2. RELATED WORK

Our work is related to attack and defense methods; specifically, we care about attacks and defenses with generative models.

Attack: Adversarial examples for deep neural networks were first introduced in Szegedy et al. (2013). However, adversarial machine learning, or robust machine learning, has been studied for a long time (Biggio and Roli (2018)). Many attacks and defenses have been proposed in the white-box setting (Kurakin et al. (2016; 2017)). However, some of the defenses are shown to be ineffective because of obfuscated gradients (Athalye et al. (2018)). Adaptive attacks (Tramer et al. (2020)) are used for evaluating defenses against adversarial examples.

Defense with generative model: Using generative models to design defense algorithms has been studied extensively. Using a GAN, we can project adversarial examples back to the data manifold (Jalal et al. (2017); Samangouei et al. (2018)). VAEs have also been used to train robust models (Schott et al. (2018)).

3. PROBLEM DESCRIPTION

Original space adversarial training: Consider the classification problem of training a classifier f_θ to map data points x ∈ X ⊂ R^d to labels y ∈ Y, where X and Y are the input data space and the label space, and θ parameterizes the classifier. We assume that the data pairs (x, y) are sampled from a distribution P(X, Y) over X × Y. Standard training solves

min_θ E_{(x,y)∼P} ℓ(f_θ(x), y),

where ℓ(·, ·) is the loss function. The goal of adversarial training is to solve the minimax problem

min_θ E_{(x,y)∼P} max_{‖x′ − x‖ ≤ ε} ℓ(f_θ(x′), y),    (1)

where ε is the perturbation threshold; the norm can be the ℓ_1, ℓ_2, or ℓ_∞ norm (Madry et al. (2017)). The inner maximization problem finds adversarial examples x′ that attack the given classifier f_θ; the outer minimization problem trains the classifier to defend against the given adversarial examples x′. We refer to these attacks as regular attacks, and to these minimax problems as standard adversarial training or original space adversarial training.

Latent space adversarial training: We assume that the data lie in a low dimensional manifold of R^d. Furthermore, we assume the true distribution D is the pushforward of a Gaussian prior z ∼ N(0, I) under G(z), where G : Z → X is a mapping from the latent space Z to the original space X. This is a basic assumption of GAN and VAE. Let I : X → Z be the inverse mapping of G. The goal of latent space adversarial training is to solve the minimax problem

min_θ E_{(x,y)∼P} max_{‖z′ − I(x)‖ ≤ ε} ℓ(f_θ(G(z′)), y).    (2)

Unlike regular attacks, the distance between the original example and the adversarial example in the original space can be large. To preserve the label of the data, we use conditional generative models (e.g. C-GAN (Mirza and Osindero (2014)) and C-VAE (Sohn et al. (2015))), i.e. the generator G_y(z) and inverse mapping I_y(x) are conditioned on the label y, for adversarial training.
We refer to these attacks as generative attacks, and to this adversarial training as latent space adversarial training.

Regular attack algorithms: Two widely used gradient-based attack algorithms for the inner maximization problem in equation (1) are the fast gradient sign method (FGSM) (Goodfellow et al. (2014a)) and projected gradient descent (PGD) (Madry et al. (2017)). Using FGSM, the adversarial example is computed as

x′ = x + ε·sgn(∇_x ℓ(f_θ(x), y)),

where ∇_x denotes the gradient with respect to x. PGD attempts to find a near optimal adversarial example for the inner maximization problem (1) in multiple steps. In the t-th step,

x_{t+1} = Π_{x+S}[x_t + α ∇_x ℓ(f_θ(x_t), y) / ‖∇_x ℓ(f_θ(x_t), y)‖],

where α is the step size and Π_{x+S}[·] is the projection operator onto the constraint set x + S = {x′ | ‖x′ − x‖ ≤ ε}. Throughout the paper, we refer to these as FGSM-attack and PGD-attack, and to the corresponding original space adversarial training as FGSM-adv and PGD-adv. FGSM-attack is a weak attack and PGD-attack is a stronger one. In section 5, we use them to show that a strong original space adversarial training, PGD-adv, does not work well against a weak attack, FGSM-attack, in the latent space. Conversely, latent space adversarial training cannot defend against a simple FGSM-attack in the original space.
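As a concrete reference, the two update rules above can be sketched in a few lines of numpy. The linear logistic classifier, its loss, and all parameter values below are illustrative assumptions (not the models used in our experiments), and the PGD sketch uses a signed-gradient step with an ℓ∞ projection for simplicity:

```python
import numpy as np

def loss_grad(x, y, w):
    """Gradient of the logistic loss log(1 + exp(-y * w.x)) with respect to the input x."""
    return -y * w / (1.0 + np.exp(y * w.dot(x)))

def fgsm(x, y, w, eps):
    """FGSM: a single signed-gradient step of size eps."""
    return x + eps * np.sign(loss_grad(x, y, w))

def pgd(x, y, w, eps, alpha=0.1, steps=10):
    """PGD: iterated gradient ascent, projected back onto an l_inf ball of radius eps."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(loss_grad(x_adv, y, w))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection onto x + S
    return x_adv
```

For a linear classifier both attacks provably increase the loss, which gives a quick sanity check of an implementation.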

Generative attack algorithm

In our experiments, we use FGSM in the latent space for the inner maximization problem in equation (2):

z′ = I(x) + ε·sgn(∇_z ℓ(f_θ(G(z)), y)).

Because of the mode collapse issue of GAN (Salimans et al. (2016); Gulrajani et al. (2017)), adding a small perturbation in the latent space of a GAN may output the same image. Thus we use a VAE in our experiments. We refer to this generative attack and the corresponding latent space adversarial training as VAE-attack and VAE-adv.
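A minimal numpy sketch of this latent space FGSM step, assuming a linear generator G(z) = Wz + µ (the P-PCA generator of section 4) with posterior-mean encoder I(x) = P⁻¹Wᵀ(x − µ), and a hypothetical linear logistic classifier w; the real experiments use a VAE instead:

```python
import numpy as np

def vae_attack(x, y, w, W, mu, sigma2, eps):
    """One FGSM step in the latent space of the linear generator G(z) = W z + mu.

    Encoder: posterior mean z = P^{-1} W^T (x - mu), with P = W^T W + sigma2 I.
    The classifier is a linear logistic model with weights w (an illustrative choice).
    """
    P = W.T @ W + sigma2 * np.eye(W.shape[1])
    z = np.linalg.solve(P, W.T @ (x - mu))        # encode: I(x)
    x_rec = W @ z + mu                            # decode without perturbation: G(z)
    grad_x = -y * w / (1.0 + np.exp(y * w.dot(x_rec)))
    grad_z = W.T @ grad_x                         # chain rule through the linear generator
    return W @ (z + eps * np.sign(grad_z)) + mu   # decode the perturbed latent code
```

Calling the function with eps = 0 returns the plain reconstruction G(I(x)), which is convenient for measuring how much the latent perturbation alone increases the loss.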

4. THEORETICAL ANALYSIS

In this section, we study the difference between adversarial training in the latent space and in the original space. We study the simple binary classification setting proposed by Ilyas et al. (2019). The main reason for using this model is that we can find the optimal closed-form solution, which gives us insight into adversarial training; for more complex models, we can only solve the problem numerically. All proofs of the lemmas and theorems are given in appendix A.

4.1. THEORETICAL MODEL SETUP

Gaussian mixture (GM) model: Assume that data points (x, y) are sampled according to y ∼ {−1, 1} uniformly and x ∼ N(yµ*, Σ*), where µ* and Σ* denote the true mean and covariance matrix of the data distribution. For the data in the class y = −1, we replace x by −x; then we can view the whole dataset as sampled from D = N(µ*, Σ*).

Classifier: The goal of standard training is to learn the parameters Θ = (µ, Σ) such that

Θ = arg min_{µ,Σ} L(µ, Σ) = arg min_{µ,Σ} E_{x∼D}[ℓ(x; µ, Σ)],    (3)

where ℓ(·) represents the negative log-likelihood function. The goal of adversarial training is to find

Θ_r = arg min_{µ,Σ} L_r(µ, Σ) = arg min_{µ,Σ} E_{x∼D}[max_{‖x′ − x‖ ≤ ε} ℓ(x′; µ, Σ)].    (4)

We use L and L_r to denote the standard loss and the adversarial loss. After training, we classify a new data point x to the class sgn(µ^TΣ^{-1}x).

Generative model: In our theoretical study, we use a linear generative model, namely probabilistic principal components analysis (P-PCA) (Tipping and Bishop (1999)). P-PCA can be viewed as a linear GAN (Feizi et al. (2017); Feizi et al. (2020)) or a linear VAE (Dai et al. (2017)). Given a dataset {x_i}_{i=1}^n ⊂ R^d, let µ and S be the sample mean and sample covariance matrix, and let S = UΛU^T be the eigenvalue decomposition of S; using the first q eigenvectors, we can project the data to a low dimensional space. P-PCA assumes that the data are generated by x = Wz + µ + ϵ, where z ∼ N(0, I), ϵ ∼ N(0, σ²I), z ∈ R^q and W ∈ R^{d×q}. Then we have x ∼ N(µ, WW^T + σ²I), x|z ∼ N(Wz + µ, σ²I) and z|x ∼ N(P^{-1}W^T(x − µ), σ²P^{-1}), where P = W^TW + σ²I. The maximum likelihood estimators of W and σ² are

W_ML = U_q(Λ_q − σ²_ML I)^{1/2},  σ²_ML = (1/(d − q)) Σ_{i=q+1}^d λ_i,

where U_q is the matrix of the first q columns of U and Λ_q is the diagonal matrix of the first q eigenvalues in Λ. In the following study, we assume that n is large enough that we can learn the true µ* and Σ*. Thus we have S = Σ*, U_q = U_{q*}, Λ_q = Λ_{q*} for the generative model.
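The maximum likelihood P-PCA fit can be sketched directly from the eigenvalue decomposition of the sample covariance; the matrix sizes in the usage note are illustrative assumptions.

```python
import numpy as np

def ppca_fit(S, q):
    """Maximum likelihood P-PCA parameters from a d x d sample covariance S:
    sigma2 is the mean of the d - q discarded eigenvalues and
    W = U_q (Lambda_q - sigma2 I)^{1/2}."""
    lam, U = np.linalg.eigh(S)      # eigh returns eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]  # sort descending
    d = S.shape[0]
    sigma2 = lam[q:].mean() if q < d else 0.0
    W = U[:, :q] @ np.diag(np.sqrt(lam[:q] - sigma2))
    return W, sigma2
```

The implied model covariance WWᵀ + σ²I keeps the top q eigenvalues of S exactly and replaces the discarded ones by their mean σ², so its trace matches that of S.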

4.2. MINIMAX PROBLEM OF LATENT SPACE ADVERSARIAL TRAINING

To perturb the data in the latent space, the data go through the encode-decode process x → z → z + ∆z → x′. Based on the probabilistic model, we may choose the z with the highest probability or sample it from the learned distribution; hence we could have different strategies. Below we list 3 strategies. Strategy 1 is used in practice and the other two are alternative choices. In Lemma 1 we show that these strategies are equivalent under the low dimensional assumption, so we do not need to worry about the effect of the sampling strategy.

Strategy 1: Sample x ∼ D, encode z = arg max_z q(z|x) = P^{-1}W^T(x − µ*), add a perturbation ∆z, and finally decode x_adv = arg max_x p(x|z + ∆z) = W(z + ∆z) + µ*.

Strategy 2: Sample x ∼ D, then sample z ∼ q(z|x), add a perturbation ∆z, and finally sample x_adv ∼ p(x|z + ∆z).

Strategy 3: Sample z ∼ N(0, I), add a perturbation ∆z, and then sample x_adv ∼ p(x|z + ∆z). In this strategy, x_adv can be viewed as the adversarial example of x = arg max_x q(z|x).

The following lemma shows that the adversarial examples can be unified in one formula, so the sampling strategy does not affect our analysis.

Lemma 1 (Adversarial examples perturbed in the latent space). Using these 3 strategies, the adversarial examples can be unified as x_adv = x′ + W∆z with x′ ∼ D_j = N(µ*, U*Λ^{(j)}U*^T), j = 1, 2, 3, where

Λ^{(1)} = diag((Λ_q − σ²I)²Λ_q^{-1}, 0),
Λ^{(2)} = diag((Λ_q − σ²I)²Λ_q^{-1} + σ²(Λ_q − σ²I)Λ_q^{-1} + σ²I, σ²I),
Λ^{(3)} = diag(Λ_q, σ²I).

If the data lie in a q dimensional subspace, i.e. the covariance matrix Σ* has rank q, we have Λ^{(1)} = Λ^{(2)} = Λ^{(3)} = Λ* and hence D_j = D.

In general, the adversarial example can be decomposed into two parts: the change of distribution x′ ∼ D_j and the small perturbation W∆z. Therefore the adversarial expected risk can be written as the following minimax problem:

min_{µ,Σ} L_ls(µ, Σ; D_j) = min_{µ,Σ} E_{x′∼D_j} max_{‖∆z‖ ≤ ε} ℓ(x′ + W∆z; µ, Σ), j = 1, 2, 3.    (6)
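The rank-q case of Lemma 1 can be checked numerically for Strategy 1: with σ² = 0, the encode-perturb-decode pipeline reproduces exactly x_adv = x + W∆z. The dimensions and eigenvalues below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
d, q = 5, 2
U = np.linalg.qr(rng.normal(size=(d, d)))[0]           # orthonormal eigenbasis
W = U[:, :q] @ np.diag(np.sqrt(np.array([3.0, 1.5])))  # sigma2 = 0 in the rank-q case
mu = rng.normal(size=d)

# A sample on the q-dimensional manifold: x = mu + W z0.
z0 = rng.normal(size=q)
x = mu + W @ z0

# Strategy 1: encode (P = W^T W since sigma2 = 0), perturb, decode.
z = np.linalg.solve(W.T @ W, W.T @ (x - mu))
dz = rng.normal(size=q)
x_adv = W @ (z + dz) + mu

# Lemma 1: the decoded adversarial example is exactly x + W dz.
assert np.allclose(x_adv, x + W @ dz)
```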
We aim to analyze the differences between the minimax problems in equations (4) and (6). We give the excess risk and optimal saddle point analysis in the following subsections.

4.3. EXCESS RISK ANALYSIS

We consider the difference between L_ls and L at the true Θ*, i.e. L_ls(Θ*; D_j) − L(Θ*; D), which characterizes the excess risk incurred by the optimal perturbation. To derive the expression of the excess risk, we decompose it into two parts:

L_ls(Θ*; D_j) − L(Θ*; D) = [L_ls(Θ*; D_j) − L(Θ*; D_j)] (perturbation) + [L(Θ*; D_j) − L(Θ*; D)] (change of distribution).    (7)

To simplify the notation, we consider the Lagrange penalty form of the inner maximization problem in equation (6), i.e. max_{∆z} ℓ(x′ + W∆z; µ, Σ) − L‖∆z‖²/2, where L is the Lagrange multiplier. The following theorem gives the solution in the general case.

Theorem 2 (Excess risk). Let L_ls and L be the loss with and without perturbation in the latent space (equations (6) and (3) respectively), given the non-robustly learned Θ* = (µ*, Σ*). The excess risk caused by the perturbation is

L_ls(Θ*, D_j) − L(Θ*, D_j) = (1/2) Σ_{i=1}^q [(1 + (λ_i − σ²)/((L − 1)λ_i + σ²))² − 1] λ^{(j)}_i/λ_i, j = 1, 2, 3,

and the excess risk caused by the change of distribution is

L(Θ*, D_j) − L(Θ*, D) = (1/2) log(Π_{i=1}^d λ^{(j)}_i / Π_{i=1}^d λ_i) + (1/2)(Σ_{i=1}^d λ^{(j)}_i/λ_i − d).

It is hard to see which part dominates the excess risk. If we further assume that the data lie in a q dimensional manifold, the excess risk caused by the change of distribution becomes 0 by Lemma 1, and we have the following corollary.

Corollary 3 (Excess risk). Let L_ls and L be the loss with and without perturbation in the latent space (equations (6) and (3) respectively), given the non-robustly learned Θ* = (µ*, Σ*), and rank(Σ*) = q. Then the excess risk is L_ls(Θ*, D_j) − L(Θ*, D) = O(qL^{-2}).

The optimal perturbation in the latent space incurs an excess risk of O(qL^{-2}). The adversarial vulnerability depends on the dimension q and the Lagrange multiplier L; it does not depend on the shape of the data manifold.
This is because the perturbation constraint (the black block) aligns with the shape of the data manifold (the ellipse), as we illustrate in Figure 1(c). Thus generative attacks focus on the directions of the largest q variances. Next, we analyze the excess risk of original space adversarial training. Since the perturbation thresholds ε are on different scales in the original space attack and the latent space attack, the corresponding Lagrange multipliers differ; we use L′ for original space adversarial training in the following theorem.

Theorem 4 (Excess risk of original space adversarial training). Let L_r and L be the loss with and without perturbation in the original space (equations (4) and (3) respectively), given the non-robustly learned Θ* = (µ*, Σ*). Let λ_min denote the smallest eigenvalue of Σ*. The excess risk satisfies

Ω((λ_min L′)^{-2}) ≤ L_r(Θ*, D) − L(Θ*, D) ≤ O(d(λ_min L′)^{-2}).

If the data lie in a low dimensional manifold, i.e. λ_min = 0, the excess risk equals +∞.

The optimal perturbation in the original space incurs an excess risk of O(d(λ_min L′)^{-2}). The adversarial vulnerability depends on the smallest eigenvalue λ_min, the dimension d, and the Lagrange multiplier L′. The dependence on λ_min comes from the misalignment between the perturbation constraint (the black block) and the shape of the data manifold (the ellipse), as we illustrate in Figure 1(a). Notice that λ_min also appears in the lower bound; hence the excess risk equals +∞ when λ_min = 0. Thus regular attacks focus on the directions of small variance. In particular, when λ_min = 0, regular adversarial examples leave the data manifold.

4.4. SADDLE POINT ANALYSIS

In this subsection we study the optimal solution of the optimization problem (6). Since it is not a standard minimax problem, we consider a modified problem:

min_{µ,Σ} max_{E‖∆z‖ = ε} E_{x′∼D_j} ℓ(x′ + W∆z; µ, Σ), j = 1, 2, 3.    (8)

We explain the connection between the optimization problems in equations (6) and (8) in appendix A. The following theorem is our main result; it gives the optimal solution of latent space adversarial training.

Theorem 5 (Main result: optimal saddle point). The optimal solution of the modified problem in equation (8) is µ_ls = µ* and Σ_ls = U*Λ_lsU*^T, where

λ^ls_i = (1/4)[2λ^{(j)}_i + 4(λ_i − σ²)/L + 2λ^{(j)}_i √(1 + 4(λ_i − σ²)/(λ^{(j)}_i L))] for 1 ≤ i ≤ q,

λ^ls_i = λ^{(j)}_i for i > q, and j = 1, 2, 3 corresponding to strategies 1, 2 and 3.

Assume again that the data lie in a q-dimensional manifold. Then λ^ls_i/λ_i = 1/2 + 1/L + √(1/4 + 1/L) ≥ 1 for i ≤ q and λ^ls_i = λ_i = 0 for i > q. Latent space adversarial training increases the model robustness by amplifying the large eigenvalues of the data manifold. The two dimensional case is illustrated in Figure 1(c) and (d). In the same setting, the optimal solution of standard adversarial training (problem (4)) in the original space is given in Theorem 6 (which is Theorem 2 in Ilyas et al. (2019)).

Theorem 6 (Optimal saddle point, Ilyas et al. (2019)). The optimal solution of the problem in equation (4) is µ_r = µ* and

Σ_r = (1/2)Σ* + (1/L)I + ((1/L)Σ* + (1/4)Σ*²)^{1/2}.

Theorem 6 is for the problem in which the covariance matrix is restricted to be diagonal. Consider the ratio

λ^{(r)}_i/λ_i = 1/2 + 1/(Lλ_i) + √(1/(Lλ_i) + 1/4).

For a small true eigenvalue λ_i, the ratio is large. Standard adversarial training increases robustness by amplifying the small eigenvalues of the data manifold. The two dimensional case is illustrated in Figure 1(a) and (b). We show all the eigenvalues in the first column of Figure 2.
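The two amplification ratios can be compared with a small numeric sketch (Theorem 5 in the rank-q case, where the ratio is a constant, and Theorem 6, where it blows up for small eigenvalues); the eigenvalues and multiplier below are arbitrary illustrative values.

```python
import numpy as np

def ratio_latent(L):
    """lambda_ls / lambda for i <= q (Theorem 5, rank-q case): independent of lambda."""
    return 0.5 + 1.0 / L + np.sqrt(0.25 + 1.0 / L)

def ratio_original(lam, L):
    """lambda_r / lambda (Theorem 6): grows as lambda shrinks."""
    return 0.5 + 1.0 / (L * lam) + np.sqrt(1.0 / (L * lam) + 0.25)

L = 5.0
lam = np.array([10.0, 1.0, 0.01])   # large, medium, small variance directions
r_lat = ratio_latent(L)             # same amplification in every manifold direction
r_orig = ratio_original(lam, L)     # amplification concentrated on small eigenvalues
```

With these values the latent space ratio is a single constant slightly above 1, while the original space ratio is near 1 in the large variance direction and very large in the small variance direction, matching the disentanglement picture.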
After adding a small perturbation in the original space or in the latent space, the distributions of D^r_test and D^ls_test are close to D_test. In Figure 2, the first column shows all 784 eigenvalues of the datasets, the second column shows the large eigenvalues (the first 30), and the last column shows the small eigenvalues (the last 754).

Original space adversarial training focuses on the small variance directions: We adversarially train a robust classifier f_r(x) in the original space. The robust test set D^r_test against f_r(x) amplifies the small eigenvalues considerably: in the third column, the orange line is significantly larger than the other two lines, while in the second column, the orange line is below the other two. These experiments give us an understanding of how adversarial examples leave the data manifold. The original dataset D_test lies in a low dimensional affine plane in R^784; after adversarial training, the data move towards the small variance directions, so D^r_test occupies a full dimensional subspace.

Latent space adversarial training focuses on the large variance directions: We adversarially train a robust classifier f_ls(x) in the latent space. The latent space robust test set D^ls_test against f_ls(x) amplifies the large eigenvalues: in the second column, the green line is above the other two lines, and in the last column, the green line shows that adversarial training in the latent space does not affect the small eigenvalues. The adversarial examples in D^ls_test move along the large variance directions; therefore, D^ls_test stays in the low dimensional affine space.

5.2. ROBUST TEST ACCURACIES

In this subsection, we compare the test accuracies under different attacks (FGSM, PGD, VAE) versus different defenses (adversarial training with FGSM, PGD, and VAE) to help explain our theoretical results. We explain the results on MNIST in Table 1 as an example.

On-manifold and off-manifold adversarial examples

The test accuracies of the standard training model on PGD-attack and VAE-attack data are 3.9% and 42.4% respectively. First, they show that on-manifold adversarial examples exist. Second, we can see that on-manifold adversarial examples are harder to find.

Attack versus defense: Our theory tells us that original or latent space adversarial training increases model robustness by amplifying the small or large eigenvalues, i.e. the variances of the distribution, in the data manifold. In experiments, the test accuracy of VAE-adv vs PGD-attack is 1.23%, which shows that extending the manifold boundary in directions of large variance contributes nothing to defending against attacks in directions of small variance. Similarly, the test accuracy of PGD-adv vs VAE-attack is 52.18%, which shows that extending the boundary in directions of small variance does not work well for defending against attacks in directions of large variance. Furthermore, we can see that latent space adversarial training does not work well against a simple FGSM-attack, and vice versa. We see that PGD-adv can increase the test accuracy on VAE-attack from 42% to 52% on MNIST and from 19.40% to 26.31% on CIFAR-10. A possible reason is that original space adversarial training amplifies the small eigenvalues of the first q dimensions but fails to amplify the larger ones. As indicated in Figure 2, both original space and latent space adversarial training increase the eigenvalues of the covariance matrix which are small but not equal to zero (around the 100th eigenvalue).

5.3. ROBUSTNESS TRADE-OFF

Robustness trade-offs are common in practice (Su et al.).

Table 2: Comparison of the robustness trade-off between ℓ_1 and ℓ_∞ attacks with the trade-off between original space and latent space attacks on MNIST and CIFAR-10.

Our theory suggests that adversarial robustness can be disentangled in different directions; hence adversarial robustness against attacks in directions of small variance and of large variance can be obtained simultaneously. On MNIST, the jointly trained model decreases the test accuracies by 6% compared to the singly trained models. The conflict between on-manifold and off-manifold adversarial examples is much smaller than the conflict between off-manifold adversarial examples of different norms. On CIFAR-10, the jointly trained model gets test accuracies of 74% on PGD-attack and 42% on VAE-attack, which exhibits nearly no robustness trade-off. Therefore, if our goal is to defend against a mixture of regular and generative attacks, the jointly trained model performs well.

Discussion: Under the q-dimensional manifold assumption, there is an overlap between the directions of the q largest variances and the directions of small variance. This is supported by Theorem 5, Theorem 6, and the first experiment. A possible reason for the robustness trade-off between regular and generative attacks is that they conflict with each other in the overlapping directions. The conflict is not serious because they mainly focus on different directions.

6. CONCLUSION

In this paper, we show that adversarial robustness can be disentangled in directions of small variance and large variance of the data manifold. Theoretically, we study the excess risk and the optimal saddle point of the minimax problem of latent space adversarial training. Experimentally, we show that these phenomena also exist in real datasets.

Future work: One may design defense algorithms based on this property. We can generate adversarial examples based on the directions of the dataset without access to the model architecture. In white-box settings, we can use them for data augmentation to accelerate adversarial training or even increase model robustness; however, we need to carefully compare the computational cost of calculating the eigenvalue decomposition (EVD) of the dataset with that of standard adversarial training. In black-box attacks, we can use them to query the target model, but the data manifold has to be transferred from some known dataset, since no information about the training dataset of the target model is available. Theoretically, it is unclear whether we can find closed-form solutions for nonlinear models; how to analyze the nonlinear case is an open problem. Using linear models, we can further analyze the conflict between robustness and generalization.

OVERVIEW

In appendix A, we provide the proofs of the theorems. In appendix B, we describe the experimental settings. Appendix C is a further discussion of data augmentation using generative models; it is not closely related to the main paper.

A PROOF OF THE THEOREMS

A.1 PROBLEM DESCRIPTION

Lemma 1 (Adversarial examples perturbed in the latent space). Using these 3 strategies, the adversarial examples can be unified as x_adv = x′ + W∆z with x′ ∼ D_j = N(µ*, U*Λ^{(j)}U*^T), j = 1, 2, 3, where

Λ^{(1)} = diag((Λ_q − σ²I)²Λ_q^{-1}, 0),
Λ^{(2)} = diag((Λ_q − σ²I)²Λ_q^{-1} + σ²(Λ_q − σ²I)Λ_q^{-1} + σ²I, σ²I),
Λ^{(3)} = diag(Λ_q, σ²I).

If the data lie in a q dimensional subspace, i.e. the covariance matrix Σ* has rank q, we have Λ^{(1)} = Λ^{(2)} = Λ^{(3)} = Λ* and hence D_j = D.

Proof: Recall that x ∼ N(µ, WW^T + σ²I), x|z ∼ N(Wz + µ, σ²I) and z|x ∼ N(P^{-1}W^T(x − µ), σ²P^{-1}), where P = W^TW + σ²I, and that the maximum likelihood estimators of W and σ² are W_ML = U_q(Λ_q − σ²_ML I)^{1/2} and σ²_ML = (1/(d − q)) Σ_{i=q+1}^d λ_i. In particular, P = Λ_q and WP^{-1}W^T = U_q(Λ_q − σ²I)Λ_q^{-1}U_q^T.

Strategy 1: Sample x ∼ D, encode z = arg max_z q(z|x) = P^{-1}W^T(x − µ*), add a perturbation ∆z, and finally decode x_adv = arg max_x p(x|z + ∆z) = W(z + ∆z) + µ*. Then

x_adv = W(P^{-1}W^T(x − µ*) + ∆z) + µ* = WP^{-1}W^T(x − µ*) + µ* + W∆z = x′ + W∆z.

Since x − µ* ∼ N(0, Σ*), we have x′ ∼ N(µ*, WP^{-1}W^T Σ* (WP^{-1}W^T)^T), with

WP^{-1}W^T Σ* (WP^{-1}W^T)^T = U_q(Λ_q − σ²I)Λ_q^{-1} · Λ_q · Λ_q^{-1}(Λ_q − σ²I)U_q^T = U* diag((Λ_q − σ²I)²Λ_q^{-1}, 0) U*^T = U*Λ^{(1)}U*^T.

Strategy 2: Sample x ∼ D, then sample z ∼ q(z|x), add a perturbation ∆z, and finally sample x_adv ∼ p(x|z + ∆z). Then z ∼ N(0, P^{-1}W^TΣ*(P^{-1}W^T)^T + σ²P^{-1}) and x_adv = x′ + W∆z with

x′ ∼ N(µ*, WP^{-1}W^TΣ*(P^{-1}W^T)^TW^T + σ²WP^{-1}W^T + σ²I),

where

WP^{-1}W^TΣ*(P^{-1}W^T)^TW^T + σ²WP^{-1}W^T + σ²I = U* diag((Λ_q − σ²I)²Λ_q^{-1} + σ²(Λ_q − σ²I)Λ_q^{-1} + σ²I, σ²I) U*^T = U*Λ^{(2)}U*^T.

Strategy 3: Sample z ∼ N(0, I), add a perturbation ∆z, and then sample x_adv ∼ p(x|z + ∆z). In this strategy, x_adv can be viewed as the adversarial example of x = arg max_x q(z|x). Here x_adv ∼ N(µ* + W∆z, WW^T + σ²I), with

WW^T + σ²I = U* diag(Λ_q, σ²I) U*^T = U*Λ^{(3)}U*^T.

In these 3 strategies, the adversarial examples can thus be summarized as x_adv = x′ + W∆z with x′ ∼ D_j, j = 1, 2, 3, corresponding to strategies 1, 2 and 3. If the data lie in a low dimensional space, i.e. the covariance matrix Σ* has rank q, then σ²_ML = Σ_{i=q+1}^d λ_i/(d − q) = 0, so Λ^{(1)} = Λ^{(2)} = Λ^{(3)} = diag(Λ_q, 0) = Λ*. There is no difference among the 3 strategies and there is no change of distribution, i.e. D_j = D. ∎

A.1.1 EXCESS RISK ANALYSIS

Before we prove Theorem 2, we need the following lemma.

Lemma 2 (Optimal perturbation). Given Θ = (µ, Σ), the optimal solution of the inner maximization problem in equation (6) is

∆z* = W^T(LΣ − WW^T)^{-1}(x − µ),

where L is the Lagrange multiplier satisfying ‖∆z*‖ = ε.

Proof: Consider the problem max_{‖∆z‖ ≤ ε} ℓ(x + W∆z; µ, Σ). The Lagrangian is

ℓ(x + W∆z; µ, Σ) − (L/2)(‖∆z‖² − ε²) = (d/2) log(2π) + (1/2) log|Σ| + (1/2)(x − µ + W∆z)^TΣ^{-1}(x − µ + W∆z) − (L/2)(‖∆z‖² − ε²).

Notice that this quadratic objective is concave when L is larger than the largest eigenvalue of W^TΣ^{-1}W. Setting the partial derivative with respect to ∆z to zero, we have

W^TΣ^{-1}(x − µ + W∆z*) − L∆z* = 0
⇔ (LI − W^TΣ^{-1}W)∆z* = W^TΣ^{-1}(x − µ)
⇔ ∆z* = (LI − W^TΣ^{-1}W)^{-1}W^TΣ^{-1}(x − µ)
⇔ ∆z* = W^T(LΣ − WW^T)^{-1}(x − µ).

The last equivalence follows from the Woodbury matrix inversion lemma. We can obtain L by solving the equation ‖∆z*‖ = ε. We do not have a closed-form solution for L, but we can solve for it numerically; L → ∞ as ε → 0. For our theory we only need that L is a constant. ∎

Theorem 2 (Excess risk). Let L_ls and L be the loss with and without perturbation in the latent space (equations (6) and (3) respectively), given the non-robustly learned Θ* = (µ*, Σ*). The excess risk caused by the perturbation is

L_ls(Θ*, D_j) − L(Θ*, D_j) = (1/2) Σ_{i=1}^q [(1 + (λ_i − σ²)/((L − 1)λ_i + σ²))² − 1] λ^{(j)}_i/λ_i, j = 1, 2, 3,

and the excess risk caused by the change of distribution is

L(Θ*, D_j) − L(Θ*, D) = (1/2) log(Π_{i=1}^d λ^{(j)}_i / Π_{i=1}^d λ_i) + (1/2)(Σ_{i=1}^d λ^{(j)}_i/λ_i − d).

Proof: Since x′ ∼ D_j = N(µ*, Σ_j) = N(µ*, U*Λ^{(j)}U*^T), denote v = x′ − µ* ∼ N(0, U*Λ^{(j)}U*^T).

And we have

WW^T = U_q(Λ_q − σ²I)U_q^T = U* diag(Λ_q − σ²I, 0) U*^T.

The excess risk caused by the perturbation is

2(L_ls(Θ*, D_j) − L(Θ*, D_j))
= E(v + WW^T(LΣ* − WW^T)^{-1}v)^T Σ*^{-1} (v + WW^T(LΣ* − WW^T)^{-1}v) − E v^TΣ*^{-1}v
= Tr[(I + WW^T(LΣ* − WW^T)^{-1})^T Σ*^{-1} (I + WW^T(LΣ* − WW^T)^{-1}) E vv^T] − Tr[Σ*^{-1} E vv^T]
= Tr[diag([I + (Λ_q − σ²I)((L − 1)Λ_q + σ²I)^{-1}]², I) Λ*^{-1}Λ^{(j)}] − Tr[Λ*^{-1}Λ^{(j)}]
= Σ_{i=1}^q [(1 + (λ_i − σ²)/((L − 1)λ_i + σ²))² − 1] λ^{(j)}_i/λ_i, j = 1, 2, 3,

and the excess risk caused by the change of distribution is

2(L(Θ*, D_j) − L(Θ*, D))
= log|Σ_j| − log|Σ*| + E_{x′}(x′ − µ*)^TΣ*^{-1}(x′ − µ*) − E_x(x − µ*)^TΣ*^{-1}(x − µ*)
= log|Σ_j| − log|Σ*| + Tr(Σ*^{-1} E_{x′}(x′ − µ*)(x′ − µ*)^T) − Tr(Σ*^{-1} E_x(x − µ*)(x − µ*)^T)
= log(Π_{i=1}^d λ^{(j)}_i / Π_{i=1}^d λ_i) + Tr(Λ*^{-1}Λ^{(j)}) − Tr(Λ*^{-1}Λ*)
= log(Π_{i=1}^d λ^{(j)}_i / Π_{i=1}^d λ_i) + Σ_{i=1}^d λ^{(j)}_i/λ_i − d. ∎

It is hard to see which part dominates the excess risk. If we further assume that the data lie in a q dimensional manifold, the excess risk caused by the change of distribution becomes 0, and we have the following corollary.

Corollary 3 (Excess risk). Let L_ls and L be the loss with and without perturbation in the latent space (equations (6) and (3) respectively), given the non-robustly learned Θ* = (µ*, Σ*), and rank(Σ*) = q. Then L_ls(Θ*, D_j) − L(Θ*, D) = O(qL^{-2}).

Proof: By Lemma 1, we have σ² = 0 and λ^{(j)}_i = λ_i. The excess risk caused by the perturbation is

2(L_ls(Θ*, D_j) − L(Θ*, D_j)) = Σ_{i=1}^q [(1 + (λ_i − σ²)/((L − 1)λ_i + σ²))² − 1] λ^{(j)}_i/λ_i = Σ_{i=1}^q [(1 + 1/(L − 1))² − 1] = O(qL^{-2}). ∎

Theorem 4 (Excess risk of original space adversarial training). Let L_r and L be the loss with and without perturbation in the original space (equations (4) and (3) respectively), given the non-robustly learned Θ* = (µ*, Σ*). Let λ_min denote the smallest eigenvalue of Σ*.
The excess risk satisfies
$$\Omega\left((\lambda_{\min} L)^{-2}\right) \le \mathcal{L}_r(\Theta_*, D) - \mathcal{L}(\Theta_*, D) \le O\left(d(\lambda_{\min} L)^{-2}\right).$$
Theorem 4 can be viewed as a corollary of Theorem 1 in Ilyas et al. (2019). We give the proof here.

Proof: Consider the Lagrange-multiplier form of the inner maximization problem in equation 4, $\max_{\|\Delta x\| \le \varepsilon} \ell(x + \Delta x, \mu, \Sigma)$. The Lagrangian function is
$$\ell(x + \Delta x, \mu, \Sigma) - \frac{L}{2}\left(\|\Delta x\|^2 - \varepsilon^2\right) = \frac{d}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| + \frac{1}{2}(x - \mu + \Delta x)^\top \Sigma^{-1}(x - \mu + \Delta x) - \frac{L}{2}\left(\|\Delta x\|^2 - \varepsilon^2\right).$$
Notice that this quadratic objective is concave when $L$ is larger than the largest eigenvalue of $\Sigma^{-1}$. Setting the partial derivative with respect to $\Delta x$ to zero, we have
$$\Sigma^{-1}(x - \mu + \Delta x^*) - L\Delta x^* = 0 \;\Leftrightarrow\; \Delta x^* = (L\Sigma - I)^{-1}(x - \mu).$$
The excess risk is
$$2\left(\mathcal{L}_r(\Theta_*, D) - \mathcal{L}(\Theta_*, D)\right) = E\left(v + (L\Sigma_* - I)^{-1}v\right)^\top \Sigma_*^{-1}\left(v + (L\Sigma_* - I)^{-1}v\right) - E\, v^\top \Sigma_*^{-1} v$$
$$= \mathrm{Tr}\left[\left(I + (L\Sigma_* - I)^{-1}\right)^\top \Sigma_*^{-1}\left(I + (L\Sigma_* - I)^{-1}\right) E[vv^\top]\right] - \mathrm{Tr}\left[\Sigma_*^{-1} E[vv^\top]\right] = \sum_{i=1}^{d}\left[\left(1 + \frac{1}{L\lambda_i - 1}\right)^2 - 1\right].$$
On the one hand,
$$\sum_{i=1}^{d}\left[\left(1 + \frac{1}{L\lambda_i - 1}\right)^2 - 1\right] \ge \left(1 + \frac{1}{L\lambda_{\min} - 1}\right)^2 - 1 \ge \Omega\left((L\lambda_{\min})^{-2}\right).$$
On the other hand,
$$\sum_{i=1}^{d}\left[\left(1 + \frac{1}{L\lambda_i - 1}\right)^2 - 1\right] \le d\left[\left(1 + \frac{1}{L\lambda_{\min} - 1}\right)^2 - 1\right] \le O\left(d(L\lambda_{\min})^{-2}\right).$$
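The two closed forms above give each attack a per-eigendirection "gain": by Lemma 2 the generative attack scales coordinate $i$ of $x - \mu$ (through $W$) by $(\lambda_i - \sigma^2)/((L-1)\lambda_i + \sigma^2)$ for $i \le q$ and by zero otherwise, while by Theorem 4 the regular attack scales it by $1/(L\lambda_i - 1)$. A small numeric sketch makes the contrast explicit; the eigenvalues, $q$, and $L$ below are illustrative values, not taken from our experiments.

```python
import numpy as np

# Per-eigendirection attack gains implied by Lemma 2 and Theorem 4.
# All numbers are illustrative, not from the paper.
lam = np.array([5.0, 3.0, 1.0, 0.2, 0.05])   # eigenvalues, descending variance
q = 2                                         # manifold dimension
sigma2 = lam[q:].mean()                       # noise level sigma^2
L = 25.0                                      # L > 1/lam.min(), so both inner problems are concave

# Regular (off-manifold) attack: dx* = (L*Sigma - I)^{-1}(x - mu)
reg_gain = 1.0 / (L * lam - 1.0)

# Generative (on-manifold) attack through W: gain is zero outside the manifold
gen_gain = np.where(np.arange(lam.size) < q,
                    (lam - sigma2) / ((L - 1) * lam + sigma2), 0.0)

print(reg_gain)   # grows as variance shrinks
print(gen_gain)   # grows with variance, zero off the manifold
```

The printed gains reproduce the first claim of the paper: the regular attack concentrates on the small-variance directions, while the generative attack concentrates on the large-variance (manifold) directions.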

A.1.2 SADDLE POINT ANALYSIS

Theorem 5 (Main result: optimal saddle point). The optimal solution of the modified problem in equation 8 is $\mu_{ls} = \mu_*$ and $\Sigma_{ls} = U_* \Lambda_{ls} U_*^\top$, where
$$\lambda_i^{ls} = \frac{1}{4}\left(2\lambda_i^{(j)} + \frac{4(\lambda_i - \sigma^2)}{L} + 2\lambda_i^{(j)}\sqrt{1 + \frac{4(\lambda_i - \sigma^2)}{\lambda_i^{(j)} L}}\right) \quad \text{for } i \le q,$$
and $\lambda_i^{ls} = \lambda_i$ for $i > q$; $j = 1, 2, 3$ corresponds to strategies 1, 2 and 3.

Problem 6 is not a standard minimax problem, so consider the modified problem
$$\min_{\mu, \Sigma}\; \max_{E_x\|\Delta z\| = \varepsilon}\; E_{x \sim D_j}\, \ell(x + W\Delta z, \mu, \Sigma), \quad j = 1, 2, 3.$$
By Lemma 3, the optimal perturbation $\Delta z^*$ is a matrix $M$ times $x - \mu$. Consider the problem
$$\min_{\mu, \Sigma}\; \max_{E_x\|M(x - \mu)\| = \varepsilon}\; E_{x \sim D_j}\, \ell(x + WM(x - \mu), \mu, \Sigma), \quad j = 1, 2, 3.$$

Lemma 6 (Optimal perturbation). Given $\Theta = (\mu, \Sigma)$, the optimal solution of the inner max problem of 10 is $M^* = W^\top (L\Sigma - WW^\top)^{-1}$.

Proof: Consider the problem $\max_{E\|M(x - \mu)\| = \varepsilon} E\, \ell(x + WM(x - \mu), \mu, \Sigma)$. The Lagrangian function is
$$E\, \ell(x + WM(x - \mu), \mu, \Sigma) - \frac{L}{2}\left(\|M(x - \mu)\|^2 - \varepsilon^2\right).$$
Let $x - \mu = v$. Taking the gradient with respect to $M$ and setting it to zero, we have
$$\nabla_M\, E\left[v^\top M^\top W^\top \Sigma^{-1} v + \frac{1}{2} v^\top M^\top W^\top \Sigma^{-1} W M v - \frac{L}{2} v^\top M^\top M v\right] = \left(W^\top \Sigma^{-1} + W^\top \Sigma^{-1} W M - L M\right) E[vv^\top] = 0.$$
Therefore, this is a convex problem. By the same calculation as in Lemma 6, a maximizer of the inner problem is
$$\Lambda_M^* = \begin{pmatrix} \Lambda_q - \sigma^2 I & 0 \\ 0 & 0 \end{pmatrix}^{1/2}\left(L\Lambda - \begin{pmatrix} \Lambda_q - \sigma^2 I & 0 \\ 0 & 0 \end{pmatrix}\right)^{-1}.$$
Then
$$A = \left(I + \begin{pmatrix} \Lambda_q - \sigma^2 I & 0 \\ 0 & 0 \end{pmatrix}\left(L\Lambda - \begin{pmatrix} \Lambda_q - \sigma^2 I & 0 \\ 0 & 0 \end{pmatrix}\right)^{-1}\right)^2.$$
Setting the first-order derivative with respect to $[T, m]$ to zero (computed as in Daskalakis et al. (2018)) gives
$$\frac{1}{2} A \Lambda^{(j)} T^{-2} - \frac{1}{2} T^{-1} = 0 \quad \text{and} \quad A T^{-1}\left(m - U_*^\top \mu_*\right) = 0.$$
From the second equation, we directly have $\mu_{ls} = \mu_*$. From the first equation, for $i > q$ we have $(1 + 0)^2 \lambda_i^{(j)} = \lambda_i^{ls}$, and for $i \le q$ we have
$$\left(1 + \frac{\lambda_i - \sigma^2}{L\lambda_i^{ls} - \lambda_i + \sigma^2}\right)^2 \lambda_i^{(j)} = \lambda_i^{ls}.$$
This is equivalent to a quadratic equation in $\lambda_i^{ls}$,
$$\left(\lambda_i^{ls}\right)^2 - \left(\lambda_i^{(j)} + \frac{2(\lambda_i - \sigma^2)}{L}\right)\lambda_i^{ls} + \frac{(\lambda_i - \sigma^2)^2}{L^2} = 0.$$
Solving this equation and taking the larger root, we obtain
$$\lambda_i^{ls} = \frac{1}{4}\left(2\lambda_i^{(j)} + \frac{4(\lambda_i - \sigma^2)}{L} + 2\lambda_i^{(j)}\sqrt{1 + \frac{4(\lambda_i - \sigma^2)}{\lambda_i^{(j)} L}}\right) \quad \text{for } i \le q,$$
and $\lambda_i^{ls} = \lambda_i^{(j)}$ for $i > q$.
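The closed form for $\lambda_i^{ls}$ can be checked numerically against the fixed-point equation it solves; the eigenvalues, $\sigma^2$, and $L$ below are made-up illustrative numbers, not quantities from our experiments.

```python
import numpy as np

# Check the Theorem 5 closed form against its fixed-point equation
# (1 + (lam - sigma^2)/(L*lam_ls - lam + sigma^2))^2 * lam_j == lam_ls.
# All numbers are illustrative.
lam   = np.array([5.0, 3.0])     # clean eigenvalues lam_i, i <= q
lam_j = np.array([5.0, 3.0])     # attacked-distribution eigenvalues lam_i^(j)
sigma2, L = 0.4, 10.0
a = lam - sigma2

lam_ls = 0.25 * (2 * lam_j + 4 * a / L
                 + 2 * lam_j * np.sqrt(1 + 4 * a / (lam_j * L)))

fp = (1 + a / (L * lam_ls - a)) ** 2 * lam_j   # should equal lam_ls
print(lam_ls, fp)
```

The inequality `lam_ls > lam_j`, which holds whenever $\lambda_i > \sigma^2$, also illustrates the second claim of the paper: latent-space adversarial training extends the boundary in the large-variance manifold directions, since the robust eigenvalues exceed the clean ones for $i \le q$.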
Proof of Theorem 8: Consider the derivative of Problem 13:
$$\nabla_\mu \mathcal{L} = \frac{1}{n + m}\sum_{i=1}^{n}\Sigma^{-1}(x_i - \mu) + \frac{m}{n + m} E_x\left[\Sigma^{-1}(x - \mu)\right] = 0,$$
from which we obtain $\mu_{da} = \hat\mu$. Similarly,
$$\nabla_{\Sigma^{-1}} \mathcal{L} = -\Sigma + \frac{1}{n + m}\sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^\top + \frac{m}{n + m} E_x\left[(x - \mu)(x - \mu)^\top\right] = 0,$$
so we have
$$\Sigma_{da} = \frac{n}{n + m}\hat S + \frac{m}{n + m}\hat S_{gen}.$$
For $i \le q$,
$$\lambda_i^{da} = \frac{n}{n + m}\hat\lambda_i + \frac{m}{n + m}\hat\lambda_i = \hat\lambda_i.$$
For $i > q$,
$$\lambda_i^{da} = \frac{n}{n + m}\hat\lambda_i + \frac{m}{n + m}\hat\sigma^2 = \frac{n}{n + m}\hat\lambda_i + \frac{m}{(n + m)(d - q)}\sum_{k = q + 1}^{d}\hat\lambda_k.$$
The optimal solution is slightly distorted. From this perspective, generative models do not help data augmentation.

B EXPERIMENT SETTINGS



Figure 1: Demonstration of the theoretical analysis: (a) regular attack directions; (b) optimal saddle point of original-space adversarial training; (c) generative attack directions; (d) optimal saddle point of latent-space adversarial training.

Figure 2: Eigenvalues of MNIST. The first row and the second row correspond to classes 0 and 1 respectively. The first column shows all 784 eigenvalues of the dataset; the second column plots the large eigenvalues and the last column plots the small eigenvalues.


B.1 MNIST

For MNIST, we use LeNet5 for the classifier and a 2-layer MLP (with hidden sizes 256 and 784) for the encoder and decoder of the conditional VAE. For standard training of the classifier, we use 30 epochs, batch size 128, learning rate $10^{-3}$, and weight decay $5 \times 10^{-4}$. For the CVAE, we use 20 epochs, learning rate $10^{-3}$, batch size 64, and latent size 10. For standard adversarial training, we use $\varepsilon = 0.25$ for FGSM and PGD; in PGD, we use 40 steps for the inner maximization. Adversarial training starts after 10 epochs of standard training. For generative adversarial training, we use $\varepsilon = 1$ in the latent space with FGSM, again starting after 10 epochs of standard training. For the attacks, we use $\varepsilon = 0.2$ for the norm-based attacks and $\varepsilon = 1$ for the generative attack on the test set.

B.2 CIFAR10

For CIFAR10, we use ResNet32 for the classifier and a 4-layer CNN for the encoder and decoder of the conditional VAE. For standard training of the classifier, we use 200 epochs, batch size 128, learning rate $10^{-3}$, and weight decay $5 \times 10^{-4}$. For the CVAE, we use 100 epochs, learning rate $10^{-3}$, batch size 64, and latent size 128. For standard adversarial training, we use $\varepsilon = 4/255$ for FGSM and PGD; in PGD, we use 10 steps for the inner maximization. Adversarial training starts after 100 epochs of standard training. For generative adversarial training, we use $\varepsilon = 0.1$ in the latent space with FGSM, again starting after 100 epochs of standard training. The modeling power of the VAE on CIFAR10 is limited: the encoded variance of each image is very small, and even a small perturbation of the encoded mean blurs the output image. Hence we only use a small $\varepsilon = 0.1$. For the attacks, we use $\varepsilon = 4/255$ for the norm-based attacks and $\varepsilon = 0.1$ for the generative attack on the test set. The test accuracy of VAE-adv against the VAE attack is 40.18% in our experiments, which is limited by the modeling power of the VAE.
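For reference, the FGSM step used in both training pipelines above is the one-shot sign-of-gradient update $x_{adv} = x + \varepsilon\,\mathrm{sign}(\nabla_x \mathrm{loss})$. Below is a minimal self-contained sketch on a toy logistic classifier; the weights, inputs, and $\varepsilon$ are made up for illustration and are not our experimental models.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fgsm(x, y, w, b, eps):
    """x_adv = x + eps * sign(grad_x loss), loss = binary cross-entropy."""
    p = sigmoid(x @ w + b)
    grad_x = (p - y) * w          # analytic d/dx of BCE(sigmoid(w.x + b), y)
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w, b = rng.standard_normal(10), 0.0   # toy classifier
x, y = rng.standard_normal(10), 1.0   # toy input with label 1
x_adv = fgsm(x, y, w, b, eps=0.2)
print((x_adv @ w + b) - (x @ w + b))  # logit moves against the true label
```

PGD iterates this step with a smaller step size and a projection back onto the $\varepsilon$-ball, while the generative variant used above applies the same kind of update to the latent code and decodes the result.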
But the results of standard adversarial training against VAE attacks are worse, and vice versa. The experiments support our findings.

B.3 EIGENVALUES OF THE COVARIANCE MATRIX OF MNIST

We plot the eigenvalues of all the classes in this section; see Figure 3.

C FINITE-SAMPLE CASE: DATA AUGMENTATION OR ADVERSARIAL TRAINING

In this section we discuss whether we can use the generative model to generate more examples for training in our theoretical framework. Because this is not closely related to our main results, we only discuss it in the appendix. Let us focus on the case where the number of samples is insufficient.

Generative models cannot help data augmentation. Given a dataset $\{x_i\}_{i=1}^{n} \subset \mathbb{R}^d$, let $\hat\mu$ and $\hat S = \hat U \hat\Lambda \hat U^\top$ be the sample mean and sample covariance matrix. The generative model learned from this dataset is $x = \hat U_q(\hat\Lambda_q - \hat\sigma^2 I)^{1/2} z + \hat\mu + \epsilon$. The distribution of a data sample from the model is
$$x_{gen} \sim N(\hat\mu, WW^\top + \hat\sigma^2 I) = N(\hat\mu, \hat U \hat\Lambda_{gen} \hat U^\top) \ne N(\hat\mu, \hat S). \tag{12}$$
If we use $n$ samples from the original dataset and $m$ samples from the generative model, the training objective is
$$\mathcal{L} = \frac{1}{n + m}\sum_{i=1}^{n} \ell(x_i; \mu, \Sigma) + \frac{m}{n + m}\, E_{x \sim N(\hat\mu, \hat U \hat\Lambda_{gen} \hat U^\top)}\left[\ell(x; \mu, \Sigma)\right]. \tag{13}$$

Theorem 8 (Data augmentation by generative model). Given the MLE $\hat\mu$ and $\hat S$, the optimal solution of training with $n$ true samples and $m$ generated samples is $\mu_{da} = \hat\mu$ and $\Sigma_{da} = \hat U \Lambda_{da} \hat U^\top$, where $\lambda_i^{da} = \hat\lambda_i$ for $i \le q$ and $\lambda_i^{da} = \frac{n}{n+m}\hat\lambda_i + \frac{m}{n+m}\hat\sigma^2$ for $i > q$.
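Theorem 8 can be illustrated directly on a spectrum: mixing $n$ real and $m$ generated samples leaves the leading $q$ eigenvalues of the fitted covariance unchanged and only averages the tail toward $\hat\sigma^2$, so the generated samples add no new information. The numbers below are illustrative, not from our experiments.

```python
import numpy as np

# Spectrum of the covariance fitted on n real + m generated samples
# (Theorem 8 sketch). All numbers are illustrative.
n, m, q = 100, 100, 2
lam_hat = np.array([5.0, 3.0, 1.0, 0.5, 0.1])        # sample eigenvalues
sigma2_hat = lam_hat[q:].mean()                       # PPCA noise estimate
lam_gen = np.concatenate([lam_hat[:q],                # generator's spectrum
                          np.full(lam_hat.size - q, sigma2_hat)])
lam_da = (n * lam_hat + m * lam_gen) / (n + m)        # augmented spectrum
print(lam_da)   # leading q entries unchanged, tail flattened
```

The top-$q$ eigenvalues stay exactly at $\hat\lambda_i$, while the tail is pulled toward its own mean, i.e., the estimate is slightly distorted rather than improved.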

Test accuracies of different attacks and defense algorithms on MNIST and CIFAR-10.

...the same directions of small variance; hence they conflict with each other. We study the robustness trade-off between regular and generative adversarial examples in this subsection.


Then we have
$$M^* = (L - W^\top \Sigma^{-1} W)^{-1} W^\top \Sigma^{-1} = W^\top (L\Sigma - WW^\top)^{-1},$$
where the last equality is the Woodbury matrix inversion lemma. Notice that Lemma 3 and Lemma 6 have the same form of solution. This is why we can use Problem 10 to approximate Problem 6. To solve Problem 10, we need to introduce Danskin's theorem.

Theorem 7 (Danskin's theorem). Suppose $\phi(x, z): X \times Z \to \mathbb{R}$ is a continuous function of two arguments, where $Z \subset \mathbb{R}^m$ is compact. Define $f(x) = \max_{z \in Z} \phi(x, z)$. Then, if for every $z \in Z$, $\phi(x, z)$ is convex and differentiable in $x$, and $\partial\phi/\partial x$ is continuous, the subdifferential of $f(x)$ is given by
$$\partial f(x) = \mathrm{conv}\left\{\frac{\partial \phi(x, z)}{\partial x} : z \in Z_0(x)\right\},$$
where $\mathrm{conv}(\cdot)$ is the convex hull and $Z_0(x) = \{\bar z : \phi(x, \bar z) = \max_{z \in Z}\phi(x, z)\}$.

If the outer minimization problem is convex and differentiable, we can use any maximizer of the inner maximization problem to find the saddle point. But the outer problem of Problem 4 is not convex, so we need to modify the problem again. Assume that we have already obtained the eigenvectors $U_*$ from the ML estimator. The optimization variables of the outer minimization problem are $\mu$ and $\Lambda$; then $\Sigma = U_* \Lambda U_*^\top$ in Problem 10. Another reason to make this assumption is that we only need to consider the eigenvalue problem to compare with standard adversarial training (Theorem 2 of Ilyas et al. (2019)).
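Danskin's theorem is easy to see numerically: the derivative of $f(x) = \max_z \phi(x, z)$ equals $\partial\phi/\partial x$ evaluated at the maximizer $z^*(x)$. A tiny check with the made-up function $\phi(x, z) = xz - z^2$ on $z \in [-1, 1]$, for which $z^*(x) = \mathrm{clip}(x/2, -1, 1)$ and $\partial\phi/\partial x = z$:

```python
import numpy as np

# Numeric illustration of Danskin's theorem with phi(x, z) = x*z - z^2,
# z in [-1, 1]; f'(x) should equal the maximizer z*(x). Example is made up.
def f(x, zs=np.linspace(-1, 1, 200001)):
    return np.max(x * zs - zs ** 2)   # grid approximation of max_z phi(x, z)

x0 = 0.8
z_star = np.clip(x0 / 2, -1, 1)       # argmax of phi over [-1, 1]
h = 1e-5
num_grad = (f(x0 + h) - f(x0 - h)) / (2 * h)
print(num_grad, z_star)               # both close to 0.4
```

This is exactly how the saddle-point computation above uses the theorem: the gradient of the outer objective is taken through a fixed inner maximizer.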

Proof of Theorem 5:

By Lemma 6, we have
$$\Lambda_M = \begin{pmatrix} \Lambda_q - \sigma^2 I & 0 \\ 0 & 0 \end{pmatrix}^{1/2}\left(L\Lambda - \begin{pmatrix} \Lambda_q - \sigma^2 I & 0 \\ 0 & 0 \end{pmatrix}\right)^{-1},$$
which is a diagonal matrix. Obviously, the inner constraint set is compact (by the Heine–Borel theorem), so we only need to prove the convexity of the outer problem in order to apply Danskin's theorem. For any $x$ and $\Lambda_M$, let $u = U_*^\top(x - \mu)$ and
$$A = \left(I + \begin{pmatrix} \Lambda_q - \sigma^2 I & 0 \\ 0 & 0 \end{pmatrix}\Lambda_M\right)^2,$$
and consider the third term. By Daskalakis et al. (2018), the Hessian matrix of this term is positive semi-definite, so the outer problem is convex.

Proof: Consider strategy 3. In this case, the eigenvectors $\hat U$ of the distribution are the same as those of the perturbation $W\Delta z$. The optimization problem is then the same as the one in the proof of Theorem 5 with $\lambda_i$ replaced by $\hat\lambda_i$. If the data lie in a $q$-dimensional subspace, we do not need to make this assumption on strategy 3.

