DISENTANGLING ADVERSARIAL ROBUSTNESS IN DIRECTIONS OF THE DATA MANIFOLD

Abstract

Using generative models (GANs or VAEs) to craft adversarial examples, i.e. generative adversarial examples, has received increasing attention in recent years. Previous studies showed that generative adversarial examples behave differently from regular adversarial examples in many aspects, such as attack rates, perceptibility, and generalization, but the reasons for these differences are unclear. In this work, we study the theoretical properties of the attacking mechanisms of the two kinds of adversarial examples in a Gaussian mixture model. We prove that adversarial robustness can be disentangled in directions of the data manifold. Specifically, we find that: 1. Regular adversarial examples attack in directions of small variance of the data manifold, while generative adversarial examples attack in directions of large variance. 2. Standard adversarial training increases model robustness by extending the data manifold boundary in directions of small variance, while, on the contrary, adversarial training with generative adversarial examples increases model robustness by extending the data manifold boundary in directions of large variance. In experiments, we demonstrate that these phenomena also exist on real datasets. Finally, we study the robustness trade-off between generative and regular adversarial examples. We show that the conflict between regular and generative adversarial examples is much smaller than the conflict between regular adversarial examples of different norms.

1. INTRODUCTION

In recent years, deep neural networks (DNNs) (Krizhevsky et al. (2012); Hochreiter and Schmidhuber (1997)) have become popular and successful in many machine learning tasks. But DNNs are vulnerable to adversarial examples (Szegedy et al. (2013); Goodfellow et al. (2014a)): a well-trained model can easily be fooled by adding a small perturbation to the input image. An effective way to mitigate this issue is to train the model on training data augmented with adversarial examples, i.e. adversarial training. With the growing success of generative models, researchers have used generative adversarial networks (GANs) (Goodfellow et al. (2014b)) and variational autoencoders (VAEs) (Kingma and Welling (2013)) to generate adversarial examples (Xiao et al. (2018); Zhao et al. (2017); Song et al. (2018a); Kos et al. (2018); Song et al. (2018b)) that fool classification models with great success, and found that standard adversarial training cannot defend against these new attacks. Unlike regular adversarial examples, these new adversarial examples are perceptible by humans, but they preserve the semantic information of the original data. A good DNN should be robust to such semantic attacks. Since GANs and VAEs are approximations of the true data distribution, these adversarial examples stay in the data manifold; hence Stutz et al. (2019) call them on-manifold adversarial examples. On the other hand, experimental evidence supports that regular adversarial examples leave the data manifold (Song et al. (2017)), so we call regular adversarial examples off-manifold adversarial examples. The distinction between on-manifold and off-manifold adversarial examples is important because it can help us understand the conflict between adversarial robustness and generalization (Stutz et al. (2019); Raghunathan et al. (2019)), which is still an open problem. In this paper, we study the attacking mechanisms of these two types of adversarial examples, as well as the corresponding adversarial training methods. This study, as far as we know, has not been done before. Specifically, we consider a generative attack method that adds a small perturbation in the latent space of the generative model.

Since standard adversarial training cannot defend this attack, we consider training methods that use training data augmented with these on-manifold adversarial examples, which we call latent space adversarial training, and compare it to standard adversarial training (training with off-manifold adversarial examples).

Contributions: We study the theoretical properties of latent space adversarial training and standard adversarial training in a Gaussian mixture model with a linear generator. We give an excess risk analysis and a saddle point analysis in this model. Based on this case study, we claim that:

• Regular adversarial examples attack in directions of small variance of the data manifold and leave the data manifold.
• Standard adversarial training increases model robustness by amplifying the small variances. Hence, it extends the boundary of the data manifold in directions of small variance.
• Generative adversarial examples attack in directions of large variance of the data manifold and stay in the data manifold.
• Latent space adversarial training increases model robustness by amplifying the large variances. Hence, it extends the boundary of the data manifold in directions of large variance.

We provide experiments on MNIST and CIFAR-10 and show that the above phenomena also exist on real datasets. This gives us a new perspective for understanding the behavior of on-manifold and off-manifold adversarial examples. Finally, we study the robustness trade-off between generative and regular adversarial examples. On MNIST, the robustness trade-off is unavoidable, but the conflict between generative and regular adversarial examples is much smaller than the conflict between regular adversarial examples of different norms. On CIFAR-10, there is nearly no robustness trade-off between generative and regular adversarial examples.

Attack methods. In the black-box setting (Ilyas et al. (2018)), the attackers have limited access to the model. First-order optimization methods, which use gradient information to craft adversarial examples, such as PGD (Madry et al. (2017)), are widely used for white-box attacks, while zeroth-order optimization methods (Chen et al. (2017)) are used in the black-box setting. Li et al. (2019) improved the query efficiency of black-box attacks, and HopSkipJumpAttack (Chen et al. (2020)) is another query-efficient attack method.

Defense with generative models. Using generative models to design defense algorithms has been studied extensively. Using a GAN, we can project the adversarial examples back to the data manifold
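To make the two attacking mechanisms concrete, the following sketch contrasts a regular l-infinity PGD attack in input space with a generative attack that runs the same iteration in the latent space of a linear generator, in the spirit of the Gaussian-mixture case study above. The linear classifier, the generator matrix, and all hyperparameters (w, A, eps, alpha, steps) are illustrative assumptions for the demo, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a linear classifier f(x) = w.x on d-dim inputs, and a linear
# "generator" G(z) = A z mapping a k-dim latent space into the data space.
d, k = 10, 3
w = rng.normal(size=d)       # classifier weights (illustrative)
A = rng.normal(size=(d, k))  # linear generator (illustrative)

def margin(x, y):
    return y * (w @ x)       # > 0 means correctly classified

def grad_x(x, y):
    # Gradient of the logistic loss log(1 + exp(-y w.x)) w.r.t. x.
    return -y * w / (1.0 + np.exp(margin(x, y)))

def pgd_attack(x, y, eps=0.3, alpha=0.05, steps=20):
    """Regular (off-manifold) attack: l_inf PGD directly in input space."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_x(x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the eps-ball
    return x_adv

def latent_attack(z, y, eps=0.3, alpha=0.05, steps=20):
    """Generative (on-manifold) attack: the same PGD iteration, but on z."""
    z_adv = z.copy()
    for _ in range(steps):
        g = A.T @ grad_x(A @ z_adv, y)  # chain rule through the generator
        z_adv = z_adv + alpha * np.sign(g)
        z_adv = np.clip(z_adv, z - eps, z + eps)
    return A @ z_adv  # the adversarial example stays in range(A)

z = rng.normal(size=k)
x = A @ z                               # a clean point on the manifold
y = 1.0 if w @ x > 0 else -1.0          # label the classifier gets right

x_off = pgd_attack(x, y)
x_on = latent_attack(z, y)
print(margin(x, y), margin(x_off, y), margin(x_on, y))  # both attacks shrink the margin
```

The key difference is only *where* the perturbation lives: `x_off` can leave the generator's range, while `x_on` is the image of a perturbed latent code and therefore remains on the (linear) data manifold.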
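The last sentence alludes to projection-based defenses (e.g. in the style of Defense-GAN) that map an adversarial input back onto the generator's range. With a linear generator G(z) = Az, as in the theoretical model above, this projection has a closed form via least squares; with a neural generator it is typically approximated by gradient descent on min_z ||G(z) - x_adv||^2. The dimensions and names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative manifold projection for a *linear* generator G(z) = A z.
d, k = 10, 3
A = rng.normal(size=(d, k))  # linear generator (illustrative)

def project_to_manifold(x_adv):
    """Return the closest point in range(A) to x_adv (least-squares fit)."""
    z_star, *_ = np.linalg.lstsq(A, x_adv, rcond=None)
    return A @ z_star

x_clean = A @ rng.normal(size=k)   # a point on the manifold
delta = rng.normal(size=d)         # an off-manifold perturbation
x_adv = x_clean + 0.5 * delta

# Projection removes the component of the perturbation orthogonal to the
# manifold, so x_proj ends up closer to the clean point than x_adv is.
x_proj = project_to_manifold(x_adv)
```

Since regular adversarial perturbations concentrate in small-variance (near off-manifold) directions, such a projection discards most of their effect, which is one intuition behind this family of defenses.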

