A FREQUENCY DOMAIN ANALYSIS OF GRADIENT-BASED ADVERSARIAL EXAMPLES

Abstract

It is well known that deep neural networks are vulnerable to adversarial examples. We attempt to understand adversarial examples from the perspective of frequency analysis. Several works have empirically shown that gradient-based adversarial attacks behave differently in the low-frequency and high-frequency parts of the input data, but these phenomena still lack theoretical justification. In this work, we show both theoretically and empirically that adversarial perturbations gradually concentrate in the low-frequency part of the spectrum during the training process of the model parameters, and that the log-spectrum difference between adversarial examples and clean images is more concentrated in the high-frequency part than in the low-frequency part. We also find that the ratio of the high-frequency to the low-frequency part of the adversarial perturbation is much larger than that of the corresponding natural image. Inspired by these theoretical findings, we apply a low-pass filter to potential adversarial examples before feeding them to the model. The results show that this preprocessing significantly improves the robustness of the model.

1. INTRODUCTION

Recently, deep neural networks (DNNs) have achieved great success in the field of image processing, but they were found to be vulnerable to synthetic data called adversarial examples (Szegedy et al., 2013; Kurakin et al., 2016). Adversarial examples are natural samples plus adversarial perturbations; the perturbations are imperceptible to humans but able to fool the model. Typically, generating an adversarial example can be viewed as finding an example in an ε-ball around a natural image that is misclassified by the classifier. Recent studies designed the Fast Gradient Sign Method (FGSM, Goodfellow et al., 2014), the Fast Gradient Method (FGM, Miyato et al., 2016), Projected Gradient Descent (PGD, Madry et al., 2017), and other algorithms (Carlini & Wagner, 2017; Su et al., 2019; Xiao et al., 2018; Kurakin et al., 2016; Chen et al., 2017) to attack the model. Since the phenomenon of adversarial examples was discovered, many works have made progress toward explaining why they exist. Several studied this phenomenon from the perspective of feature representation. Ilyas et al. (2019) divided features into non-robust ones, which are responsible for the model's vulnerability to adversarial examples, and robust ones, which are close to human perception; they further showed that adversarial vulnerability arises from non-robust features that are nevertheless useful for correct classification. Sharma et al. (2019) found that when perturbations are constrained to the low-frequency subspace, they are generated faster, are more transferable, and are effective at fooling defended models, though not clean models. All of these works show that the frequency-domain spectrum is a reasonable lens through which to study adversarial examples. However, there is still a lack of theoretical understanding of the dynamics of adversarial perturbations in the frequency domain along the training process of the model parameters.
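For concreteness, the one-step update rules of FGSM and FGM mentioned above can be sketched as follows. This is our own minimal illustration, not the original authors' code; `grad` is assumed to be the loss gradient with respect to the input, computed elsewhere.

```python
import numpy as np

def fgsm_perturb(x, grad, eps):
    # FGSM (Goodfellow et al., 2014): move each coordinate by eps
    # in the direction of the sign of the loss gradient.
    return x + eps * np.sign(grad)

def fgm_perturb(x, grad, eps):
    # FGM (Miyato et al., 2016): move by eps along the l2-normalized
    # loss gradient (small constant avoids division by zero).
    return x + eps * grad / (np.linalg.norm(grad) + 1e-12)
```

Both produce a perturbation of l∞- or l2-size eps respectively, which is then clipped or projected to stay within the allowed ε-ball in practice.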
In this work, we focus on the frequency domain of adversarial perturbations to explore the spectral properties of adversarial examples generated by FGM (Miyato et al., 2016) and PGD (Madry et al., 2017), and we give a theoretical analysis of adversarial examples in the frequency domain of natural images. Our contributions are as follows:
• For a two-layer neural network with a non-linear activation function, we prove that the adversarial perturbations generated by FGM and l2-PGD attacks gradually concentrate in the low-frequency part of the spectrum during the training process over the model parameters.
• Meanwhile, the log-spectrum difference of the adversarial examples (defined in Section 2.2) is more concentrated in the high-frequency part than in the low-frequency part.
• Furthermore, we show that the ratio of the high-frequency to the low-frequency part of the adversarial perturbation is much larger than that of the corresponding clean image.
• Empirically, we design several experiments on the two-layer model and on ResNet-32 with CIFAR-10 to verify the above findings.
• Based on these phenomena, we filter out the high-frequency part of potential adversarial examples before feeding them to the model to improve robustness. Compared with an adversarially trained model of the same architecture, our method achieves comparable robustness at a computational cost similar to normal training, with almost no loss of accuracy.
The rest of the paper is organized as follows. In Section 2, we present preliminaries and some theoretical analysis on the calculation of the spectrum and log-spectrum. We then provide our main results on the frequency-domain analysis of gradient-based adversarial examples in Section 3. Experiments supporting our theoretical findings are shown in Section 4. Finally, we conclude and discuss future work in Section 5.
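The l2-PGD attack analyzed in this paper iterates normalized gradient steps and projects the accumulated perturbation back onto the ε-ball. The following is a hedged sketch under our own assumptions, not the paper's implementation; `grad_fn` is a hypothetical helper returning the loss gradient at a point.

```python
import numpy as np

def l2_pgd(x0, grad_fn, eps, step, n_steps):
    # l2-PGD sketch (after Madry et al., 2017): ascend the loss with
    # l2-normalized gradient steps, projecting back onto the
    # eps-ball around the clean input x0 after each step.
    x = x0.copy()
    for _ in range(n_steps):
        g = grad_fn(x)
        x = x + step * g / (np.linalg.norm(g) + 1e-12)
        delta = x - x0
        norm = np.linalg.norm(delta)
        if norm > eps:            # projection onto the l2 eps-ball
            x = x0 + delta * (eps / norm)
    return x
```

With a constant-gradient toy objective, the iterate walks outward until the projection pins the perturbation at l2-norm exactly eps.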
All the details about the proof and experiments are shown in the Appendix.

2. BACKGROUND

2.1 PRELIMINARIES

Notations. We use {0, 1, ..., d} to denote the set of all integers between 0 and d, and ||·||_p to denote the l_p norm; ||·|| denotes the l_2 norm. For a d-dimensional vector x, x_µ denotes its µ-th component, with indices starting at 0. For a scalar function f(x): R^d → R, ∇_x f and ∂_µ f denote the gradient vector and its µ-th component. We let sgn(x) = 1 for x > 0 and -1 for x < 0. Normal training refers to training on the original training data to learn the optimal weights of the neural network. g̃ = F(g) denotes the Discrete Fourier Transform (DFT) of g.

Discrete Fourier Transform. The k-th frequency component g̃[k] of the one-dimensional DFT of a vector g is defined by

    g̃[k] = F(g)[k] = Σ_{µ=0}^{d-1} g_µ e^{i(2π/d)kµ},

where k ∈ {-d/2, -d/2 + 1, ..., d/2} if d is even and k ∈ {-(d-1)/2, ..., (d-1)/2} if d is odd. For convenience, we always consider an odd d; the case of even dimensions generalizes easily. For an integer cutoff frequency k_c ∈ (0, (d-1)/2), the Low Frequency Components (LFC) and High Frequency Components (HFC) of g are defined by

    g̃_l[k] = g̃[k] if -k_c ≤ k ≤ k_c, and 0 otherwise;
    g̃_h[k] = 0 if -k_c ≤ k ≤ k_c, and g̃[k] otherwise.
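The LFC/HFC split above can be illustrated with numpy's FFT. Note that numpy uses the opposite sign convention in the exponent (e^{-i(2π/d)kµ}); for the symmetric masks used here the resulting split is the same. This is our own sketch, not code from the paper.

```python
import numpy as np

def split_lfc_hfc(g, k_c):
    # Split a 1-D signal into low- and high-frequency components:
    # frequencies k with |k| <= k_c form the LFC, the rest the HFC.
    d = len(g)
    G = np.fft.fft(g)                 # numpy stores bins in order 0..d-1
    k = np.fft.fftfreq(d, d=1.0 / d)  # signed integer frequency per bin
    low_mask = np.abs(k) <= k_c
    G_low = np.where(low_mask, G, 0)
    G_high = np.where(low_mask, 0, G)
    g_low = np.fft.ifft(G_low).real   # real signals -> real components
    g_high = np.fft.ifft(G_high).real
    return g_low, g_high
```

By construction g_low + g_high recovers g, and for a sum of two cosines the cutoff cleanly separates the slow oscillation from the fast one.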



Another way to characterize adversarial examples is to investigate them in the frequency domain via the Fourier transform. Wang et al. (2020a) divided an image into a low-frequency component (LFC) and a high-frequency component (HFC) and empirically showed that humans mainly perceive the LFC, while convolutional neural networks can obtain useful information from both the LFC and the HFC. Yin et al. (2019) applied low-pass or high-pass filters to the input data to study the model's sensitivity to additive noise of different frequencies. Wang et al. (2020b) claimed that existing adversarial attacks mainly concentrate in the high-frequency part.
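Claims of this kind can be quantified by the ratio of spectral energy above versus below the cutoff, the quantity this paper compares between perturbations and clean images. A minimal sketch of that statistic, under our own assumptions:

```python
import numpy as np

def hf_lf_ratio(g, k_c):
    # Ratio of spectral energy in the HFC (|k| > k_c) to the LFC
    # (|k| <= k_c) of a 1-D signal; small constant guards against
    # division by zero for purely high-frequency inputs.
    G = np.fft.fft(g)
    k = np.fft.fftfreq(len(g), d=1.0 / len(g))
    high = np.sum(np.abs(G[np.abs(k) > k_c]) ** 2)
    low = np.sum(np.abs(G[np.abs(k) <= k_c]) ** 2)
    return high / (low + 1e-12)
```

A slowly varying signal yields a ratio near zero, while a rapidly oscillating one yields a very large ratio, matching the intuition that adversarial perturbations are comparatively high-frequency objects.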

