A FREQUENCY DOMAIN ANALYSIS OF GRADIENT-BASED ADVERSARIAL EXAMPLES

Abstract

It is well known that deep neural networks are vulnerable to adversarial examples. We attempt to understand adversarial examples from the perspective of frequency analysis. Several works have empirically shown that gradient-based adversarial attacks behave differently in the low-frequency and high-frequency parts of the input data, but a theoretical justification for these phenomena is still lacking. In this work, we show, both theoretically and empirically, that adversarial perturbations gradually become more concentrated in the low-frequency part of the spectrum as the model parameters are trained, and that the log-spectrum difference between adversarial examples and clean images is more concentrated in the high-frequency part than in the low-frequency part. We also find that the ratio of high-frequency to low-frequency energy in an adversarial perturbation is much larger than that in the corresponding natural image. Inspired by these theoretical findings, we apply a low-pass filter to potential adversarial examples before feeding them to the model. The results show that this preprocessing can significantly improve the robustness of the model.

1. INTRODUCTION

Recently, deep neural networks (DNNs) have achieved great success in the field of image processing, but it has been found that DNNs are vulnerable to synthetic data called adversarial examples (Szegedy et al., 2013; Kurakin et al., 2016). Adversarial examples are natural samples plus adversarial perturbations; the perturbations are imperceptible to humans but able to fool the model. Typically, generating an adversarial example can be viewed as finding an example in an ε-ball around a natural image that is misclassified by the classifier. Recent studies have designed the Fast Gradient Sign Method (FGSM, Goodfellow et al., 2014), the Fast Gradient Method (FGM, Miyato et al., 2016), Projected Gradient Descent (PGD, Madry et al., 2017), and other algorithms (Carlini & Wagner, 2017; Su et al., 2019; Xiao et al., 2018; Kurakin et al., 2016; Chen et al., 2017) to attack the model.

Since the phenomenon of adversarial examples was discovered, many works have made progress toward explaining why they exist. Several studied the phenomenon from the perspective of feature representation. Ilyas et al. (2019) divided features into non-robust ones, which are responsible for the model's vulnerability to adversarial examples, and robust ones, which are close to human perception; they further showed that adversarial vulnerability arises from non-robust features that are nevertheless useful for correct classification. Another way to characterize adversarial examples is to investigate them in the frequency domain via the Fourier transform. Wang et al. (2020a) divided an image into a low-frequency component (LFC) and a high-frequency component (HFC) and empirically showed that humans perceive only the LFC, while convolutional neural networks can obtain useful information from both the LFC and the HFC. Yin et al. (2019) filtered the input data with low-pass or high-pass filters to study the model's sensitivity to additive noise at different frequencies. Wang et al. (2020b) claimed that existing adversarial attacks mainly concentrate in the high-frequency part. Sharma et al. (2019) found that when perturbations are constrained to the low-frequency subspace, they are generated faster, are more transferable, and are effective at fooling defended models, but not clean models. All of these works show that the frequency-domain spectrum is a reasonable lens through which to study adversarial examples.
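To make the frequency-domain quantities above concrete, the following is a minimal sketch (not the paper's implementation) of the two operations discussed: splitting a grayscale image's 2-D Fourier spectrum into low- and high-frequency bands with a radial mask, computing the high-to-low frequency energy ratio, and applying a low-pass filter as a preprocessing defense. The cutoff value is a hypothetical hyperparameter chosen for illustration.

```python
import numpy as np

def radial_mask(shape, cutoff=0.25):
    """Boolean mask selecting frequencies below `cutoff` (a hypothetical
    fraction of the half-image size) in a centered 2-D spectrum."""
    h, w = shape
    cy, cx = h // 2, w // 2
    yy, xx = np.ogrid[:h, :w]
    radius = cutoff * min(h, w) / 2
    return (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2

def hf_lf_ratio(image, cutoff=0.25):
    """Energy in the high-frequency band divided by energy in the
    low-frequency band of the image's power spectrum."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    low = radial_mask(image.shape, cutoff)
    return power[~low].sum() / power[low].sum()

def low_pass_filter(image, cutoff=0.25):
    """Zero out high-frequency components before classification,
    attenuating high-frequency perturbation energy."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    filtered = spectrum * radial_mask(image.shape, cutoff)
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))
```

As a sanity check, adding a high-frequency (checkerboard) perturbation to a smooth image raises its high-to-low energy ratio, and low-pass filtering moves the perturbed image back toward the clean one.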

