Under review as a conference paper at ICLR 2021

A FREQUENCY DOMAIN ANALYSIS OF GRADIENT-BASED ADVERSARIAL EXAMPLES

Abstract

It is well known that deep neural networks are vulnerable to adversarial examples. We attempt to understand adversarial examples from the perspective of frequency analysis. Several works have empirically shown that gradient-based adversarial attacks behave differently in the low-frequency and high-frequency parts of the input data, but a theoretical justification for these phenomena is still lacking. In this work, we show both theoretically and empirically that adversarial perturbations gradually increase their concentration in the low-frequency part of the spectrum as the model parameters are trained, and that the log-spectrum difference between adversarial examples and clean images is more concentrated in the high-frequency part than in the low-frequency part. We also find that the ratio of the high-frequency to the low-frequency part is much larger in the adversarial perturbation than in the corresponding natural image. Inspired by these theoretical findings, we apply a low-pass filter to potential adversarial examples before feeding them to the model. The results show that this preprocessing can significantly improve the robustness of the model.

1. INTRODUCTION

Recently, deep neural networks (DNNs) have achieved great success in the field of image processing, but it was found that DNNs are vulnerable to synthetic data called adversarial examples (Szegedy et al., 2013; Kurakin et al., 2016). Adversarial examples are natural samples plus adversarial perturbations; the perturbations are imperceptible to humans but able to fool the model. Typically, generating an adversarial example can be viewed as finding an example in an ε-ball around a natural image that is misclassified by the classifier. Recent studies designed the Fast Gradient Sign Method (FGSM; Goodfellow et al., 2014), the Fast Gradient Method (FGM; Miyato et al., 2016), Projected Gradient Descent (PGD; Madry et al., 2017) and other algorithms (Carlini & Wagner, 2017; Su et al., 2019; Xiao et al., 2018; Kurakin et al., 2016; Chen et al., 2017) to attack the model. Since the phenomenon of adversarial examples was discovered, many works have made progress in studying why they exist. Several works studied this phenomenon from the perspective of feature representation. Ilyas et al. (2019) divided features into non-robust ones that are responsible for the model's vulnerability to adversarial examples, and robust ones that are close to human perception; they further showed that adversarial vulnerability arises from non-robust features that are useful for correct classification. Another way to characterize adversarial examples is to investigate them in the frequency domain via the Fourier transform. Wang et al. (2020a) divided an image into a low-frequency component (LFC) and a high-frequency component (HFC) and empirically showed that humans perceive only the LFC, while convolutional neural networks obtain useful information from both the LFC and the HFC. Yin et al.
(2019) filter the input data with low-pass or high-pass filters to study the sensitivity of models to additive noise at different frequencies. Wang et al. (2020b) claimed that existing adversarial attacks mainly concentrate in the high-frequency part. Sharma et al. (2019) found that when perturbations are constrained to the low-frequency subspace, they are generated faster, are more transferable, and are effective at fooling defended models, but not clean models. All of these works showed that the frequency spectrum is a reasonable lens through which to study adversarial examples. However, there is a lack of theoretical understanding of the dynamics of adversarial perturbations in the frequency domain along the training process of the model parameters. In this work, we focus on the frequency domain of adversarial perturbations to explore the spectral properties of adversarial examples generated by FGM (Miyato et al., 2016) and PGD (Madry et al., 2017). We give a theoretical analysis in the frequency domain of natural images for adversarial examples:
• For a two-layer neural network with a non-linear activation function, we prove that adversarial perturbations generated by FGM and l_2-PGD attacks gradually increase their concentration in the low-frequency part of the spectrum during the training of the model parameters.
• Meanwhile, the log-spectrum difference of the adversarial examples (the definition will be clarified in Section 2.2) is more concentrated in the high-frequency part than in the low-frequency part.
• Furthermore, we show that the ratio of the high-frequency to the low-frequency part is much larger in the adversarial perturbation than in the corresponding clean image.
Empirically,
• we design several experiments on the two-layer model and on Resnet-32 with CIFAR-10 to verify the above findings.
• Based on these phenomena, we filter out the high-frequency part of potential adversarial examples before feeding them to the model to improve robustness. Compared with an adversarially trained model of the same architecture, our method achieves comparable robustness at a computational cost similar to normal training and with almost no loss of accuracy.
The rest of the paper is organized as follows. In Section 2, we present preliminaries and some theoretical analysis on the calculation of the spectrum and log-spectrum. We then provide our main results on the frequency-domain analysis of gradient-based adversarial examples in Section 3. Experiments supporting our theoretical findings are shown in Section 4. Finally, we conclude and discuss future work in Section 5. All details of the proofs and experiments are given in the Appendix.

2.1. PRELIMINARIES

Notations We use {0, 1, ..., d} to denote the set of all integers between 0 and d and use ‖·‖_p to denote the l_p norm. Specifically, we denote by ‖·‖ the l_2 norm. For a d-dimensional vector x, we use x_μ to denote its μ-th component, with indices starting at 0. For a scalar function f(x): R^d → R, ∇_x f and ∂_μ f denote the gradient vector and its μ-th component. We let sgn(x) = 1 for x > 0 and −1 for x < 0. Normal training refers to training on the original training data to learn the optimal weights of the neural network. x̂ = F(x) denotes the Discrete Fourier Transform (DFT) of x.

Discrete Fourier Transform

The k-th frequency component ĝ[k] of the one-dimensional DFT of a vector g is defined by

ĝ[k] = F(g)[k] = Σ_{μ=0}^{d−1} g_μ e^{i 2πkμ/d},

where k ∈ {−d/2, −d/2 + 1, ..., d/2} if d is even and k ∈ {−(d−1)/2, ..., (d−1)/2} if d is odd. For convenience, we always consider an odd d; one can easily generalize to even dimensions. For an integer cut-off (cutoff frequency) k_c ∈ (0, (d−1)/2), the Low Frequency Components (LFC) and High Frequency Components (HFC) of g are denoted by

ĝ_l[k] = ĝ[k] if −k_c ≤ k ≤ k_c, and 0 otherwise;    ĝ_h[k] = 0 if −k_c ≤ k ≤ k_c, and ĝ[k] otherwise.

Then the frequency space R^d can be decomposed as S_l ⊕ S_h, where ĝ_l ∈ S_l and ĝ_h ∈ S_h. Let N_g² = Σ_{k=−(d−1)/2}^{(d−1)/2} |ĝ[k]|². The ratio of a frequency component with respect to the whole frequency spectrum is denoted by τ(ĝ[k]) = |ĝ[k]|² / N_g², and the ratio of the LFC is denoted by τ(ĝ ∈ S_l) = Σ_{k=−k_c}^{k_c} |ĝ[k]|² / N_g².

Setup In this paper, we consider a two-layer neural network f: R^d → R,

f(x, θ) = Σ_{r=1}^{m} a_r σ(w_r^T x),   (1)

where σ is an activation function, a = (a_1, a_2, ..., a_m)^T is an m-dimensional vector and the w_r's are d-dimensional weight vectors. Van der Schaaf & van Hateren (1996) showed that natural images obey a power law in the frequency domain. Therefore, to make our setting closer to widely studied image-related tasks, we assume our input x obeys the power law |x̂[k]| ∝ |k|^{−α} to imitate LFC-concentrated images, where the constant α ≥ 1 and |x̂[0]| = α_0 is another constant. Besides, we choose the cut-off k_c such that Σ_{k=1}^{k_c} 1/k^{2α} + α_0² > Σ_{k=k_c+1}^{(d−1)/2} 1/k^{2α}, which implies τ(x̂ ∈ S_l) > τ(x̂ ∈ S_h) and is easily satisfied for a small k_c.

Gradient-based adversarial attack For a loss function ℓ, we find l_p-norm ε-bounded adversarial perturbations by solving the optimization problem max_{‖δ‖_p ≤ ε} ℓ(f(x + δ), y), where (x, y) ∈ R^d × R is a d-dimensional input and its output, following the joint distribution D.
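To make these definitions concrete, here is a small NumPy sketch (ours, not part of the paper; `lfc_hfc_ratios` is a hypothetical helper name) that computes the LFC/HFC energy ratios τ(ĝ ∈ S_l) and τ(ĝ ∈ S_h) of a signal whose spectrum follows the assumed power law |x̂[k]| ∝ |k|^{−α}:

```python
import numpy as np

def lfc_hfc_ratios(g, k_c):
    """Split the 1-D DFT of `g` at cut-off k_c and return the energy
    ratios tau(S_l), tau(S_h) of the low/high frequency components."""
    g_hat = np.fft.fft(g)
    freqs = np.fft.fftfreq(len(g), d=1.0 / len(g))  # integer frequencies k
    energy = np.abs(g_hat) ** 2
    low_mask = np.abs(freqs) <= k_c
    n_g_sq = energy.sum()
    return energy[low_mask].sum() / n_g_sq, energy[~low_mask].sum() / n_g_sq

# Build a real signal with power-law spectrum |x_hat[k]| = |k|^{-alpha}
d, alpha = 101, 1.5
half = (d - 1) // 2
spectrum = np.zeros(d, dtype=complex)
spectrum[0] = 1.0                                    # alpha_0 = 1
spectrum[1:half + 1] = np.arange(1, half + 1) ** -alpha
spectrum[half + 1:] = spectrum[1:half + 1][::-1]     # conjugate symmetry
x = np.fft.ifft(spectrum).real

tau_l, tau_h = lfc_hfc_ratios(x, k_c=5)              # LFC-concentrated signal
```

For a power-law signal and a small cut-off, tau_l comes out far above tau_h, matching the condition τ(x̂ ∈ S_l) > τ(x̂ ∈ S_h) assumed above.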
To solve the above optimization problem, Goodfellow et al. (2014) proposed FGSM, an attack for an l_∞-bounded adversary, which computes an adversarial example as x_adv = x + ε sgn(∇_x ℓ(θ, x, y)). PGD iterates

x^{t+1} = Π_{x+S}(x^t + α sgn(∇_x ℓ(θ, x^t, y)))   or   x^{t+1} = Π_{x+S}(x^t + α ∇_x ℓ(x^t, y) / ‖∇_x ℓ(x^t, y)‖_2)

for the l_∞ and l_2 cases, respectively. To simplify the analysis, we make the following assumptions on the normal training process:

Assumption 1 There exist β̲, β̄ > 0 such that β̲ ≤ |∂ℓ/∂f| ≤ β̄.

Assumption 2 If ℓ = ½(y − f(x, θ))², then |∂ℓ/∂f| should be a small quantity such that |∂ℓ/∂f| = ε^{1+ν} when the network is well trained in the end stage of normal training, where 0 < ν < 1.

Assumption 3 There exist λ, γ > 0 such that 0 ≤ σ' ≤ λ and 0 ≤ σ'' ≤ γ.

Assumption 4 The a_r's are i.i.d. draws from a distribution and are not updated during training.

Wang et al. (2020a) considered the log-spectrum of an image, 20 · log(|F(x)|), where x is a two-dimensional matrix representing an image. One way to calculate the "log-spectrum difference of the adversarial examples" is to consider the difference between the log-spectra of the adversarial and natural image, as follows.

2.2. EXPLORING THE LOG-SPECTRUM AND SPECTRUM

20 · |log(|F(x + δ)|) − log(|F(x)|)| ≈ 20 · log(1 + |F(δ)| / |F(x)|) ≈ 20 · |F(δ)| / |F(x)|.   (3)

We can observe that this quantity is approximately proportional to the ratio of the perturbation's frequency magnitude to the natural image's. Moreover, the average Relative Change in discrete cosine Transforms (RCT) in (Wang et al., 2020b) is approximately equal to it. Fig. 5 empirically shows that this ratio is large for the high-frequency components but small for the low-frequency ones. However, Van der Schaaf & van Hateren (1996) showed that the LFC of F(x) is much larger than its HFC, and the right-hand side of Eq. (3) is inversely proportional to F(x): its denominator is much larger in the low-frequency part than in the high-frequency part. It may therefore be imprudent to infer the frequency distribution of adversarial perturbations based only on the ratio on the right-hand side of Eq. (3).

In this work, we consider the two-layer neural network and the loss function ℓ = ½(y − f(x, θ))². For any x in this setting, we have

∂f/∂a_r = σ(w_r^T x),   ∂f/∂w_r = ã_r x,   ∇_x f = Σ_r ã_r w_r,

where ã_r = a_r σ'(w_r^T x) and ã = (ã_1, ã_2, ..., ã_m). At the (t+1)-th step of gradient descent in the normal training of the classifier, the weight w_r is updated as

w_r^{(t+1)} = w_r^{(t)} − η (∂ℓ/∂f)^{(t)} ã_r^{(t)} x.   (4)

We use L_x and H_x to denote the LFC and HFC energies of the clean input:

L_x = Σ_{k=0}^{k_c} |x̂[k]|²   and   H_x = Σ_{k=k_c+1}^{(d−1)/2} |x̂[k]|².   (5)

3.1. SPECTRAL TRAJECTORIES OF l_2-NORM FGM PERTURBATIONS

Consider the l_2-norm FGM perturbation δ and its DFT δ̂[k]:

δ = ε (∂ℓ/∂f) ∇_x f / ‖(∂ℓ/∂f) ∇_x f‖,   δ̂[k] = sgn(∂ℓ/∂f) (ε / ‖∇_x f‖) ∇̂f[k],   (6)

where ∇̂f[k] = F(∇_x f)[k]. Since the coefficient before ∇̂f[k] in (6) is the same for all frequencies k, we only need to analyze the spectrum of ∇̂f[k] to investigate the ratio of a certain frequency component over the whole frequency spectrum, i.e., τ(δ̂[k]) = τ(∇̂f[k]). In this subsection, we explore the evolution of τ(δ̂[k]) along the normal training process.
For a randomly initialized network, the FGM perturbation has no bias towards either the HFC or the LFC. We attack the model with FGM at each step of normal training to study the trajectory of these perturbations in the frequency domain. According to Eq. (4), ∇_x f is updated, to first order in η, at the (t+1)-th step as

∇_x f^{(t+1)} = Σ_{r=1}^{m} (1 − η_r^{(t)}) ã_r^{(t)} w_r^{(t)} − η̃^{(t)} x + O(η²),

where η_r^{(t)} = η a_r ‖x‖² (∂ℓ/∂f)^{(t)} σ''(w_r^{(t)T} x) and η̃^{(t)} = η (∂ℓ/∂f)^{(t)} ‖ã^{(t)}‖² are composite learning rates introduced to ease the notation. Since Softplus has a similar shape to ReLU σ(x) = max(0, x), we approximate η_r^{(t)} = E_{i∈{1,...,m}}[η_i^{(t)}] + O(η²), so that, to first order, the 1-d DFT of ∇_x f^{(t+1)} takes the form

∇̂f^{(t+1)}[k] = (1 − η̃^{(t)}) ∇̂f^{(t)}[k] − η̃^{(t)} x̂[k].   (7)

We are now ready to study the trajectory of τ(δ̂[k]) along the training step t. Let ϕ_k^{(t)} denote the phase difference between ∇̂f^{(t)}[k] and x̂[k] for frequency k at the t-th step, and, for ease of notation, denote

τ̃_l^{(t)} = −2 Σ_{k=0}^{k_c} |∇̂f^{(t)}[k]| |x̂[k]| cos ϕ_k^{(t)},   τ̃_h^{(t)} = −2 Σ_{k=k_c+1}^{(d−1)/2} |∇̂f^{(t)}[k]| |x̂[k]| cos ϕ_k^{(t)},   (8)

so that η̃^{(t)} τ̃_l^{(t)} and η̃^{(t)} τ̃_h^{(t)} are approximately the changed amounts of the LFC and HFC of |∇̂f^{(t)}|² at the t-th step of normal training, respectively. Since the network is randomly initialized and the clean inputs x are highly concentrated in the low-frequency domain, we always study the case where τ(δ̂^{(0)} ∈ S_l) < τ(x̂ ∈ S_l). Taking Eq. (7) into consideration, we now present our main theorem on l_2-norm FGM perturbations.
Theorem 1 (The spectral trajectory of the l_2 FGM perturbation) During the training of the two-layer neural network f(x) in (1), the l_2 FGM adversarial perturbation changes its ratio of LFC, τ(δ̂ ∈ S_l), at the (t+1)-th step as

τ(δ̂^{(t+1)} ∈ S_l) = τ(δ̂^{(t)} ∈ S_l) + η̃^{(t)} [τ̃_l^{(t)} τ(δ̂^{(t)} ∈ S_h) − τ̃_h^{(t)} τ(δ̂^{(t)} ∈ S_l)] / Σ_{k=0}^{(d−1)/2} |∇̂f^{(t+1)}[k]|²;   (9)

moreover, there exists a t_1 such that for all t ≥ t_1 we have τ(δ̂^{(t+1)} ∈ S_l) > τ(δ̂^{(t)} ∈ S_l).   (10)

Remark According to Theorem 1, during normal training of a randomly initialized network with LFC-concentrated input, there exists a t_0 such that for all t ≥ t_0 we have τ̃_l^{(t)} > τ̃_h^{(t)}, and τ(δ̂^{(t+1)} ∈ S_l) > τ(δ̂^{(t)} ∈ S_l) whenever τ(δ̂^{(t)} ∈ S_l) < τ(δ̂^{(t)} ∈ S_h). For the ReLU activation function, starting with τ(δ̂^{(0)} ∈ S_l) = 1/2 − ζ and τ̃_l^{(0)} = τ̃_h^{(0)} + bζ, we have t_0 = max{0, −bζ / (n η β̲ (L_x − H_x))}, where n = min_{t∈[0,t_0]} ‖ã^{(t)}‖². Besides, given such a network trained for at least t_0 steps, there exists a t_1 > t_0 such that τ̃_l^{(t_1)} τ(δ̂^{(t_1)} ∈ S_h) ≥ τ̃_h^{(t_1)} τ(δ̂^{(t_1)} ∈ S_l), and the FGM perturbation increases its ratio of LFC for all t > t_1, no matter whether that ratio is smaller than 1/2. Finally, if τ(δ̂^{(t_2)} ∈ S_l) > τ(δ̂^{(t_2)} ∈ S_h) for some t_2 > t_0, this relation holds for all t ≥ t_2.

3.2. HIGH-FREQUENCY CONCENTRATION FOR LOG-SPECTRUM DIFFERENCE OF ADVERSARIAL EXAMPLES

In this subsection, we explore the spectral trajectory of the log-spectrum difference of l_2 FGM adversarial examples through the ratio R_k^{(t)} = |∇̂f^{(t)}[k]| / |x̂[k]| ∝ |δ̂^{(t)}[k]| / |x̂[k]|.

Theorem 2 (Concentration of the log-spectrum difference) For l_2 FGM perturbations of the network in (1) and any k > k' > 0 satisfying R_k^{(0)} > R_{k'}^{(0)}, there exists a t* such that R_k^{(t)} > R_{k'}^{(t)} for all t < t* steps of normal training. For the ReLU activation function,

t* = (R_k^{(0)2} − R_{k'}^{(0)2}) / (2 η β̲ ‖ã‖² (R_k^{(0)} cos ϕ_k^{(0)} − R_{k'}^{(0)} cos ϕ_{k'}^{(0)})),   (11)

while if the initialization satisfies R_k^{(0)} cos ϕ_k^{(0)} ≤ R_{k'}^{(0)} cos ϕ_{k'}^{(0)}, then t* = ∞.

The theorem shows that although δ itself comes to concentrate in the low-frequency domain, it does not concentrate there as densely as the inputs x, which obey the power law. Therefore, the log-spectrum difference of adversarial examples (equivalently, the spectrum of the ratio of the perturbation to the clean data) exhibits an HFC-concentrated phenomenon instead (Fig. 1).
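As a quick numerical sanity check of the approximation in Eq. (3) (an illustration we add, not one of the paper's experiments): using natural logarithms and an impulse input, so that |F(x)| is constant and no frequency bin is near zero, the right-hand side of Eq. (3) bounds the exact log-spectrum difference up to second-order terms in the perturbation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
x = np.zeros(d)
x[0] = 1.0                              # impulse: |F(x)[k]| = 1 for every k
delta = 1e-4 * rng.standard_normal(d)   # a small generic "perturbation"

fx = np.abs(np.fft.fft(x))
fxd = np.abs(np.fft.fft(x + delta))
fd = np.abs(np.fft.fft(delta))

# Left-hand side of Eq. (3) (natural log) and its right-hand side
exact = 20 * np.abs(np.log(fxd) - np.log(fx))
approx = 20 * fd / fx
```

Because the phase of F(δ) relative to F(x) is ignored on the right-hand side, `approx` acts as an (approximate) upper envelope of `exact` for small perturbations rather than an exact equality per bin.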

3.3. MASKING HFC OF THE ADVERSARIAL EXAMPLES TO IMPROVE ROBUSTNESS

In this subsection, we compare the ratio of LFC of the adversarial perturbations, τ(δ̂ ∈ S_l), with the ratio of LFC of the original images, τ(x̂ ∈ S_l).

Theorem 3 (Comparison of the LFC ratios of clean data and perturbations) For the l_2 FGM perturbation of the two-layer neural network in (1), let ζ = τ(x̂ ∈ S_l) − τ(δ̂^{(0)} ∈ S_l) > 0. Then τ(δ̂^{(t)} ∈ S_l) < τ(x̂ ∈ S_l) for all t > 0 if the initialization of the network satisfies

τ̃_l^{(0)} < τ(x̂ ∈ S_l) (τ̃_l^{(0)} + τ̃_h^{(0)}).   (12)

Remark When both τ̃_l^{(0)} and τ̃_h^{(0)} are positive, condition (12) states that the low-frequency share of τ̃_l^{(0)} + τ̃_h^{(0)} is smaller than that of the clean data, which is easily satisfied for a randomly initialized network. The ratio of LFC of the perturbations δ thus stays below that of the clean data x whenever condition (12) holds for l_2 FGM perturbations during normal training. Masking the HFC of an adversarial example x + δ will then erase a larger share of the perturbation than of the clean data, which leads to a new, "not-perturbed-much" adversarial example. As a result, one can expect improved robustness of the model, as demonstrated in Section 4.3 later.

4. EXPERIMENT

In this section, we conduct several experiments to validate our theoretical conclusions. Setup. We use Resnet-32 (He et al., 2016) with a fully connected network as the last layer and the cross-entropy loss, and train all parameters on the CIFAR-10 dataset with the Adam optimizer (Kingma & Ba, 2014). We use the PGD attack with ε = 8/255, 40 iterations and step size ξ = 4/255. We supplement more experiments in Appendix C to fill the gap between our theory and more complicated settings: we use the MSE loss and keep other conditions unchanged to resolve the discrepancy in the loss function, and we also provide experiments supporting our theorems on the two-layer neural network with fixed outer-layer parameters.
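The attack configuration above can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: `grad_fn` stands for any routine returning the input gradient ∇_x ℓ, and the function names are ours.

```python
import numpy as np

def pgd_linf(x, grad_fn, eps=8/255, step=4/255, iters=40):
    """l_inf PGD: signed-gradient ascent steps, projecting the perturbation
    back onto the eps-box around x (and pixels onto [0, 1]) every step."""
    x_adv = x.copy()
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection onto the box
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid image
    return x_adv

def fgm_l2(x, grad_fn, eps):
    """l_2 FGM: a single normalized-gradient step of size eps."""
    g = grad_fn(x)
    return x + eps * g / (np.linalg.norm(g) + 1e-12)
```

For the l_∞ ball, the projection Π is simply elementwise clipping, which is why the sketch needs no explicit projection operator.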

4.1. THE INCREASE OF LFC CONCENTRATION

In this part, we empirically verify the theorems in Section 3.1, showing that adversarial perturbations gradually concentrate more in the low-frequency part of the spectrum during training. Starting from a randomly initialized neural network, each time before updating the parameters we randomly sample 100 images and apply the l_2 FGM / PGD attack to them. For each iteration of the PGD attack, we calculate the ratios of LFC and plot them in Fig. 2. It shows that the LFC ratios of the perturbations, τ(δ̂ ∈ S_l), gradually increase along normal training. For a sufficiently trained neural network, we choose an image, use the PGD attack to obtain its adversarial example and perturbation, and calculate the spectrum. We randomly select 10 images from the test set and show the original images and the spectra of the perturbations in Fig. 3. We also randomly select 1000 images from the test set and average their spectra to show the expected spectrum in Fig. 4(a).

4.2. HFC CONCENTRATION OF LOG-SPECTRUM DIFFERENCE

We first use the PGD attack to obtain the adversarial example of an image for a sufficiently trained neural network. We then compute the log-spectra of the original image and its adversarial example and calculate their difference (Fig. 5). The procedure of our low-pass filter has three steps: perform the DFT on the inputs, mask their high-frequency components (set them to 0) in the frequency domain, and finally perform the inverse DFT to obtain inputs with only the low-frequency components preserved. We evaluate two types of accuracy. One filters the high frequencies out of the clean images and feeds them into the model; we call the resulting test accuracy the "no-attack accuracy". The other first PGD-attacks the input images on the normally trained model and then measures the test accuracy on the filtered adversarial images; we call this the "PGD-attack accuracy". Here we use data augmentation to train the model to its best. We show the results in Fig. 6. As the cut-off increases, the no-attack accuracy gradually increases, while the PGD-attack accuracy first increases and then decreases. The highest PGD-attack accuracy, 51.7%, is achieved at cut-off k = 12, where the no-attack accuracy is 81.7%; the normal accuracy (no filter, no attack) is 93.03%. We also adversarially train a model with PGD attacks: its normal accuracy is 79% and its PGD-attack accuracy is 48% without filtering. Our low-pass-filter model not only performs better on both no-attack and PGD-attack accuracy, but also avoids computationally expensive adversarial training. Theorem 3 provides a reasonable explanation for these results. When the cut-off is low, the filter masks most of the original image as well as the adversarial perturbation, so the model performs poorly both with and without attack. With a proper cut-off, enough information of the original images remains but little of the perturbations, so the PGD attacker cannot thoroughly fool the model.
Therefore, with a suitable cut-off, the low-pass filter can indeed improve the model's robustness without computationally expensive adversarial training.
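The three-step filter described above can be sketched as follows. This is our illustrative NumPy version for a single-channel image (apply it per channel for RGB); the square cut-off mask and the name `low_pass_filter` are our choices, not the authors'.

```python
import numpy as np

def low_pass_filter(img, k_c):
    """Defense preprocessing: 2-D DFT -> zero out every frequency whose
    (symmetric) index exceeds the cut-off k_c -> inverse 2-D DFT."""
    spec = np.fft.fft2(img)
    h, w = img.shape
    ky = np.minimum(np.arange(h), h - np.arange(h))  # folded frequency index
    kx = np.minimum(np.arange(w), w - np.arange(w))
    mask = (ky[:, None] <= k_c) & (kx[None, :] <= k_c)
    return np.real(np.fft.ifft2(spec * mask))

img = np.random.default_rng(1).random((32, 32))
filtered = low_pass_filter(img, k_c=12)
```

Since the filter only zeroes bins, it is idempotent and can never increase the image's energy, which makes it a cheap, training-free preprocessing step.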

5. CONCLUSION

We investigate adversarial examples through the lens of frequency analysis. Our work both theoretically and empirically clarifies the definition of the log-spectrum difference of adversarial examples used in the existing literature. Besides, we work toward understanding the spectral trajectories of adversarial examples: l_2 FGM perturbations gradually increase their concentration in the low-frequency part of the spectrum, but their ratio of LFC never exceeds that of clean data during the training of the model parameters. Inspired by these findings, we show that a low-pass filter can improve the robustness of the model and give a reasonable explanation for this phenomenon. Future work can focus on analyzing other perturbations in the frequency domain, such as FGSM and l_∞ PGD perturbations, or on a theoretical understanding of the phenomenon that the ratio of LFC of the perturbation for an adversarially trained model is much higher than for a normally trained model (Appendix D).

A PROOFS

A.1 PROOF OF THEOREM 1

According to (8), the update of τ(δ̂[k]) for the k-th frequency component of the l_2 FGM adversarial perturbation at the (t+1)-th step is proportional to

(1 − 2η̃^{(t)}) |∇̂f^{(t)}[k]|² − 2η̃^{(t)} |∇̂f^{(t)}[k]| |x̂[k]| cos ϕ_k^{(t)}.   (13)

We adopt the following representations to track cos ϕ_k^{(t)} through the training process:

∇̂f^{(t)}[k] := c⃗^{(t)} = (|c^{(t)}| cos θ, |c^{(t)}| sin θ),   x̂[k] := x⃗^{(k)} = (|x^{(k)}| cos ζ, |x^{(k)}| sin ζ).

Then |∇̂f^{(t+1)}[k]| |x̂[k]| cos ϕ_k^{(t+1)} takes the form

⟨c⃗^{(t+1)}, x⃗^{(k)}⟩ = (1 − η̃^{(t)}) ⟨c⃗^{(t)}, x⃗^{(k)}⟩ − η̃^{(t)} |x^{(k)}|² + O(η²)

at the (t+1)-th step. Consequently,

τ̃_l^{(t+1)} − τ̃_h^{(t+1)} = (1 − η̃^{(t)}) (τ̃_l^{(t)} − τ̃_h^{(t)}) + 2η̃^{(t)} (L_x − H_x),   (14)

which means that if τ̃_l^{(t)} > τ̃_h^{(t)} then τ̃_l^{(t')} > τ̃_h^{(t')} for all t' ≥ t; moreover, there must be a step that produces this condition, since L_x > H_x. Let L^{(t)} = Σ_{k=0}^{k_c} |∇̂f^{(t)}[k]|² and H^{(t)} = Σ_{k=k_c+1}^{(d−1)/2} |∇̂f^{(t)}[k]|². At the (t+1)-th step, the changed amount of τ(δ̂ ∈ S_l) is

[(1 − 2η̃^{(t)}) L^{(t)} + η̃^{(t)} τ̃_l^{(t)}] / [(1 − 2η̃^{(t)}) (L^{(t)} + H^{(t)}) + η̃^{(t)} (τ̃_l^{(t)} + τ̃_h^{(t)})] − L^{(t)} / (L^{(t)} + H^{(t)})
= η̃^{(t)} [τ̃_l^{(t)} τ(δ̂^{(t)} ∈ S_h) − τ̃_h^{(t)} τ(δ̂^{(t)} ∈ S_l)] / Σ_{k=0}^{(d−1)/2} |∇̂f^{(t+1)}[k]|².   (15)

If t > t_0 and τ̃_l^{(t)} τ(δ̂^{(t)} ∈ S_h) > τ̃_h^{(t)} τ(δ̂^{(t)} ∈ S_l), the above changed amount is positive and gradient descent increases τ(δ̂ ∈ S_l) at this step. If instead τ̃_l^{(t)} τ(δ̂^{(t)} ∈ S_h) < τ̃_h^{(t)} τ(δ̂^{(t)} ∈ S_l), gradient descent increases τ(δ̂^{(t)} ∈ S_h) until the step t_1 at which τ̃_l^{(t_1)} τ(δ̂^{(t_1)} ∈ S_h) ≥ τ̃_h^{(t_1)} τ(δ̂^{(t_1)} ∈ S_l); then for any t > t_1, we have τ(δ̂^{(t+1)} ∈ S_l) > τ(δ̂^{(t)} ∈ S_l).

A.2 PROOF OF THEOREM 2

We give the proof for the ReLU activation function with positive η̃; the case of negative η̃ is similar. The update rules of R_k at the (t+1)-th step are

R_k^{(t+1)2} = R_k^{(t)2} − 2η̃^{(t)} R_k^{(t)} cos ϕ_k^{(t)},   R_k^{(t)} cos ϕ_k^{(t)} = R_k^{(t−1)} cos ϕ_k^{(t−1)} − η̃^{(t−1)}.
For any k > k' > 0 with R_k^{(0)} > R_{k'}^{(0)}, at the (t+1)-th step we have

R_k^{(t+1)2} − R_{k'}^{(t+1)2} = R_k^{(t)2} − R_{k'}^{(t)2} − 2η̃^{(t)} (R_k^{(t)} cos ϕ_k^{(t)} − R_{k'}^{(t)} cos ϕ_{k'}^{(t)})
= R_k^{(0)2} − R_{k'}^{(0)2} − 2 Σ_{t'=0}^{t} η̃^{(t')} (R_k^{(0)} cos ϕ_k^{(0)} − R_{k'}^{(0)} cos ϕ_{k'}^{(0)}).

If R_k^{(0)} cos ϕ_k^{(0)} − R_{k'}^{(0)} cos ϕ_{k'}^{(0)} > 0, taking η̃^{(t')} = η̃_max for all t' and requiring the left-hand side of the above equation to be larger than 0 gives Eq. (11). On the other hand, if R_k^{(0)} cos ϕ_k^{(0)} − R_{k'}^{(0)} cos ϕ_{k'}^{(0)} < 0, then R_k^{(t)} > R_{k'}^{(t)} for all t ≥ 0.

A.3 PROOF OF THEOREM 3

We consider the case of l_2 FGM perturbations with η̃ > 0; the case of negative η̃ is similar. Let the initialization of the network satisfy

L^{(0)} / (L^{(0)} + H^{(0)}) = τ(δ̂^{(0)} ∈ S_l) < τ(x̂ ∈ S_l) = L_x / (L_x + H_x)   (16)

such that τ(x̂ ∈ S_l) − τ(δ̂^{(0)} ∈ S_l) = ζ > 0. Then at the t-th step we can express τ(δ̂^{(t)} ∈ S_l) as

L^{(t)} / (L^{(t)} + H^{(t)}) = [L^{(0)} + Σ_{t'=0}^{t−1} η̃^{(t')} τ̃_l^{(t')}] / [L^{(0)} + H^{(0)} + Σ_{t'=0}^{t−1} η̃^{(t')} (τ̃_l^{(t')} + τ̃_h^{(t')})]
= [L^{(0)} + Σ_{t'=0}^{t−1} η̃^{(t')} (Σ_{t''=0}^{t'} η̃^{(t'')} L_x + τ̃_l^{(0)})] / [L^{(0)} + H^{(0)} + Σ_{t'=0}^{t−1} η̃^{(t')} (Σ_{t''=0}^{t'} η̃^{(t'')} (L_x + H_x) + τ̃_l^{(0)} + τ̃_h^{(0)})].

Let

a = Σ_{t'=0}^{t−1} η̃^{(t')} Σ_{t''=0}^{t'} η̃^{(t'')} L_x,   b = Σ_{t'=0}^{t−1} η̃^{(t')} Σ_{t''=0}^{t'} η̃^{(t'')} H_x,   c = Σ_{t'=0}^{t−1} η̃^{(t')} τ̃_l^{(0)},   d = Σ_{t'=0}^{t−1} η̃^{(t')} τ̃_h^{(0)};

then a / (a + b) = τ(x̂ ∈ S_l), c / (c + d) = τ̃_l^{(0)} / (τ̃_l^{(0)} + τ̃_h^{(0)}), and

L^{(t)} / (L^{(t)} + H^{(t)}) = (L^{(0)} + a + c) / (L^{(0)} + H^{(0)} + a + b + c + d) = τ(x̂ ∈ S_l) · [1 + c/a + L^{(0)}/a] / [1 + (L^{(0)} + H^{(0)})/(a + b) + (c + d)/(a + b)].

If

c/a + L^{(0)}/a < (L^{(0)} + H^{(0)})/(a + b) + (c + d)/(a + b),

equivalently

c / (L^{(0)} + H^{(0)}) + τ(δ̂^{(0)} ∈ S_l) < τ(x̂ ∈ S_l) [1 + (c + d)/(L^{(0)} + H^{(0)})],

i.e.,

τ̃_l^{(0)} − τ(x̂ ∈ S_l) (τ̃_l^{(0)} + τ̃_h^{(0)}) < ζ (L^{(0)} + H^{(0)}) / Σ_{t'=0}^{t−1} η̃^{(t')},

then we must have τ(δ̂^{(t)} ∈ S_l) < τ(x̂ ∈ S_l) at the t-th step. A stronger condition is τ̃_l^{(0)} < τ(x̂ ∈ S_l)(τ̃_l^{(0)} + τ̃_h^{(0)}), under which τ(δ̂^{(t)} ∈ S_l) < τ(x̂ ∈ S_l) holds for all t. This condition approximately states that the low-frequency "ratio" of τ̃_l^{(0)} + τ̃_h^{(0)} is smaller than that of the clean data.

B  l_2 PGD PERTURBATIONS

The PGD update rule for finding perturbations of x with learning rate ξ at step j + 1 is

δ^{(j+1)} = P_{B(0,ε)} [δ^{(j)} + ξ (∂ℓ/∂f)^{(j)} ∇_x f^{(j)}],

where B(0, ε) is the ball of radius ε centered at 0 in Euclidean space and P is the projection operator defined as P_{B(0,ε)}[δ] = argmin_{δ'∈B(0,ε)} ‖δ' − δ‖_2. Note that in this part, quantities without the step script j (e.g., ∂ℓ/∂f and ∇_x f) are evaluated without the perturbation.
For convenience, we use the same learning rate ξ for all steps j, and instead explore the update of

κ^{(j)} = δ^{(j)} / ξ   (17)

to first order in δ, in order to track the trend of PGD perturbations in the frequency domain across PGD iterations, since τ(δ̂[k]) = τ(κ̂[k]). For ease of notation, let φ_k^{(j)} denote the phase difference between ∇̂f[k] and κ̂^{(j)}[k]; let β̃^{(j)} = ∂ℓ/∂f + δ^{(j)} · ∇_x(∂ℓ/∂f), where β̃^{(0)} > 0 (one can derive similar results for β̃^{(0)} < 0); and denote

τ̂_l^{(j)} = 2 Σ_{k=0}^{k_c} |κ̂^{(j)}[k]| |∇̂f[k]| cos φ_k^{(j)},   τ̂_h^{(j)} = 2 Σ_{k=k_c+1}^{(d−1)/2} |κ̂^{(j)}[k]| |∇̂f[k]| cos φ_k^{(j)},

so that β̃^{(j)} τ̂_l^{(j)} and β̃^{(j)} τ̂_h^{(j)} are the changed amounts of the LFC and HFC of |κ̂|² at the (j+1)-th PGD iteration. We provide below our result on the frequency spectrum of l_2-norm PGD perturbations.

Theorem 4 (The spectral trajectory of the l_2 PGD perturbation) At the (j+1)-th step of PGD, the iteration changes the ratio of LFC of the l_2 PGD adversarial perturbation for a neural network (1) satisfying |∂ℓ/∂f| = ε^{1+ν} with 0 < ν < 1 as follows:

τ(δ̂ ∈ S_l) ← τ(δ̂ ∈ S_l) + β̃^{(j)} [τ̂_l^{(j)} τ(δ̂^{(j)} ∈ S_h) − τ̂_h^{(j)} τ(δ̂^{(j)} ∈ S_l)] / Σ_{k=0}^{(d−1)/2} |κ̂^{(j+1)}[k]|².   (18)

Remark If the two-layer neural network in (1) is trained for at least t ≥ t_1 steps (t_1 determined by Theorem 1), so that τ(∇̂f ∈ S_l) > τ(∇̂f ∈ S_h), then, according to Theorem 4, there exists

j_0 = max{0, (τ̂_h^{(0)} − τ̂_l^{(0)}) / (β̄ [Σ_{k=0}^{k_c} |∇̂f[k]|² − Σ_{k=k_c+1}^{(d−1)/2} |∇̂f[k]|²])},   where β̄ = max_{i∈[0,j_0]} β̃^{(i)},

such that τ(δ̂^{(j+1)} ∈ S_l) > τ(δ̂^{(j)} ∈ S_l) for all j > j_0 whenever τ(δ̂^{(j)} ∈ S_l) < τ(δ̂^{(j)} ∈ S_h).

B.1 PROOF OF THEOREM 4

At the (j+1)-th step of the PGD update for κ^{(j+1)}:

1. If P_{B(0,ε)} = I (the projection is inactive):
κ^{(j+1)} = κ^{(j)} + (∂ℓ/∂f) ∇_x f + (∂ℓ/∂f) Σ_r ã_r σ''(w_r^T x) (δ^{(j)} · W_{:,r}) W_{:,r} + (δ^{(j)} · ∇_x(∂ℓ/∂f)) ∇_x f + O(δ²);

2. If P_{B(0,ε)}^{(j)} ≠ I (the projection is active): κ^{(j+1)} is the rescaling of κ^{(j)} + (∂ℓ/∂f)^{(j)} ∇_x f^{(j)} onto the boundary of the ball.

In either case, the quantity τ(κ̂[k]) is proportional to

|F(κ^{(j)} + (∂ℓ/∂f)^{(j)} ∇_x f^{(j)})|² = |F(κ^{(j)} + (∂ℓ/∂f + δ^{(j)} · ∇_x(∂ℓ/∂f)) ∇_x f)|²,

since the remaining terms are of order δ² and can be dropped. Therefore, we now consider

κ̂^{(j+1)}[k] = κ̂^{(j)}[k] + (∂ℓ/∂f + δ^{(j)} · ∇_x(∂ℓ/∂f)) ∇̂f[k]

at the (j+1)-th step of PGD in the frequency domain, to explore the ratio of frequency k over the whole frequency spectrum of the perturbation:

τ(δ̂^{(j+1)}[k]) ∝ |κ̂^{(j)}[k]|² + 2 (∂ℓ/∂f + δ^{(j)} · ∇_x(∂ℓ/∂f)) |κ̂^{(j)}[k]| |∇̂f[k]| cos φ_k^{(j)}.

Lemma 1 (Dynamics of l_2-norm PGD in the frequency domain) If the initialization of the l_2-norm PGD adversarial perturbation satisfies ∂ℓ/∂f + δ^{(0)} · ∇_x(∂ℓ/∂f) > 0, then this quantity remains positive at every iteration of PGD.

Similar to Eq. (15), one can derive the changed amount of τ(δ̂ ∈ S_l) in Theorem 4 at the (j+1)-th step and find the necessary condition.

C SUPPLEMENTARY EXPERIMENT

In this part, we present experiments on models trained with the MSE loss and on two-layer neural networks with fixed outer-layer parameters, complementing the setup of Section 4.

C.1 MSE LOSS

We show the supplementary experiment for Section 4.1 in Fig. 7 and Fig. 8. They give similar results, showing that the spectrum of adversarial perturbations is more concentrated in the low-frequency domain after sufficient training. We also use two-layer neural networks with fixed outer-layer parameters. The input layer is of size 3 * 32 * 32 followed by a ReLU activation function, and the hidden layer is of size 500 followed by a sigmoid activation function. The dataset is still CIFAR-10. We use random initialization and the SGD optimizer without any training tricks. This setting can be used in practice and achieves higher than 40% test accuracy. The attack method is the same as in the setup of Section 4. We show the LFC ratio of the adversarial perturbations during training in Fig. 11. After sufficient training, we show the expected spectrum in Fig. 12(a) and the expected log-spectrum difference in Fig. 12(b).
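For reference, the two-layer model f(x) = Σ_r a_r σ(w_r^T x) from Eq. (1) with fixed outer-layer parameters can be sketched in a few lines of NumPy. This is our illustration, not the authors' code: we use a Softplus activation (covered by Assumption 3) and random ±1 outer weights (Assumption 4); `TwoLayerNet` and its method names are hypothetical.

```python
import numpy as np

def softplus(z):
    # numerically stable log(1 + e^z)
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # derivative of softplus

class TwoLayerNet:
    """f(x) = sum_r a_r * sigma(w_r^T x) with fixed outer weights a_r."""
    def __init__(self, d, m, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((m, d)) / np.sqrt(d)  # rows are w_r
        self.a = rng.choice([-1.0, 1.0], size=m)           # fixed, not trained

    def forward(self, x):
        return self.a @ softplus(self.W @ x)

    def input_grad(self, x):
        # grad_x f = sum_r a_r * sigma'(w_r^T x) * w_r, i.e. sum_r atilde_r w_r
        return (self.a * sigmoid(self.W @ x)) @ self.W

net = TwoLayerNet(d=8, m=16)
x = np.ones(8)
g = net.input_grad(x)
```

The analytic input gradient `input_grad` is exactly the quantity ∇_x f = Σ_r ã_r w_r whose spectrum the FGM analysis above tracks.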



Footnotes: (1) e.g., λ = 1 and γ = 1/4 if σ(x) = ln(1 + e^x). (2) e.g., a Bernoulli distribution. (3) A similar conclusion holds when ∂ℓ/∂f + δ^{(0)} · ∇_x(∂ℓ/∂f) < 0.



Fig. 1 visually shows the calculation of the spectrum of perturbations F(δ) and of the log-spectrum difference of adversarial examples, |log(|F(x + δ)|) − log(|F(x)|)| ≈ |F(δ)| / |F(x)|.

We track R_k^{(t)} along the training step t as in Section 3.1 to study the HFC-concentration phenomenon. Since |∇̂f^{(0)}[k]| for a randomly initialized network (1) has no bias towards high or low frequencies while |x̂[k]| ∝ k^{−α}, we adopt the setting R_k^{(0)} ∝ k^α without loss of generality and present our result on the log-spectrum difference of the adversarial examples.

Figure 2: The LFC ratio of the adversarial perturbations in the training process. (a) The attack method is l 2 FGM. (b) The attack method is l 2 PGD

Figure 4: The expectation of: (a) the spectrum of adversarial perturbations; (b) the log-spectrum difference of adversarial examples.

They show that the spectrum of adversarial perturbations is more concentrated in the low-frequency domain after sufficient training. Ortiz-Jimenez et al. (2020) show that the distance to the boundary (margin) in different frequency bands depends heavily on the distribution used for training, which also supports our claims.

Fig. 4(b) displays the expected log-spectrum difference for 1000 random images from the test set. It shows that the log-spectrum difference of adversarial examples is generally concentrated in the high-frequency part.

Figure 5: The log-spectrum difference of adversarial examples. In each 3×2-image part, the first row shows the original example, the adversarial example generated by PGD on a normally trained Resnet-32, and the perturbation. The second row shows the log-spectrum of the original example, the log-spectrum of the adversarial example, and the difference of the two log-spectra.

Figure 6: Normally trained model's performance and robustness after a low-pass filter with cut-off from 1 to 22.

Figure 7: The difference of the spectrum between original and adversarial examples. The model is trained with the MSE loss. We show the supplementary experiment for Section 4.2 in Fig. 9 and Fig. 10. They give similar results, showing that the log-spectrum difference of adversarial examples is generally concentrated in the high-frequency part. In general, there is little empirical gap between the MSE loss and the cross-entropy loss.

Figure 8: The expectation of the spectrum of adversarial perturbations. The model is trained with the MSE loss.

Figure 9: The difference of the log-spectrum between original and adversarial examples. The model is trained with the MSE loss.

Figure 10: The expectation of the log-spectrum of adversarial perturbations. The model is trained with the MSE loss.

Figure 11: The LFC ratio of the adversarial perturbations for two-layer neural networks in the training process. (a) The attack method is l 2 FGM; (b) The attack method is l 2 PGD.

Figure 12: For two-layer neural network, the expectation of: (a) the spectrum of adversarial perturbations; (Due to the large value gap between LFC and HFC, it is hard to see the trend in original result, so we show the logarithm of the result which will not change the size relationship.) (b) the log-spectrum difference of adversarial examples.

Figure 13: The ratio of LFC of original images, of perturbations in a normally trained model, and of perturbations in a PGD adversarially trained model.

