IMPROVING ADVERSARIAL ROBUSTNESS VIA FREQUENCY REGULARIZATION

Abstract

Deep neural networks (DNNs) are highly vulnerable to crafted, human-imperceptible adversarial perturbations. While adversarial training (AT) has proven to be an effective defense, the mechanisms by which AT improves robustness remain an open issue. In this paper, we investigate AT from a spectral perspective, providing new insights into the design of effective defenses. Our analyses show that AT induces the deep model to focus more on the low-frequency region, which retains shape-biased representations, to gain robustness. Further, we find that the spectrum of a white-box attack is primarily distributed in the regions the model focuses on, and that the perturbation attacks the spectral bands where the model is vulnerable. To train a model tolerant to frequency-varying perturbations, we propose a frequency regularization (FR) such that the spectral output inferred from an attacked input stays as close as possible to that of its natural counterpart. Experiments demonstrate that FR and its weight averaging (WA) extension can improve the robust accuracy by 1.14% ∼ 4.57% relative to AT, across multiple datasets (SVHN, CIFAR-10, CIFAR-100, and Tiny ImageNet) and various attacks (PGD, C&W, and AutoAttack), without any extra data.

1. INTRODUCTION

DNNs have exhibited strong capabilities in various applications such as computer vision He et al. (2016), natural language processing Devlin et al. (2018), and recommendation systems Covington et al. (2016). However, research in adversarial learning shows that even well-trained DNNs are highly susceptible to adversarial perturbations Goodfellow et al. (2014); Szegedy et al. (2013). These perturbations are nearly indistinguishable to human eyes but can mislead neural networks into completely erroneous outputs, thus endangering safety-critical applications. Among various defense methods for improving robustness Das et al. (2018); Mao et al. (2019); Zheng et al. (2020), adversarial training (AT) Madry et al. (2017), which feeds adversarial inputs into a DNN to solve a min-max optimization problem, proves to be an effective means without the obfuscated gradients problem Athalye et al. (2018). Some recent results inspired by AT further boost the robust accuracy: Zhang et al. (2019) identify a trade-off between standard and robust accuracies that serves as a guiding principle for designing defenses. Wu et al. (2020) claim that the weight loss landscape is closely related to the robust generalization gap, and propose an effective adversarial weight perturbation method to overcome the robust overfitting problem Rice et al. (2020). Jia et al. (2022) introduce a learnable attack strategy to automatically produce the proper hyperparameters for generating the perturbations during training to improve the robustness.
On the other hand, frequency analysis provides a new lens on the generalization behavior of DNNs. Wang et al. (2020a) claim that convolutional neural networks (CNNs) can capture human-imperceptible high-frequency components of images for prediction. They find that robust models have smooth convolutional kernels in the first layer, thereby paying more attention to low-frequency information. Yin et al. (2019) establish a connection between the frequency of common corruptions and model performance, especially for high-frequency corruptions. They view AT as a data augmentation method that biases the model toward low-frequency information, which improves robustness to high-frequency corruptions at the cost of reduced robustness to low-frequency ones. Zhang & Zhu (2019) find that AT-CNNs are better at capturing long-range correlations such as shapes, and are less biased toward textures than normally trained CNNs on popular object recognition datasets. Our findings are similar, but we derive them from a spectral perspective. Wang et al. (2020b) state that perturbations mainly concentrate on the high-frequency information in natural images, and that low-frequency information is more robust than the high-frequency part. They claim that developing a stronger association between low-frequency information and true labels makes the model robust.
However, our study shows that building this connection alone cannot render the model adversarially robust. The closest work to ours is Maiya et al. (2021), which discovers that the adversarial perturbation is data-dependent and analyzes many intriguing properties of AT under frequency constraints. Our research goes one step further to show that the perturbation is also model-dependent, and explains why it behaves differently across datasets and models. Besides, we propose a frequency regularization (FR) to improve robust accuracy. These findings motivate us to zoom in on a deeper AT analysis from a spectral viewpoint. Specifically, we obtain models with different frequency biases and study the distribution of their corresponding white-box attack perturbations across different datasets. We then propose a simple yet effective FR to improve adversarial robustness and validate it on multiple datasets. Our main contributions are:
• We find that AT facilitates the model to focus on robust low-frequency information, which contains the shape-biased representation, to improve robustness. In contrast, simply focusing on low-frequency information does not lead to adversarial robustness.
• We reveal for the first time that the white-box attack is primarily distributed in the frequencies the model focuses on, and can adapt its aggressive frequency distribution to the model's sensitivity to frequency corruptions. This explains why white-box attacks are hard to defend against.
• We propose an FR that enforces alignment of the outputs of natural and adversarial examples in the frequency domain, thus effectively improving adversarial robustness.
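To make the last point concrete, a minimal sketch of what a frequency-domain alignment term could look like is given below. This is an illustrative reading, not the paper's exact formulation: the helper name `fr_penalty` and the L1 distance between output spectra are our assumptions.

```python
import numpy as np

def fr_penalty(z_nat, z_adv):
    """Hypothetical FR term: L1 distance between the spectra of the
    model's outputs for a natural input and its adversarial counterpart.
    It is zero when the two outputs coincide and would be added to the
    AT loss as a regularizer."""
    spec_nat = np.fft.fft(z_nat)
    spec_adv = np.fft.fft(z_adv)
    return float(np.mean(np.abs(spec_adv - spec_nat)))
```

Minimizing such a term pushes the model to produce spectrally similar outputs for a natural input and its perturbed version, matching the stated goal of tolerating frequency-varying perturbations.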

2. PRELIMINARIES

Typically, AT updates the model weights to solve the min-max saddle-point optimization problem:

$\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} \max_{\|\delta\|_{p}\leq\epsilon} \mathcal{L}\left(f_{\theta}(x_i+\delta),\, y_i\right)$,

where n is the number of training examples, x_i + δ is the adversarial input within the ϵ-ball (bounded by an L_p-norm) centered at the natural input x_i, δ is the perturbation, y_i is the true label, f_θ is the DNN with weights θ, and L(·) is the classification loss, e.g., cross-entropy (CE). We refer to the adversarially trained model as the robust model and the naturally trained model as the natural model. The accuracy achieved on natural and adversarial inputs is denoted as standard accuracy and robust accuracy, respectively. We define high-pass filtering (HPF) with bandwidth k as the operation that, after a Fast Fourier Transform (FFT), preserves only the k × k patch in the center of the unshifted spectrum (viz. the high frequencies), zeroes all values outside it, and then applies the inverse FFT. Low-pass filtering (LPF) is defined similarly, except that the low-frequency part is shifted to the center after the FFT so that it is the part preserved by the central k × k patch, as in Yin et al. (2019).
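The inner maximization above is commonly approximated with projected gradient descent (PGD) Madry et al. (2017). As a self-contained sketch, the version below attacks a toy linear softmax classifier with an analytic cross-entropy gradient; the function name and the toy model are ours, not the paper's, and in practice the attack runs on image batches against a deep f_θ.

```python
import numpy as np

def pgd_attack(x, y, W, steps=10, eps=0.03, alpha=0.01):
    """Approximate max_{||delta||_inf <= eps} L(f(x + delta), y) for a
    toy linear softmax classifier f(x) = W @ x with cross-entropy loss."""
    x_adv = x.copy()
    for _ in range(steps):
        logits = W @ x_adv
        p = np.exp(logits - logits.max())
        p /= p.sum()                              # softmax probabilities
        onehot = np.eye(W.shape[0])[y]
        grad = W.T @ (p - onehot)                 # dL/dx for cross-entropy
        x_adv = x_adv + alpha * np.sign(grad)     # gradient-ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the eps-ball
    return x_adv
```

AT then trains on the pairs (x_adv, y_i) instead of (x_i, y_i), which yields the outer minimization over θ.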

3.1. FREQUENCY ATTENTION & LOW-FREQUENCY INFORMATION

Attention to the Frequency Domain. Since the labels are inherently tied to the low-frequency information Wang et al. (2020a), to maintain high standard accuracy and explore the connection between low-frequency information and adversarial robustness, we train models (denoted as L-models) on natural inputs processed by an LPF with a bandwidth of 16 for multiple datasets (32 for Tiny ImageNet), cf. Table 1. Then, we feed natural inputs processed by LPFs with different bandwidths into the models to evaluate the accuracy, which reflects how much attention the models pay to low-frequency information. Results are shown in Table 1. For natural models, the standard accuracy gradually increases as the bandwidth increases, indicating that the models utilize both low- and high-frequency information, consistent with the findings
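The LPF used here (and the HPF of Sec. 2) can be sketched with numpy's FFT helpers. The function names and the square binary mask are our reading of the definitions above, not code from the paper:

```python
import numpy as np

def lpf(x, k):
    """Low-pass filter with bandwidth k: shift the low frequencies to the
    center of the spectrum, keep the central k x k patch, zero the rest,
    and invert, as in Yin et al. (2019)."""
    h, w = x.shape
    spec = np.fft.fftshift(np.fft.fft2(x))
    mask = np.zeros((h, w))
    top, left = (h - k) // 2, (w - k) // 2
    mask[top:top + k, left:left + k] = 1.0
    return np.fft.ifft2(np.fft.ifftshift(spec * mask)).real

def hpf(x, k):
    """High-pass filter with bandwidth k: without shifting, the central
    k x k patch of the raw FFT spectrum holds the highest frequencies."""
    h, w = x.shape
    spec = np.fft.fft2(x)
    mask = np.zeros((h, w))
    top, left = (h - k) // 2, (w - k) // 2
    mask[top:top + k, left:left + k] = 1.0
    return np.fft.ifft2(spec * mask).real
```

With bandwidth 16 on 32 × 32 inputs, lpf(x, 16) keeps the quarter of spectral coefficients closest to DC, which corresponds to the preprocessing used to train the L-models.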


