AN ADVERSARIAL ATTACK VIA FEATURE CONTRIBUTIVE REGIONS

Abstract

Recently, many advanced algorithms have been proposed to exploit the vulnerability of CNNs to adversarial examples. Most of these algorithms directly modify global pixels with small perturbations, while some work modifies local pixels instead. However, global attacks suffer from perturbation redundancy, and the existing local attacks are not sufficiently effective. To overcome this challenge, we achieve a trade-off between the perturbation strength and the number of perturbed pixels in this paper. The key idea is to find the feature contributive regions (FCRs) of an image. Furthermore, in order to make the adversarial example as similar to the corresponding clean image as possible, we redefine a loss function as the objective of the optimization and then use a gradient descent algorithm to find efficient perturbations. Our comprehensive experiments demonstrate that the FCRs attack shows strong attack ability in both white-box and black-box settings on both the CIFAR-10 and ILSVRC-2012 datasets.

1. INTRODUCTION

The development of deep learning technology has promoted the successful application of deep neural networks (DNNs) in various fields, such as image classification (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014), computer vision (He et al., 2016; Taigman et al., 2014), and natural language processing (Devlin et al., 2018; Goldberg, 2017). In particular, convolutional neural networks (CNNs), a typical class of DNNs, have shown excellent performance in image classification. However, many works have shown that CNNs are extremely vulnerable to adversarial examples (Szegedy et al., 2013). An adversarial example is crafted by adding well-designed perturbations, almost imperceptible to human vision, to a clean example, yet it can fool CNNs. Scholars have proposed a variety of methods to craft adversarial examples, such as L-BFGS (Szegedy et al., 2013), FGSM (Goodfellow et al., 2014), I-FGSM (Kurakin et al., 2016), PGD (Madry et al., 2017) and C&W (Carlini & Wagner, 2017). These attack strategies can successfully mislead CNNs into making incorrect predictions, restricting the application of CNNs in certain security-sensitive areas (such as autonomous driving and financial payments based on face recognition). Therefore, learning how to generate adversarial examples is of great significance. According to the region to which perturbations are added, these attacks fall into two categories: global attacks and local attacks. Global attacks attempt to perturb all pixels of the clean image, such as FGSM (Goodfellow et al., 2014), PGD (Madry et al., 2017) and C&W (Carlini & Wagner, 2017); local attacks modify only some pixels of the clean image, such as the one-pixel attack (Su et al., 2019) and JSMA (Papernot et al., 2016b).
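To make the global-attack family concrete, the following is a minimal numpy sketch of FGSM, the simplest of the gradient-based attacks listed above. A linear softmax classifier stands in for a CNN so the input gradient can be written in closed form; the model, sizes, and epsilon value are illustrative assumptions, not part of any cited work.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm_attack(x, y, W, b, eps):
    """One-step FGSM: x_adv = x + eps * sign(grad_x CE_loss).
    Here the 'network' is a linear softmax classifier (W, b), a
    hypothetical stand-in for a CNN."""
    p = softmax(W @ x + b)
    # Gradient of the cross-entropy loss w.r.t. the input x:
    #   dL/dx = W^T (p - one_hot(y))
    grad = W.T @ (p - np.eye(len(b))[y])
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# Toy example: a 4-dimensional "image", 3 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
x = rng.uniform(0, 1, size=4)
y = int(np.argmax(W @ x + b))       # the clean prediction
x_adv = fgsm_attack(x, y, W, b, eps=0.3)
print(np.abs(x_adv - x).max() <= 0.3 + 1e-9)  # perturbation bounded by eps
```

Note that every pixel is perturbed by exactly eps (up to clipping), which is precisely the "global" behavior this paper argues is redundant.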
At present, global attacks perturb all pixels of the whole image, which not only fails to concentrate on destroying the feature contributive regions (the critical semantics of an image) but also increases the degree of image distortion; we explain this in detail in the experimental section. Local attacks seem able to solve this problem, but the local attacks proposed so far do not focus on undermining the feature contributive regions of the image. Papernot et al. (2016b) proposed a method of crafting adversarial examples based on the Jacobian Saliency Map by constraining the ℓ0 norm of the perturbations, which means that only a few pixels in the image are modified. However, this method over-modifies the values of those pixels, making the added perturbations easily perceptible to the naked eye, and its adversarial strength is weak (Akhtar & Mian, 2018). Su et al. (2019) proposed an extreme adversarial attack, the one-pixel attack, which can fool CNNs by changing only 1 to 5 pixels. This method works well on low-resolution images (such as CIFAR-10), but its success rate drops greatly on high-resolution images (such as ImageNet), and it incurs very large ℓ1 distortion (Xu et al., 2018). In this paper, we propose a novel attack method to overcome the redundant perturbations of global attacks and the poor strength of existing local attacks. Inspired by the work of CAM (Zhou et al., 2016) and Grad-CAM (Selvaraju et al., 2017), adding perturbations only to the critical semantics is an effective way to reduce image distortion and computational complexity. As is well known, a CNN is an end-to-end representation learning model, which starts from simple low-level features and combines them into abstract high-level features layer by layer.
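The ℓ0-constrained family above can be illustrated with a simplified sketch: perturb only the k most salient pixels, where saliency is approximated by the input-gradient magnitude. This is a toy stand-in for JSMA's saliency-map selection, not the published algorithm; the gradient here is a random dummy array.

```python
import numpy as np

def topk_pixel_attack(x, grad, k, delta):
    """Perturb only the k pixels with the largest gradient magnitude,
    in the direction of the gradient sign (a simplified, JSMA-like
    l0-constrained sketch)."""
    idx = np.argsort(np.abs(grad))[-k:]      # the k most salient pixels
    x_adv = x.copy()
    x_adv[idx] += delta * np.sign(grad[idx])
    return np.clip(x_adv, 0.0, 1.0), idx

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=16)
grad = rng.normal(size=16)                   # dummy input gradient
x_adv, idx = topk_pixel_attack(x, grad, k=3, delta=0.5)
print((x_adv != x).sum() <= 3)               # at most k pixels changed
```

The large per-pixel step (delta=0.5) illustrates the drawback noted above: with so few pixels available, each one must be modified heavily, which is what makes such perturbations visible.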
Thus, Grad-CAM (Selvaraju et al., 2017) uses the gradient information of the last convolutional layer as a metric to understand the contribution of each neuron to the target classification, and shows in a visual way that not all image pixels contribute to the model's classification. Similarly, as shown in Figure 1, the red area is the main contributive area. Therefore, perturbing the image globally is not the most efficient strategy. We propose the FCRs attack, which adds perturbations only in the feature contributive regions (FCRs), with the aim of generating sparse and more effective perturbations. In particular, compared with existing local attacks, our proposed method perturbs continuous semantic regions rather than discrete pixels. In this work, we use Grad-CAM to locate the regions that have the greatest impact on the classification decisions of CNNs. To keep the adversarial example as similar to the corresponding clean image as possible, the objective function we optimize is the sum of two terms: the ℓ2 norm of the perturbations and the loss function of the generated adversarial example. We then use a stochastic gradient descent optimization algorithm to find efficient perturbations. To avoid the situation where the perturbations stop updating as the objective function tends to zero, we also introduce an inverse temperature T, inspired by Hinton et al. (2015). Compared to previous work, the contributions of our work are summarized as follows:

• We propose an attack via feature contributive regions (FCRs) that achieves a trade-off between a powerful attack and small perturbations. More importantly, this work implements an effective local attack algorithm by redefining the objective function.

• In particular, we propose a novel inverse temperature T, which avoids the situation where the loss function of the generated adversarial example tends to zero when stochastic gradient descent is used to find the perturbations.

• Comprehensive experiments demonstrate that the FCRs attack consistently outperforms state-of-the-art methods on the CIFAR-10 and ILSVRC-2012 datasets. In addition, we verify the importance of FCRs by dividing the original clean image into two parts (i.e., FCRs and Non-FCRs).
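The masked-perturbation optimization described above can be sketched in a few lines of numpy. This is a minimal illustration under strong assumptions: a linear softmax classifier stands in for the CNN, the FCR mask is hard-coded, and the loss weight lam and the placement of the temperature T (dividing the logits before the softmax) are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fcr_attack(x, y, W, b, mask, steps=200, lr=0.1, lam=5.0, T=2.0):
    """Gradient-descent sketch of an FCR-style objective:
        minimize ||delta||_2^2 - lam * CE(softmax(logits / T), y),
    with delta restricted to the FCR mask. Maximizing the true-class
    cross-entropy is the untargeted attack direction; T softens the
    softmax so its gradient does not vanish."""
    delta = np.zeros_like(x)
    n_cls = len(b)
    for _ in range(steps):
        p = softmax((W @ (x + delta) + b) / T)
        # d(CE)/d(input) = W^T (p - one_hot(y)) / T
        g_ce = W.T @ (p - np.eye(n_cls)[y]) / T
        grad = 2 * delta - lam * g_ce    # gradient of the full objective
        delta -= lr * grad * mask        # update only inside the FCRs
    return np.clip(x + delta, 0.0, 1.0)

rng = np.random.default_rng(2)
W, b = rng.normal(size=(3, 8)), np.zeros(3)
x = rng.uniform(0, 1, size=8)
y = int(np.argmax(W @ x + b))            # the clean prediction
mask = np.zeros(8)
mask[:4] = 1.0                           # pretend the first 4 pixels are FCRs
x_adv = fcr_attack(x, y, W, b, mask)
print(np.allclose((x_adv - x)[4:], 0))   # Non-FCR pixels are untouched
```

The mask is the essential ingredient: the ℓ2 term keeps the perturbation small while the masked update guarantees sparsity over the Non-FCR region by construction.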

2. RELATED WORK

In many cases, CNNs are vulnerable to adversarial attacks, which has prompted extensive research in academia. Szegedy et al. (2013) used the constrained L-BFGS algorithm to craft adversarial ex-



Figure 1: We use Grad-CAM to obtain the heatmap of the image; the red framed area is the feature contributive regions considered in this work.
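A heatmap like the one in Figure 1 follows directly from the published Grad-CAM formula: each last-layer feature map A^k is weighted by its globally average-pooled gradient, summed, and passed through a ReLU. The sketch below applies that formula to dummy activation and gradient arrays (in practice these would come from a forward and backward pass through the CNN).

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM (Selvaraju et al., 2017): the weight alpha_k is the
    globally average-pooled gradient of the class score w.r.t. feature
    map A^k; the heatmap is ReLU(sum_k alpha_k * A^k)."""
    alphas = gradients.mean(axis=(1, 2))               # (K,) pooled weights
    cam = np.einsum('k,kij->ij', alphas, activations)  # weighted sum of maps
    cam = np.maximum(cam, 0)                           # ReLU keeps positive evidence
    return cam / cam.max() if cam.max() > 0 else cam   # normalize to [0, 1]

# Dummy last-conv-layer tensors: K=4 feature maps of spatial size 7x7.
rng = np.random.default_rng(3)
A = rng.uniform(0, 1, size=(4, 7, 7))   # activations
G = rng.normal(size=(4, 7, 7))          # gradients of the class score
heatmap = grad_cam(A, G)
print(heatmap.shape)                    # (7, 7)
```

The FCRs could then be obtained by thresholding this heatmap, for example keeping the region above a fixed fraction of its maximum; that selection rule is a hypothetical illustration, as the exact thresholding is not specified in this excerpt.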

