A UNIFIED APPROACH TO INTERPRETING AND BOOSTING ADVERSARIAL TRANSFERABILITY

Abstract

In this paper, we use the interaction inside adversarial perturbations to explain and boost adversarial transferability. We discover and prove a negative correlation between adversarial transferability and the interaction inside adversarial perturbations. This negative correlation is further verified through different DNNs with various inputs. Moreover, it can be regarded as a unified perspective for understanding current transferability-boosting methods. To this end, we prove that some classic methods of enhancing the transferability essentially decrease interactions inside adversarial perturbations. Based on this, we propose to directly penalize interactions during the attacking process, which significantly improves the adversarial transferability. Our code is available online.

1. INTRODUCTION

Adversarial examples of deep neural networks (DNNs) have attracted increasing attention in recent years (Ma et al., 2018; Madry et al., 2018; Wang et al., 2019; Ilyas et al., 2019; Duan et al., 2020; Wu et al., 2020b; Ma et al., 2021). Goodfellow et al. (2014) discovered the transferability of adversarial perturbations, and used perturbations generated on a source DNN to attack other target DNNs. Although many methods have been proposed to enhance the transferability of adversarial perturbations (Dong et al., 2018; Wu et al., 2018; 2020a), the essence of this improvement of the transferability is still unclear.

This paper considers the interaction inside adversarial perturbations as a new perspective to interpret adversarial transferability. Interactions inside adversarial perturbations are defined using the Shapley interaction index proposed in game theory (Michel & Marc, 1999; Shapley, 1953). Given an input sample x ∈ R^n, the adversarial attack aims to fool the DNN by adding an imperceptible perturbation δ ∈ R^n to x. Each unit in the perturbation map is termed a perturbation unit. Let φ_i denote the importance of the i-th perturbation unit δ_i to attacking; φ_i is implemented as the Shapley value, which will be explained later. The interaction between perturbation units δ_i and δ_j is defined as the change of the i-th unit's importance φ_i when the j-th unit is perturbed, w.r.t. the case when the j-th unit is not perturbed. If the perturbation δ_j on the j-th unit increases the importance φ_i of the i-th unit, there is a positive interaction between δ_i and δ_j; if the perturbation δ_j decreases the importance φ_i, it indicates a negative interaction.

In this paper, we discover and partially prove a clear negative correlation between the transferability and the interaction between adversarial perturbation units, i.e. adversarial perturbations with lower transferability tend to exhibit larger interactions between perturbation units.
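To make this definition concrete, the following sketch computes the exact Shapley interaction index between two players of a small toy set function (the toy function v and its values are our own illustrative assumptions, not the paper's model; the exact sum over all contexts is only tractable for small n):

```python
import math
from itertools import combinations

def shapley_interaction(v, n, i, j):
    """Exact Shapley interaction index between players i and j of the
    set function v over players {0, ..., n-1} (tractable only for small n)."""
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        # Shapley weight for contexts S of this size (i and j excluded).
        w = (math.factorial(size) * math.factorial(n - size - 2)
             / math.factorial(n - 1))
        for S in combinations(others, size):
            S = set(S)
            # Change of i's marginal contribution caused by adding j.
            total += w * (v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S))
    return total

# Toy set function: additive utilities plus a synergy term between units 0, 1.
vals = [1.0, 2.0, 0.5, -1.0]
def v(S):
    return sum(vals[k] for k in S) + (3.0 if {0, 1} <= S else 0.0)

print(shapley_interaction(v, 4, 0, 1))  # recovers the synergy, ~3.0
print(shapley_interaction(v, 4, 2, 3))  # purely additive pair, ~0.0
```

A positive result means the two units cooperate (each strengthens the other's contribution), matching the notion of positive interaction described above.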
We verify this correlation through both a theoretical proof and comparative studies. Furthermore, based on the correlation, we propose to penalize interactions during attacking to improve the transferability.

In fact, our research group led by Dr. Quanshi Zhang has proposed game-theoretic interactions, including interactions of different orders (Zhang et al., 2020) and multivariate interactions (Zhang et al., 2021c). As a basic metric, the interaction can be used to explain signal processing in trained DNNs from different perspectives. For example, we have built up a tree structure to explain the hierarchical interactions between words in NLP models (Zhang et al., 2021a). We have also used interactions to explain the generalization power of DNNs (Zhang et al., 2021b) and the utility of adversarial training (Ren et al., 2021). As an extension of this system of game-theoretic interactions, in this study we explain adversarial transferability based on interactions.

The background for investigating the correlation between adversarial transferability and the interaction is as follows. First, we prove that multi-step attacks usually generate perturbations with larger interactions than single-step attacks. Second, according to Xie et al. (2019), multi-step attacks tend to generate more over-fitted adversarial perturbations with lower transferability than single-step attacks. We consider that stronger interactions reflect more over-fitting towards the source DNN, which hurts adversarial transferability. In this way, we propose the hypothesis that the transferability and the interaction are negatively correlated.

• Comparative studies. We conduct comparative studies to verify this negative correlation through different DNNs.
• Unified explanation. The negative correlation provides a unified view to understand current transferability-boosting methods. We theoretically prove that some classic transferability-boosting methods (Dong et al., 2018; Wu et al., 2018; 2020a) essentially decrease interactions between perturbation units, which also verifies the hypothesis of the negative correlation.
• Boosting adversarial transferability. Based on the above findings, we propose a loss, namely the interaction loss, to decrease interactions between perturbation units during attacking, in order to enhance the adversarial transferability. The effectiveness of the interaction loss further supports the negative correlation between the adversarial transferability and the interaction inside adversarial perturbations. Furthermore, we also try to generate perturbations using only the interaction loss, without the loss for the classification task, and find that such perturbations still exhibit moderate adversarial transferability. Such perturbations may decrease interactions encoded by the DNN, thereby damaging the inference patterns of the input.

Our contributions are summarized as follows. (1) We reveal the negative correlation between the transferability and the interaction inside adversarial perturbations. (2) We provide a unified view to understand current transferability-boosting methods. (3) We propose a new loss to penalize interactions inside adversarial perturbations and enhance the adversarial transferability.
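A minimal sketch of how such an interaction penalty could be assembled is given below, assuming a Monte-Carlo estimator of the pairwise Shapley interaction (sampling the context size uniformly, then a uniform context of that size, reproduces the Shapley weighting in expectation). The toy set function v, the helper names, and the trade-off weight lam are all hypothetical; the actual interaction loss operates on a DNN's output over perturbation units:

```python
import random

def sampled_interaction(v, n, i, j, num_samples=200, seed=0):
    """Monte-Carlo estimate of the Shapley interaction between units i and j:
    draw a context size uniformly, then a uniform context S of that size, and
    average the change of i's marginal contribution caused by adding j."""
    rng = random.Random(seed)
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for _ in range(num_samples):
        size = rng.randint(0, len(others))
        S = set(rng.sample(others, size))
        total += v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
    return total / num_samples

def attack_objective(cls_loss, interaction, lam):
    """Objective to maximize during attacking: raise the classification loss
    while penalizing the (estimated) interaction, traded off by lam > 0."""
    return cls_loss - lam * interaction

# Toy set function with a known synergy of 3.0 between units 0 and 1.
vals = [1.0, 2.0, 0.5, -1.0]
def v(S):
    return sum(vals[k] for k in S) + (3.0 if {0, 1} <= S else 0.0)

est = sampled_interaction(v, 4, 0, 1)
print(attack_objective(cls_loss=2.0, interaction=est, lam=0.5))
```

Maximizing this objective pushes the perturbation to fool the source DNN while keeping interactions between perturbation units small, which is the intuition behind the interaction loss.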

2. RELATED WORK

Adversarial transferability. Attacking methods can be roughly divided into two categories, i.e. white-box attacks (Szegedy et al., 2013; Goodfellow et al., 2014; Papernot et al., 2016; Carlini & Wagner, 2017; Kurakin et al., 2017; Su et al., 2017; Madry et al., 2018) and black-box attacks (Liu et al., 2016; Papernot et al., 2017; Chen et al., 2017a; Bhagoji et al., 2018; Ilyas et al., 2018; Bai et al., 2020). A specific type of black-box attack is based on adversarial transferability (Dong et al., 2018; Wu et al., 2018; Xie et al., 2019; Wu et al., 2020a), which transfers adversarial perturbations generated on a surrogate/source DNN to a target DNN. Thus, some previous studies have focused on the transferability of adversarial attacks. Liu et al. (2016) demonstrated that non-targeted attacks were easy to transfer, while targeted attacks were difficult to transfer.



Wu et al. (2018) and Demontis et al. (2019) explored factors influencing the transferability, such as network architectures, model capacity, and gradient alignment. Several methods have been proposed to enhance the transferability of adversarial perturbations. The momentum iterative attack (MI Attack) (Dong et al., 2018) incorporated the momentum of gradients to boost the transferability. The variance-reduced attack (VR Attack) (Wu et al., 2018) used smoothed gradients to craft perturbations with high transferability. The diverse input attack (DI Attack) (Xie et al., 2019) applied adversarial attacks to randomly transformed input images, using random resizing and padding with a certain probability. The skip gradient method (SGM Attack) (Wu et al., 2020a) used gradients of skip connections to improve the transferability. Dong et al. (2019) proposed the translation-invariant attack (TI Attack) to evade robustly trained DNNs. Li et al.
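For concreteness, the momentum update underlying the MI Attack can be sketched as follows. This is an illustrative sketch only: the toy quadratic objective, the function names, and the step schedule are our own assumptions, not the referenced implementation:

```python
import numpy as np

def mi_fgsm(grad_fn, x, eps, steps=10, mu=1.0):
    """Sketch of the MI Attack update: accumulate an L1-normalized gradient
    momentum g, take signed steps, and project onto the L_inf eps-ball."""
    alpha = eps / steps
    g = np.zeros_like(x)
    x_adv = x.copy()
    for _ in range(steps):
        grad = grad_fn(x_adv)                      # gradient of the attack loss
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        x_adv = x_adv + alpha * np.sign(g)         # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # stay inside the eps-ball
    return x_adv

# Toy objective: maximize 0.5 * ||x - t||^2, whose gradient is (x - t).
t = np.array([1.0, -1.0, 0.5])
x0 = np.zeros(3)
adv = mi_fgsm(lambda z: z - t, x0, eps=0.1)
print(np.abs(adv - x0).max())  # perturbation stays bounded by eps
```

The momentum term stabilizes the update direction across iterations, which Dong et al. (2018) found to reduce over-fitting to the source DNN and thereby improve transferability.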

