A UNIFIED APPROACH TO INTERPRETING AND BOOSTING ADVERSARIAL TRANSFERABILITY

Abstract

In this paper, we use the interaction inside adversarial perturbations to explain and boost adversarial transferability. We discover and prove a negative correlation between adversarial transferability and the interaction inside adversarial perturbations. This negative correlation is further verified across different DNNs with various inputs. Moreover, it can be regarded as a unified perspective for understanding current transferability-boosting methods. To this end, we prove that some classic methods of enhancing the transferability essentially decrease interactions inside adversarial perturbations. Based on this, we propose to directly penalize interactions during the attacking process, which significantly improves adversarial transferability. Our code is available online.

1. INTRODUCTION

Adversarial examples of deep neural networks (DNNs) have attracted increasing attention in recent years (Ma et al., 2018; Madry et al., 2018; Wang et al., 2019; Ilyas et al., 2019; Duan et al., 2020; Wu et al., 2020b; Ma et al., 2021). Goodfellow et al. (2014) found the transferability of adversarial perturbations, and used perturbations generated on a source DNN to attack other target DNNs. Although many methods have been proposed to enhance the transferability of adversarial perturbations (Dong et al., 2018; Wu et al., 2018; 2020a), the essence of this improvement is still unclear.

This paper considers the interaction inside adversarial perturbations as a new perspective to interpret adversarial transferability. Interactions inside adversarial perturbations are defined using the Shapley interaction index proposed in game theory (Michel & Marc, 1999; Shapley, 1953). Given an input sample x ∈ R^n, the adversarial attack aims to fool the DNN by adding an imperceptible perturbation δ ∈ R^n to x. Each unit in the perturbation map is termed a perturbation unit. Let φ_i denote the importance of the i-th perturbation unit δ_i to the attack; φ_i is implemented as the Shapley value, which will be explained later. The interaction between perturbation units δ_i, δ_j is defined as the change of the i-th unit's importance φ_i when the j-th unit is perturbed, w.r.t. the case when the j-th unit is not perturbed. If the perturbation δ_j on the j-th unit increases the importance φ_i of the i-th unit, there is a positive interaction between δ_i and δ_j; if δ_j decreases φ_i, the interaction is negative. In this paper, we discover and partially prove a clear negative correlation between the transferability and the interaction between adversarial perturbation units, i.e. adversarial perturbations with lower transferability tend to exhibit larger interactions between perturbation units.
We verify this correlation through both theoretical proof and comparative studies. Furthermore, based on the correlation, we propose to penalize interactions during attacking to improve the transferability. In fact, our research group led by Dr. Quanshi Zhang has proposed game-theoretic interactions, including interactions of different orders (Zhang et al., 2020) and multivariate interactions (Zhang et al., 2021c). As a basic metric, the interaction can be used to explain signal processing in trained DNNs from different perspectives. For example, we have built a tree structure to explain the hierarchical interactions between words in NLP models (Zhang et al., 2021a). We have also used interactions to explain the generalization power of DNNs (Zhang et al., 2021b) and the utility of adversarial training (Ren et al., 2021). As an extension of this system of game-theoretic interactions, in this study we explain adversarial transferability based on interactions.

The background for investigating the correlation between adversarial transferability and the interaction is as follows. First, we prove that multi-step attacking usually generates perturbations with larger interactions than single-step attacking. Second, according to Xie et al. (2019), multi-step attacking tends to generate more over-fitted adversarial perturbations with lower transferability than single-step attacking. We consider that larger interactions reflect more over-fitting towards the source DNN, which hurts adversarial transferability. In this way, we propose the hypothesis that the transferability and the interaction are negatively correlated.

• Comparative studies are conducted to verify this negative correlation through different DNNs.
• Unified explanation. This negative correlation provides a unified view to understand current transferability-boosting methods. We theoretically prove that some classic transferability-boosting methods (Dong et al., 2018; Wu et al., 2018; 2020a) essentially decrease interactions between perturbation units, which also verifies the hypothesis of the negative correlation.
• Boosting adversarial transferability. Based on the above findings, we propose a loss, namely the interaction loss, to decrease interactions between perturbation units during attacking, in order to enhance adversarial transferability. The effectiveness of the interaction loss further supports the negative correlation between adversarial transferability and the interaction inside adversarial perturbations. Furthermore, we also try to generate perturbations using only the interaction loss, without the loss for the classification task. We find that such perturbations still exhibit moderate adversarial transferability. Such perturbations may decrease interactions encoded by the DNN, thereby damaging the inference patterns of the input.

Our contributions are summarized as follows. (1) We reveal the negative correlation between the transferability and the interaction inside adversarial perturbations. (2) We provide a unified view to understand current transferability-boosting methods. (3) We propose a new loss to penalize interactions inside adversarial perturbations and enhance adversarial transferability.

2. RELATED WORK

Adversarial transferability. Attacking methods can be roughly divided into two categories, i.e. white-box attacks (Szegedy et al., 2013; Goodfellow et al., 2014; Papernot et al., 2016; Carlini & Wagner, 2017; Kurakin et al., 2017; Su et al., 2017; Madry et al., 2018) and black-box attacks (Liu et al., 2016; Papernot et al., 2017; Chen et al., 2017a; Bhagoji et al., 2018; Ilyas et al., 2018; Bai et al., 2020). A specific type of black-box attack is based on adversarial transferability (Dong et al., 2018; Wu et al., 2018; Xie et al., 2019; Wu et al., 2020a), which transfers adversarial perturbations generated on a surrogate/source DNN to a target DNN. Thus, some previous studies focused on the transferability of adversarial attacks. Liu et al. (2016) demonstrated that non-targeted attacks were easy to transfer, while targeted attacks were difficult to transfer. Wu et al. (2018) and Demontis et al. (2019) explored factors influencing the transferability, such as network architectures, model capacity, and gradient alignment. Several methods have been proposed to enhance the transferability of adversarial perturbations. The momentum iterative attack (MI Attack) (Dong et al., 2018) incorporated the momentum of gradients to boost the transferability. The variance-reduced attack (VR Attack) (Wu et al., 2018) used smoothed gradients to craft perturbations with high transferability. The diversity input attack (DI Attack) (Xie et al., 2019) applied adversarial attacks to randomly transformed input images, which included random resizing and padding with a certain probability. The skip gradient method (SGM Attack) (Wu et al., 2020a) used the gradients of skip connections to improve the transferability. Dong et al. (2019) proposed the translation-invariant attack (TI Attack) to evade robustly trained DNNs. Li et al. (2020) used the dropout erosion and the skip connection erosion to improve the transferability.
In comparison, we explain the transferability based on game theory, and discover the negative correlation between the transferability and interactions as a unified explanation for some of the above methods.

Interaction. The interaction between input variables has been widely investigated. Michel & Marc (1999) proposed the Shapley interaction index based on the Shapley value (Shapley, 1953) in game theory. Daria Sorokina (2008) defined the interaction of K input variables of additive models. Scott Lundberg (2017) quantified interactions between each pair of input variables for tree-ensemble models. Some studies used interactions to analyze DNNs. Tsang et al. (2018) measured statistical interactions based on DNN weights. Murdoch et al. (2018) proposed to extract interactions in LSTMs by disambiguating information of different gates, and Singh et al. (2019) extended this method to CNNs. Jin et al. (2020) quantified the contextual independence of words to hierarchically explain LSTMs. Janizek et al. (2020) extended the method of Integrated Gradients (Sundararajan et al., 2017) to quantify pairwise interactions of input features based on the Hessian matrix, which required the DNN to replace the ReLU operation with the SoftPlus operation. Chen et al. (2020) extended the attribution in (Chen & Ji, 2020) to use the Shapley interaction index to generate hierarchical explanations for NLP tasks. In comparison, in this study, we use the Shapley interaction index to explain and improve the transferability of adversarial perturbations.

3. THE RELATIONSHIP BETWEEN TRANSFERABILITY AND INTERACTIONS

Preliminaries: the Shapley value. The Shapley value was first proposed in game theory (Shapley, 1953). Considering multiple players in a game, each player aims to win a high reward. The Shapley value is considered a unique and unbiased approach to fairly allocating the total reward gained by all players to each player (Weber, 1988). The Shapley value satisfies four desirable properties, i.e. linearity, dummy, symmetry, and efficiency (please see Appendix A.1 for details). Let Ω = {1, 2, . . . , n} denote the set of all players, and let v(·) denote the reward function; v(S) represents the reward obtained by a set of players S ⊆ Ω. The Shapley value φ(i|Ω) unbiasedly measures the contribution of the i-th player to the total reward gained by all players in Ω, satisfying $\sum_i \phi(i|\Omega) = v(\Omega) - v(\emptyset)$, and is given as follows.

$$\phi(i|\Omega) = \sum_{S \subseteq \Omega \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!} \big( v(S \cup \{i\}) - v(S) \big). \quad (1)$$

Adversarial attack. Given an input sample x ∈ [0, 1]^n with the true label y ∈ {1, 2, . . . , C}, we use h(x) ∈ R^C to denote the output of the DNN before the softmax layer. To simplify the story, in this study we mainly focus on the untargeted adversarial attack, whose goal is to add a human-imperceptible perturbation δ ∈ R^n to the sample x so that the DNN classifies the perturbed sample x′ = x + δ into an incorrect category, i.e. $\arg\max_{y'} h_{y'}(x') \neq y$. The objective of adversarial attacking is usually formulated as follows.

$$\max_{\delta} \ \ell(h(x+\delta), y) \quad \text{s.t.} \ \|\delta\|_p \le \epsilon, \ x+\delta \in [0,1]^n, \quad (2)$$

where ℓ(h(x + δ), y) denotes the classification loss, and ε is a constant for the norm constraint. Please see Appendix C for technical details of solving Equation (2).
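As a concrete illustration, the Shapley value in Equation (1) can be computed exactly by enumerating all coalitions of a small toy game. The sketch below is only for intuition; the 3-player reward function is a hypothetical example, not the paper's attack utility.

```python
import itertools
import math

def shapley_values(n, v):
    """Exact Shapley values via Equation (1): for each player i, average the
    marginal contribution v(S ∪ {i}) - v(S) over all coalitions S of the
    remaining players, weighted by |S|! (n - |S| - 1)! / n!."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            for S in itertools.combinations(others, k):
                S = frozenset(S)
                w = math.factorial(len(S)) * math.factorial(n - len(S) - 1) / math.factorial(n)
                phi[i] += w * (v(S | {i}) - v(S))
    return phi

# Hypothetical 3-player reward: each player i contributes i + 1 on its own,
# and players 0 and 1 earn a bonus of 10 when both are present.
def v(S):
    return sum(i + 1 for i in S) + (10 if {0, 1} <= S else 0)

phi = shapley_values(3, v)
# Efficiency: the Shapley values sum to v(Ω) - v(∅); the bonus splits
# evenly between players 0 and 1, giving phi = [6, 7, 3].
assert abs(sum(phi) - (v(frozenset({0, 1, 2})) - v(frozenset()))) < 1e-9
```

The exhaustive enumeration costs O(2^n) utility evaluations per player, which is why the paper later relies on the simplification in Equation (4) and on sampling.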

3.1. THEORETICAL UNDERSTANDING OF THE ADVERSARIAL ATTACK IN GAME THEORY.

In adversarial attacking, given the perturbation δ ∈ R^n, we use Ω = {1, 2, . . . , n} to denote all units/dimensions in the perturbation. We use the Shapley value in Equation (1) to measure the contribution of each perturbation unit i ∈ Ω to the attack. To this end, we need to define the utility of a subset of perturbation units S ⊆ Ω for attacking, which, according to Equation (2), can be formulated as

$$v(S) = \max_{y' \neq y} h_{y'}(x + \delta^{(S)}) - h_y(x + \delta^{(S)}),$$

where $h_{y}(\cdot)$ is the value of the y-th element of h(·) ∈ R^C, and $\delta^{(S)} \in \mathbb{R}^n$ is the perturbation that only contains perturbation units in S, i.e. $\forall i \in S, \delta^{(S)}_i = \delta_i$; $\forall i \notin S, \delta^{(S)}_i = 0$. In this way, $v(\Omega) = \max_{y' \neq y} h_{y'}(x+\delta) - h_y(x+\delta)$ denotes the utility of all perturbation units, and $v(\emptyset) = \max_{y' \neq y} h_{y'}(x) - h_y(x)$ denotes the baseline score without perturbations. Thus, the overall contribution of perturbation units can be measured as v(Ω) − v(∅). We apply the Shapley value in Equation (1) to assign the overall contribution to each perturbation unit, s.t. $\sum_i \phi(i|\Omega) = v(\Omega) - v(\emptyset)$, where φ(i|Ω) denotes the contribution of the i-th perturbation unit.

Interactions. Perturbation units do not contribute to the adversarial utility independently. For example, perturbation units may form a certain pattern, e.g. an edge in the image; perturbation units in the edge must appear together, and the absence of a few units may invalidate the pattern. Let us consider two perturbation units i, j. According to (Michel & Marc, 1999), the Shapley interaction index between units i, j is defined as the following additional contribution.

$$I_{ij}(\delta) = \phi(S_{ij}|\Omega') - \big[ \phi(i|\Omega \setminus \{j\}) + \phi(j|\Omega \setminus \{i\}) \big], \quad (3)$$

where φ(i|Ω \ {j}) and φ(j|Ω \ {i}) represent the individual contributions of units i and j, respectively, when the perturbation units i, j work individually. Note that φ(i|Ω \ {j}) is computed in the scenario where the unit j is always absent.
In this case, $\sum_{i' \in \Omega \setminus \{j\}} \phi(i'|\Omega \setminus \{j\}) = v(\Omega \setminus \{j\}) - v(\emptyset)$, due to the absence of perturbation unit j. φ(S_ij|Ω′) denotes the joint contribution of i, j when perturbation units i, j are regarded as a singleton unit S_ij = {i, j}. In this case, units i, j are supposed to be always perturbed or not perturbed simultaneously, and we can consider that there are only n − 1 players in the game. Thus, the set of all perturbation units is considered as Ω′ = Ω \ {i, j} ∪ {S_ij}. The joint contribution of S_ij is denoted by φ(S_ij|Ω′), s.t. $\sum_{i' \in \Omega' \setminus \{S_{ij}\}} \phi(i'|\Omega') + \phi(S_{ij}|\Omega') = v(\Omega') - v(\emptyset)$. The interaction defined in Equation (3) is equivalent to the change of the i-th unit's importance φ_i when the unit j exists, w.r.t. the case when the unit j is absent. Please see Appendix D for details. If I_ij(δ) > 0, δ_i and δ_j cooperate with each other, i.e. the interaction is positive; if I_ij(δ) < 0, δ_i and δ_j conflict with each other, i.e. the interaction is negative. The absolute value |I_ij(δ)| indicates the interaction strength. The interaction is symmetric, i.e. I_ij(δ) = I_ji(δ).

We are given an input sample x ∈ R^n and a DNN h(·) trained for classification. With the definition of interactions, in adversarial attacking, we have the following proposition.

Proposition 1. (Proof in Appendix E) The adversarial perturbation generated by the multi-step attack via gradient descent is given as $\delta^{m}_{\text{multi}} = \alpha \sum_{t=0}^{m-1} \nabla_x \ell(h(x + \delta^{t}_{\text{multi}}), y)$, where $\delta^{t}_{\text{multi}}$ denotes the perturbation after the t-th step of updating, and m is the total number of steps. The adversarial perturbation generated by the single-step attack is given as $\delta_{\text{single}} = \alpha m \nabla_x \ell(h(x), y)$. Then, the expectation of interactions between perturbation units in $\delta^{m}_{\text{multi}}$ is larger than that in $\delta_{\text{single}}$, i.e. $\mathbb{E}_{a,b}[I_{ab}(\delta^{m}_{\text{multi}})] \ge \mathbb{E}_{a,b}[I_{ab}(\delta_{\text{single}})]$.
Note that when we compare interactions inside different perturbations, the magnitudes of these perturbations should be similar, because comparing interactions between adversarial perturbations of different magnitudes is not fair. Therefore, we use the step size αm in the single-step attack to roughly (not exactly) balance the magnitudes of perturbations. The fairness is further discussed in Appendix E.1. Proposition 1 shows that, in general, adversarial perturbations generated by the multi-step attack tend to exhibit larger interactions than those generated by the single-step attack. In addition, Appendix E.4 shows that the multi-step attack usually generates perturbations with larger interactions than noisy perturbations of the same magnitude. Besides, Xie et al. (2019) demonstrated that the multi-step attack tends to over-fit the source DNN, which leads to low transferability. Intuitively, large interactions indicate a strong cooperative relationship between perturbation units, which suggests significant over-fitting of the adversarial perturbation to the source DNN. In this way, we propose the hypothesis that the adversarial transferability and the interactions inside adversarial perturbations are negatively correlated.
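For intuition, the pairwise index in Equation (3) can be evaluated exactly on a small toy game using its standard closed form from (Michel & Marc, 1999), a weighted average of discrete second-order differences, which is equivalent to Equation (3). The utility below is a hypothetical stand-in for the attack utility v.

```python
import itertools
import math

def interaction_index(n, v, i, j):
    """Pairwise Shapley interaction index (closed form equivalent to
    Equation (3)): a weighted average, over coalitions S excluding i and j,
    of v(S ∪ {i,j}) - v(S ∪ {i}) - v(S ∪ {j}) + v(S)."""
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for k in range(len(others) + 1):
        for S in itertools.combinations(others, k):
            S = frozenset(S)
            w = math.factorial(len(S)) * math.factorial(n - len(S) - 2) / math.factorial(n - 1)
            total += w * (v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S))
    return total

# Hypothetical utility: additive contributions plus a bonus of 10 that only
# appears when units 0 and 1 are present together (a cooperating "pattern").
def v(S):
    return sum(i + 1 for i in S) + (10 if {0, 1} <= S else 0)

I01 = interaction_index(3, v, 0, 1)  # positive: units 0 and 1 cooperate
I02 = interaction_index(3, v, 0, 2)  # zero: no pattern links units 0 and 2
```

Because the pair {0, 1} jointly carries the entire bonus, I01 recovers exactly 10, while pairs without a shared pattern get an interaction of zero, matching the sign convention described above.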

3.2. EMPIRICAL VERIFICATION OF THE NEGATIVE CORRELATION

To verify the negative correlation between the transferability and interactions, we conduct experiments to examine whether adversarial perturbations with low transferability tend to exhibit larger interactions than perturbations with high transferability. Given a source DNN and an input sample x, we generate the adversarial example x′ = x + δ. Then, given a target DNN h^(t), we measure the transfer utility of δ as

$$\text{Transfer Utility} = \big[\max_{y' \neq y} h^{(t)}_{y'}(x+\delta) - h^{(t)}_{y}(x+\delta)\big] - \big[\max_{y' \neq y} h^{(t)}_{y'}(x) - h^{(t)}_{y}(x)\big],$$

as mentioned in Section 3.1. The interaction is given as Interaction = E_{i,j}[I_{ij}(δ)], which is computed on the source DNN. Note that computing I_{ij}(δ) exactly is NP-hard. However, we prove that the computation of the average interaction over all pairs of units can be simplified as follows, which significantly reduces the computational cost (please see Appendix F for the proof).

$$\mathbb{E}_{i,j}[I_{ij}(\delta)] = \frac{1}{n-1} \mathbb{E}_i \big[ v(\Omega) - v(\Omega \setminus \{i\}) - v(\{i\}) + v(\emptyset) \big]. \quad (4)$$

Using 50 images randomly sampled from the validation set of the ImageNet dataset (Russakovsky et al., 2015), we generate adversarial perturbations on four types of DNNs, including ResNet-34/152 (RN-34/152) (He et al., 2016) and DenseNet-121/201 (DN-121/201) (Huang et al., 2017). We transfer adversarial perturbations generated on each ResNet to DenseNets, and similarly transfer adversarial perturbations generated on each DenseNet to ResNets.

Figure 1 (caption): The negative correlation between the transfer utility and the interaction. The correlation is computed as the Pearson correlation. The blue shade in each subfigure represents the 95% confidence interval of the linear regression.

Figure 1 shows the negative correlation between the transfer utility and the interaction. Each subfigure corresponds to a specific pair of source DNN and target DNN. In each subfigure, each point represents the average transfer utility and the average interaction of adversarial perturbations over all testing images.
Different points represent the average interaction and the average transfer utility computed using different hyper-parameters. Given an input image x, adversarial perturbations are generated by solving the relaxed form of Equation (2) via gradient descent, i.e.

$$\min_{\delta} \ -\ell(h(x+\delta), y) + c \cdot \|\delta\|_p^p \quad \text{s.t.} \ x+\delta \in [0,1]^n,$$

where c ∈ R is a scalar. In this way, we gradually change the value of c and set different values of p as different hyper-parameters to generate different adversarial perturbations, thereby drawing different points in each subfigure. Fair comparisons require adversarial perturbations generated with different hyper-parameters c to be comparable with each other. Thus, we select a constant τ and take ‖δ‖₂ = τ as the stopping criterion of all adversarial attacks. Please see Appendix G for more details.
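The simplification in Equation (4) can be checked numerically on a toy game: the average pairwise interaction (left-hand side, exponentially many utility calls) equals the right-hand side, which needs only v(Ω), v(Ω \ {i}), v({i}), and v(∅). The 4-player utility below is again a hypothetical stand-in.

```python
import itertools
import math

def interaction_index(n, v, i, j):
    """Pairwise Shapley interaction index (closed form of Equation (3))."""
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for k in range(len(others) + 1):
        for S in itertools.combinations(others, k):
            S = frozenset(S)
            w = math.factorial(len(S)) * math.factorial(n - len(S) - 2) / math.factorial(n - 1)
            total += w * (v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S))
    return total

# Hypothetical 4-player utility with two overlapping pairwise "patterns".
def v(S):
    return sum(i + 1 for i in S) + (10 if {0, 1} <= S else 0) + (4 if {1, 2} <= S else 0)

n = 4
pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
lhs = sum(interaction_index(n, v, i, j) for i, j in pairs) / len(pairs)

full = frozenset(range(n))
# Right-hand side of Equation (4): only O(n) utility evaluations.
rhs = sum(v(full) - v(full - {i}) - v(frozenset({i})) + v(frozenset())
          for i in range(n)) / (n * (n - 1))
assert abs(lhs - rhs) < 1e-9
```

Both sides evaluate to 7/3 here: the two pattern bonuses (10 and 4) appear twice each among the 12 ordered pairs, and the same total is recovered from the n marginal terms on the right.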

4. UNIFIED UNDERSTANDING OF TRANSFERABILITY-BOOSTING ATTACKS

In this section, we prove that some classical methods of improving the adversarial transferability essentially decrease interactions between perturbation units, although these methods were not originally designed to decrease the interaction. Without loss of generality, let us be given an input sample x ∈ R^n and a DNN h(·) trained for classification.

• VR Attack (Wu et al., 2018) smooths the classification loss with Gaussian noise during attacking. In the VR Attack, the gradient of the input sample is computed as $g^t = \mathbb{E}_{\xi \sim \mathcal{N}(0, \sigma^2 I)}[\nabla_x \ell(h(x + \delta^t + \xi), y)]$. The following proposition proves that the VR Attack is prone to decrease interactions between perturbation units.

Proposition 2. (Proof in Appendix H) The adversarial perturbation generated by the multi-step attack is given as $\delta^m_{\text{multi}} = \alpha \sum_{t=0}^{m-1} \nabla_x \ell(h(x + \delta^t_{\text{multi}}), y)$. The adversarial perturbation generated by the VR Attack is computed as $\delta^m_{\text{vr}} = \alpha \sum_{t=0}^{m-1} \nabla_x \hat{\ell}(h(x + \delta^t_{\text{vr}}), y)$, where $\hat{\ell}(h(x + \delta^t_{\text{vr}}), y) = \mathbb{E}_{\xi \sim \mathcal{N}(0, \sigma^2 I)}[\ell(h(x + \delta^t_{\text{vr}} + \xi), y)]$. Perturbation units of $\delta^m_{\text{vr}}$ tend to exhibit smaller interactions than those of $\delta^m_{\text{multi}}$, i.e. $\mathbb{E}_x \mathbb{E}_{a,b}[I_{ab}(\delta^m_{\text{vr}})] \le \mathbb{E}_x \mathbb{E}_{a,b}[I_{ab}(\delta^m_{\text{multi}})]$.

Besides the theoretical proof, we also conduct experiments to compare interactions of perturbation units generated by the baseline multi-step attack (implemented as in (Madry et al., 2018)) with those generated by the VR Attack. Table 5 shows that the VR Attack exhibits lower interactions between perturbation units than the baseline multi-step attack.

• MI Attack (Dong et al., 2018) incorporates the momentum of gradients when updating the adversarial perturbation. In the MI Attack, the gradient used in step t is computed as $g^t = \mu \cdot g^{t-1} + \nabla_x \ell(h(x + \delta^{t-1}), y) / \|\nabla_x \ell(h(x + \delta^{t-1}), y)\|_1$. Note that the original MI Attack and the multi-step attack cannot be directly compared, since the magnitudes of the generated perturbations cannot be fairly controlled.
The values of interactions are sensitive to the magnitude of perturbations, and comparing perturbations with different magnitudes is not fair. Thus, we slightly revise the MI Attack as $\forall t > 0, \ g^t_{\text{mi}} = \mu g^{t-1}_{\text{mi}} + (1-\mu)\nabla_x \ell(h(x + \delta^{t-1}_{\text{mi}}), y)$; $g^0_{\text{mi}} = 0$, where μ = (t − 1)/t. We investigate the interactions of adversarial perturbations generated by the original multi-step attack and the revised MI Attack, and prove the following proposition, which shows that the MI Attack decreases the interaction between perturbation units in most cases.

Proposition 3. (Proof in Appendix I) The adversarial perturbation generated by the multi-step attack is given as $\delta^m_{\text{multi}} = \alpha \sum_{t=0}^{m-1} \nabla_x \ell(h(x + \delta^t_{\text{multi}}), y)$. The adversarial perturbation generated by the multi-step attack incorporating the momentum is computed as $\delta^m_{\text{mi}} = \alpha \sum_{t=0}^{m-1} g^t_{\text{mi}}$. Perturbation units of $\delta^m_{\text{mi}}$ exhibit smaller interactions than those of $\delta^m_{\text{multi}}$, i.e. $\mathbb{E}_{a,b}[I_{ab}(\delta^m_{\text{mi}})] \le \mathbb{E}_{a,b}[I_{ab}(\delta^m_{\text{multi}})]$.

• SGM Attack (Wu et al., 2020a) exploits the gradient information of the skip connections in ResNets to improve the transferability of adversarial perturbations. The SGM Attack revises the gradient in the backpropagation, which can be regarded as adding a specific dropout operation to the backpropagation. Zhang et al. (2021b) proved that the dropout operation can decrease the significance of interactions, so as to decrease the significance of over-fitting in DNNs; this also proves that the SGM Attack decreases interactions between perturbation units. Besides the theoretical proof, we also conduct experiments to compare interactions of perturbation units generated by the baseline multi-step attack (implemented as in Madry et al. (2018)) with those generated by the SGM Attack. Table 5 shows that the SGM Attack exhibits lower interactions than the baseline multi-step attack.
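The revised momentum above has a simple closed form: with μ = (t − 1)/t and g⁰ = 0, each update mixes in the new gradient with weight 1/t, so gᵗ is exactly the running mean of the first t gradients, which keeps the accumulated perturbation magnitude comparable to the plain multi-step attack. A minimal numeric check, with random vectors standing in for ∇_x ℓ:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the per-step gradients ∇_x ℓ(h(x + δ^{t-1}), y), t = 1..5.
grads = [rng.standard_normal(4) for _ in range(5)]

g = np.zeros(4)  # g^0_mi = 0
for t, grad in enumerate(grads, start=1):
    mu = (t - 1) / t
    g = mu * g + (1 - mu) * grad  # revised MI update with μ = (t-1)/t

# g^t_mi equals the arithmetic mean of the first t gradients.
assert np.allclose(g, np.mean(np.stack(grads), axis=0))
```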

5. THE INTERACTION LOSS FOR TRANSFERABILITY ENHANCEMENT

Interaction loss. Based on the findings in previous sections, we propose a loss to directly penalize interactions during attacking, in order to improve the transferability of adversarial perturbations. Based on Equation (2), we jointly optimize the classification loss and the interaction loss to generate adversarial perturbations. This method is termed the interaction-reduced attack (IR Attack).

$$\max_{\delta} \ \big[ \ell(h(x+\delta), y) - \lambda \ell_{\text{interaction}} \big], \quad \ell_{\text{interaction}} = \mathbb{E}_{i,j}[I_{ij}(\delta)] \quad \text{s.t.} \ \|\delta\|_p \le \epsilon, \ x+\delta \in [0,1]^n, \quad (5)$$

where $\ell_{\text{interaction}}$ is the interaction loss, and λ is a constant weight for the interaction loss. Although the computation of the interaction loss can be simplified according to Equation (4), its computational cost is still intolerable when the dimension of images is high. Therefore, as a trade-off between accuracy and computational cost, we divide the input image into 16 × 16 grids, and measure and penalize interactions at the grid level instead of the pixel level. Moreover, we apply an efficient sampling method to approximate the expectation operation in the computation of interactions in Equation (4). Figure 2 visualizes interactions between adjacent perturbation units at the grid level, generated with and without the interaction loss.

Experiments. For implementation, we generated adversarial perturbations on six different source DNNs, including AlexNet (Krizhevsky et al., 2012), VGG-16 (Simonyan & Zisserman, 2015), ResNet-34/152 (RN-34/152) (He et al., 2016), and DenseNet-121/201 (DN-121/201) (Huang et al., 2017). For each source DNN, we tested the transferability of the generated perturbations on seven target DNNs, including VGG-16, ResNet-152 (RN-152), DenseNet-201 (DN-201), SENet-154 (SE-154) (Hu et al., 2018), InceptionV3 (IncV3) (Szegedy et al., 2016), InceptionV4 (IncV4) (Szegedy et al., 2017), and Inception-ResNetV2 (IncResV2) (Szegedy et al., 2017).
In addition, three state-of-the-art DNNs, including the Dual-Path-Network (DPN-68) (Chen et al., 2017b), the NASNet-LARGE (NASN-L) (Zoph et al., 2018), and the Progressive NASNet (PNASN) (Liu et al., 2018), were used as target DNNs to evaluate the ensemble source model (introduced in the next paragraph). Besides the unsecured target DNNs mentioned above, we also used three secured target models for testing, which were learned via ensemble adversarial training: IncV3ens3 (ensemble of three IncV3 networks), IncV3ens4 (ensemble of four IncV3 networks), and IncResV2ens3 (ensemble of three IncResV2 networks), which were released by Tramèr et al. (2017).

Figure 2 (caption): Interactions between adjacent perturbation units at the grid level; [i] ∝ E_{j∈N_i}[I_{ij}(δ)], where N_i denotes the set of adjacent perturbation units of the perturbation unit i. Here, we ignore interactions between non-adjacent units to simplify the visualization, because adjacent units usually encode much more significant interactions than other units. The interaction loss forces the perturbation to encode more negative interactions.

Ensemble source model: Besides the above adversarial transferring from a single source model, we also conducted the proposed IR Attack in the scenario of ensemble-based attacking (Liu et al., 2016), in order to generate adversarial perturbations on the ensemble of source models.

Baselines. The first baseline method, the PGD Attack (Madry et al., 2018), directly solved Equation (2) and was widely used for adversarial attacks. Besides this baseline attack, the other four baselines were the MI Attack (Dong et al., 2018), the VR Attack (Wu et al., 2018), the SGM Attack (Wu et al., 2020a), and the TI Attack (Dong et al., 2019). Our method was implemented according to Equation (5), namely the IR Attack. Because the SGM Attack was one of the top-ranked methods of boosting the adversarial transferability, we further added the interaction loss $\ell_{\text{interaction}}$ to the SGM Attack as another implementation of our method (namely the SGM+IR Attack).
We also used the interaction loss to boost the performance of the MI Attack and the VR Attack (namely the MI+IR Attack and the VR+IR Attack, respectively). Please see Appendix M.1 for details. Moreover, as Section 4 states, the MI Attack, the VR Attack, and the SGM Attack also decrease interactions during attacking. Thus, we combined the IR Attack with all these interaction-reducing techniques as a new implementation of our method, namely the HybridIR Attack.

All attacks were conducted with 100 steps on 1000 randomly selected images from the validation set of the ImageNet dataset. We set ε = 16/255 for the L∞ attack, and set $\epsilon = 16\sqrt{n}/255$, following the setting in (Dong et al., 2018), for the L2 attack. The step size was set to 2/255 for all attacks. Considering the efficiency of signal processing in DNNs of different depths, we set λ = 1 for the IR Attack when the source DNN was a ResNet, and λ = 2 for other source DNNs. To enable fair comparisons, the transferability of each baseline was computed based on the best adversarial perturbation during the 100 steps via leave-one-out (LOO) validation. Please see Appendix K for the motivation and the evidence of the LOO evaluation of transferability. All attacks were conducted with three different random samplings of grids or different initial perturbations.

Table 1 reports the success rates of the baseline attack (PGD (Madry et al., 2018)) and the IR Attack, namely PGD L∞+IR for L∞ attacks and PGD L2+IR for L2 attacks. Compared with the baseline attack, the transferability was significantly improved by the interaction loss on various source models against different target models. Let us focus on the L∞ attack. For most source and target models, the transferability enhancement brought by the interaction loss was more than 10%. In particular, when the source DNN and the target DNN were DN-201 and IncV4, respectively, the baseline attack achieved a transferability of 36.5%.
With the interaction loss, the transferability was improved to 63.7% (a gain of more than 27%). As Table 2 shows, in most cases, the IR Attack on the ensemble model generated more transferable perturbations than the PGD Attack. Besides, as Table 3 shows, our interaction loss also improved the transferability against the secured target DNNs. Such improvement further verified the negative correlation between transferability and interactions. Note that we did not use the LOO validation in Table 3, in order to keep the experimental settings in this table consistent with the evaluation used by Tramèr et al. (2017).

Table 4 shows the improvement of the transferability obtained by the interaction loss on other attacking methods. The interaction loss could further boost the transferability of state-of-the-art transfer attacks. Without the interaction loss, the highest transferability achieved by the SGM Attack against IncResV2 was 68.8% (when the source was DN-201). When the interaction loss was added, the transferability was improved to 81.5% (a gain of more than 12%). Moreover, the HybridIR Attack, which combined all methods of reducing interactions, improved success rates from the range of 54.6%∼98.8% to the range of 70.2%∼99.1%.

We can understand the behavior of the proposed interaction loss as follows. Different methods generate adversarial perturbations in different manifolds, thereby exhibiting different transferability. Starting from the current perturbation, the interaction loss points out a local optimization direction that further decreases interactions, and thus further boosts the transferability. To further demonstrate the broad applicability of the interaction loss, besides untargeted attacks on the ImageNet dataset, we also conducted targeted attacks on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009).
Experimental results consistently showed that the adversarial transferability can be enhanced by reducing interactions in targeted attacks. Please see Appendix M.2 for details.

Effects of the interaction loss. We tested the transferability of perturbations generated by the IR Attack with different weights λ of the interaction loss. In particular, the baseline attack (PGD) can be considered as the IR Attack with λ = 0. We conducted attacks on two source DNNs (RN-34, DN-121), and transferred adversarial perturbations to seven target DNNs (VGG-16, RN-152, DN-201, SE-154, IncV3, IncV4, IncResV2). The attacks were conducted with 100 steps on ImageNet validation images. Figure 3(a) shows the black-box success rates with different values of λ. The transferability of the IR Attack increased along with the weight λ.

Attack only with the interaction loss. To further understand the effects of the interaction loss, we generated perturbations by exclusively using the interaction loss (without the classification loss). We used RN-34 and DN-121 as source DNNs and tested the transferability on seven target DNNs. The attacks were conducted with 100 steps on ImageNet validation images. Figure 3(b) shows the curve of the transferability at different steps. We compared such adversarial perturbations with noise perturbations generated as ε · sign(noise), where noise ∼ N(0, σ²I) and ε = 16/255, the same value as used in the L∞ attack. We found that perturbations generated by only using the interaction loss still exhibited moderate adversarial transferability. This phenomenon may be explained as follows: such perturbations decrease most interactions encoded by the DNN, thereby damaging the inference patterns in the input image.
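The grid-level sampling described in this section can be sketched as follows: Equation (4) reduces the interaction loss to an expectation over single units i, which is estimated by Monte-Carlo sampling of grid cells. The sketch below uses a hypothetical closed-form utility over 16 cells so the true value is known; in the actual attack, v would query the source DNN with masked perturbations, and the estimate would be differentiated w.r.t. δ.

```python
import numpy as np

def sampled_interaction(v, n, num_samples, rng):
    """Monte-Carlo estimate of E_{i,j}[I_ij(δ)] via Equation (4): sample grid
    cells i uniformly, average v(Ω) - v(Ω \ {i}) - v({i}) + v(∅), and
    divide by n - 1."""
    full = np.ones(n, dtype=bool)
    empty = np.zeros(n, dtype=bool)
    total = 0.0
    for i in rng.integers(0, n, size=num_samples):
        without_i = full.copy()
        without_i[i] = False
        only_i = empty.copy()
        only_i[i] = True
        total += v(full) - v(without_i) - v(only_i) + v(empty)
    return total / (num_samples * (n - 1))

# Hypothetical utility over 16 grid cells: an additive part plus a pairwise
# bonus of 2 between cells 0 and 1, so the true average interaction is known.
w = np.linspace(0.1, 1.6, 16)
def v(mask):
    return float(w @ mask.astype(float)) + (2.0 if mask[0] and mask[1] else 0.0)

rng = np.random.default_rng(0)
est = sampled_interaction(v, 16, 4000, rng)
# True value: only i ∈ {0, 1} contribute 2 each, so E_i[...] = 4/16 and the
# average interaction is (4/16) / 15 ≈ 0.0167.
assert abs(est - (4 / 16) / 15) < 0.005
```

Each sample costs four utility evaluations regardless of n, which is what makes the grid-level interaction loss affordable inside a 100-step attack.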

6. CONCLUSION

In this paper, we have analyzed the transferability of adversarial perturbations from the perspective of interactions based on game theory. We have proved that the multi-step attack tends to generate adversarial perturbations with large interactions. We have discovered and partially proved the negative correlation between the transferability and interactions inside adversarial perturbations, i.e. adversarial perturbations with higher transferability usually exhibit more negative interactions. We have proved that some classical methods of enhancing the transferability essentially decrease interactions between perturbation units, which provides a unified view to understand the enhancement of transferability. Moreover, we have proposed a new loss to directly penalize interactions between perturbation units during attacking, which significantly improves the transferability of previous methods. Furthermore, we have found that adversarial perturbations generated only using the interaction loss, without the classification loss, still exhibit moderate transferability, which provides a new perspective to understand the transferability of adversarial perturbations.

A MOTIVATIONS FOR USING THE SHAPLEY INTERACTION INDEX

In this section, we discuss the motivation for using the Shapley interaction index to define the interaction.

A.1 FOUR PROPERTIES OF SHAPLEY VALUES

Let Ω = {1, 2, . . . , n} denote the set of all players, and let v denote the reward function. Without ambiguity, we use φ(i|Ω) to denote the Shapley value of the player i in the game with all players Ω and reward function v, which is given as follows:

φ(i|Ω) = Σ_{S ⊆ Ω\{i}} [|S|! (n - |S| - 1)! / n!] · (v(S ∪ {i}) - v(S)).

The Shapley value satisfies the following four properties (Weber, 1988):
• Linearity property: Consider two games with reward functions v and w, i.e. v(S) and w(S) measure the reward obtained by the players in S in these two games. Let φ_v(i|Ω) and φ_w(i|Ω) denote the Shapley values of the player i in the game v and the game w, respectively. If these two games are combined into a new game whose reward function is reward(S) = v(S) + w(S), then the Shapley value of each player i in Ω becomes φ_{v+w}(i|Ω) = φ_v(i|Ω) + φ_w(i|Ω).
• Dummy property: A player i ∈ Ω is referred to as a dummy player if ∀S ⊆ Ω\{i}, v(S ∪ {i}) = v(S) + v({i}). In this case, φ(i|Ω) = v({i}) - v(∅), which means that the player i plays the game independently.
• Symmetry property: If ∀S ⊆ Ω\{i, j}, v(S ∪ {i}) = v(S ∪ {j}), then the Shapley values of players i and j are equal, i.e. φ(i|Ω) = φ(j|Ω).
• Efficiency property: The sum of the Shapley values of all players equals the reward won by the full coalition Ω, i.e. Σ_i φ(i|Ω) = v(Ω) - v(∅). This property guarantees that the overall reward can be allocated to each player in the game.
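To make the definition concrete, the exact Shapley value can be computed by enumerating all subsets for a small toy game. The sketch below is illustrative (the player count and the random reward table are assumptions, not from the paper) and numerically checks the efficiency property:

```python
import itertools, math, random

def shapley_values(n, v):
    """Exact Shapley values phi(i|Omega) for players {0..n-1} and reward v(frozenset)."""
    phi = [0.0] * n
    for i in range(n):
        others = [p for p in range(n) if p != i]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                S = frozenset(S)
                # Weight |S|! (n - |S| - 1)! / n! of the marginal contribution of i to S
                w = math.factorial(len(S)) * math.factorial(n - len(S) - 1) / math.factorial(n)
                phi[i] += w * (v(S | {i}) - v(S))
    return phi

random.seed(0)
n = 5
table = {frozenset(S): random.random()
         for r in range(n + 1) for S in itertools.combinations(range(n), r)}
v = lambda S: table[frozenset(S)]
phi = shapley_values(n, v)
# Efficiency property: sum_i phi_i = v(Omega) - v(empty set)
assert abs(sum(phi) - (v(frozenset(range(n))) - v(frozenset()))) < 1e-9
```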

A.2 MOTIVATIONS

Theoretical rigor. We use the Shapley interaction index defined based on the Shapley value, because the Shapley value has a solid theoretical foundation in game theory: it is the unique attribution satisfying the above four desirable axioms. Whether the metric depends on network architectures. Because adversarial transferability is a general property of the attack, a convincing metric for adversarial transferability should not be directly tied to the network architecture. To this end, the computation of the interaction defined on the Shapley value does not depend on the network architecture. In comparison, previous definitions of the interaction are usually oriented to specific model architectures. For example, the interaction proposed by Tsang et al. (2018) requires the DNN to be fully-connected. The two interaction metrics proposed by Murdoch et al. (2018) and Jin et al. (2020) are designed for LSTMs. The Hessian-based interaction (Janizek et al., 2020) requires the DNN to replace the ReLU operation with the softplus operation. Computational cost. The computational cost of the Shapley-based interaction-reduction loss is relatively low. Thanks to the efficiency axiom of the Shapley value, we prove that the time cost of computing the interaction loss ℓ_interaction = 1/(n-1) · E_i[v(Ω) - v(Ω\{i}) - v({i}) + v(∅)] is linear, i.e. O(n), where n is the dimension of features. The linear complexity makes it possible to apply the interaction to high-dimensional data and deep neural networks. In contrast, the complexity of computing all possible pairwise interactions defined in (Daria Sorokina, 2008) is O(n²).
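To illustrate the O(n) cost, the expected interaction needs only a handful of evaluations of v per unit. A minimal sketch follows; the additive toy game is an illustrative assumption, chosen because its units never interact, so the loss should be exactly zero:

```python
def interaction_loss(n, v):
    """1/(n-1) * E_i[ v(Omega) - v(Omega\\{i}) - v({i}) + v(empty) ].
    Uses O(n) evaluations of v, instead of the O(2^n) subset enumeration
    a naive pairwise Shapley interaction computation would require."""
    omega = frozenset(range(n))
    v_full, v_empty = v(omega), v(frozenset())
    total = sum(v_full - v(omega - {i}) - v(frozenset({i})) + v_empty
                for i in range(n))
    return total / (n * (n - 1))

# Additive game: v(S) is a sum of per-unit scores, so no unit interacts
# with any other and the interaction loss is exactly zero.
weights = [0.5, -1.0, 2.0, 0.25]
v_additive = lambda S: sum(weights[i] for i in S)
loss = interaction_loss(4, v_additive)
```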

B COMPARISONS BETWEEN INTERACTIONS INSIDE PERTURBATIONS OF DIFFERENT ATTACKS

We have theoretically proved that some classical attacking methods of boosting the adversarial transferability essentially decrease interactions inside perturbations. Besides the theoretical proofs in Appendix I and Appendix H, we also conduct experiments to compare interactions between perturbation units when we generate adversarial perturbations with and without these attacking methods. Such experiments further verify that these methods of boosting the transferability essentially decrease interactions. We conduct attacks on four DNNs with the validation set of the ImageNet dataset, and measure the average interaction inside perturbation units. As Table 5 shows, the SGM Attack and the VR Attack decrease interactions inside perturbations.

C ADVERSARIAL ATTACK

In general, the objective of an adversarial attack can be formulated as the following optimization problem:

maximize_δ ℓ(h(x + δ), y), s.t. ‖δ‖_p ≤ ε, x + δ ∈ [0, 1]^n, (7)

where ℓ(h(x + δ), y) is the classification loss. There are many ways to solve the above optimization problem under different norm constraints ‖·‖_p (Goodfellow et al., 2014; Carlini & Wagner, 2017; Kurakin et al., 2017; Madry et al., 2018; Chen et al., 2018b; Wang et al., 2019). Optimization-based approach. One approach to approximately solving Equation (7) is to solve the following relaxed form: minimize_δ {-ℓ(h(x + δ), y) + c · ‖δ‖_p}, s.t. x + δ ∈ [0, 1]^n, where c > 0 is a scalar constant that balances the classification loss and the norm constraint. Szegedy et al. (2013); Carlini & Wagner (2017) have demonstrated the effectiveness of this method. Projected gradient descent (PGD) (Madry et al., 2018). The PGD Attack is usually considered one of the simplest and most widely used baselines for adversarial attacking. In this paper, this method is called the Baseline. The PGD Attack directly optimizes the classification loss in Equation (7). Considering the norm constraint, after each updating step, the PGD Attack projects the adversarial perturbation δ back onto the ε-ball if the perturbation goes beyond the ball. PGD updates adversarial perturbations in each step as follows:

δ^{t+1} = Π_ε^{(∞)}(δ^t + α · sign(∇ℓ(h(x + δ^t), y))), if p = ∞;
δ^{t+1} = Π_ε^{(2)}(δ^t + α · ∇ℓ(h(x + δ^t), y) / ‖∇ℓ(h(x + δ^t), y)‖₂), if p = 2,

where δ^t denotes the perturbation at the t-th step, and α is the step size. Π_ε^{(∞)} and Π_ε^{(2)} are projection operations, which project the perturbation δ back onto the ε-ball if the perturbation goes beyond the ball. Given δ ∈ R^n,

Π_ε^{(∞)}(δ)_i = ε · sign(δ_i) if |δ_i| > ε, and δ_i if |δ_i| ≤ ε;
Π_ε^{(2)}(δ) = ε · δ/‖δ‖₂ if ‖δ‖₂ > ε, and δ if ‖δ‖₂ ≤ ε.
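The L∞ branch of the PGD update above can be sketched in a few lines of numpy. This is a minimal sketch, not the paper's implementation; `grad` stands in for the true loss gradient ∇ℓ(h(x + δ), y), and the shapes and hyper-parameters are illustrative:

```python
import numpy as np

def pgd_linf_step(x, delta, grad, alpha, eps):
    """One PGD step under the L_inf constraint: signed gradient ascent step,
    projection back onto the eps-ball (the Pi^(inf) operation), then a clip
    so that x + delta remains a valid image in [0, 1]^n."""
    delta = delta + alpha * np.sign(grad)
    delta = np.clip(delta, -eps, eps)           # Pi^(inf): project onto the eps-ball
    delta = np.clip(x + delta, 0.0, 1.0) - x    # keep x + delta in [0, 1]^n
    return delta

x = np.full(8, 0.5)          # toy "image" of 8 pixels
delta = np.zeros(8)
grad = np.ones(8)            # stand-in for the true loss gradient
delta = pgd_linf_step(x, delta, grad, alpha=2/255, eps=16/255)
```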

D EQUIVALENT FORMS OF THE INTERACTION

In Section 3.1, the interaction between units i, j is defined as the additional contribution as follows:

I_ij(δ) = φ(S_ij | Ω′) - [φ(i | Ω\{j}) + φ(j | Ω\{i})], (11)

where φ(S_ij | Ω′) denotes the joint contribution of i, j, when the perturbation units i, j are regarded as a singleton unit S_ij = {i, j}, as follows:

φ(S_ij | Ω′) = Σ_{S ⊆ Ω\{i,j}} [|S|! (n - |S| - 2)! / (n - 1)!] · (v(S ∪ {i, j}) - v(S)),

where S_ij = {i, j} represents the coalition of the perturbation units i, j. In this game, because the perturbation units i, j are regarded as a singleton player, we can consider that there are only n - 1 players, and consequently the set of players changes to Ω′ = Ω\{i, j} ∪ {S_ij}. φ(i | Ω\{j}) and φ(j | Ω\{i}) represent the individual contributions of the units i and j, respectively, when the perturbation units i, j work individually. The individual contribution of the perturbation unit i, when the perturbation unit j is absent, is given as follows:

φ(i | Ω\{j}) = Σ_{S ⊆ Ω\{i,j}} [|S|! (n - |S| - 2)! / (n - 1)!] · (v(S ∪ {i}) - v(S)).

In this game, because the perturbation unit j is always absent, we can consider that there are only n - 1 players, and consequently the set of players changes to Ω\{j}. Similarly, the individual contribution of the perturbation unit j, when the perturbation unit i is absent, is given as follows:

φ(j | Ω\{i}) = Σ_{S ⊆ Ω\{i,j}} [|S|! (n - |S| - 2)! / (n - 1)!] · (v(S ∪ {j}) - v(S)).

In Section 1, the interaction between the perturbation units δ_i, δ_j is defined as the change of the importance φ_i of the i-th unit when the j-th unit δ_j is perturbed, w.r.t. the case when the j-th unit δ_j is not perturbed. If the perturbation δ_j on the j-th unit increases the importance φ_i of the i-th unit, then there is a positive interaction between δ_i and δ_j. If the perturbation δ_j decreases the importance φ_i, it indicates a negative interaction. Mathematically, this definition can be written as follows:

I′_ij(δ) = φ_{i, w/ j} - φ_{i, w/o j}, (12)

where φ_{i, w/ j} represents the importance of δ_i when δ_j is always present, and φ_{i, w/o j} represents the importance of δ_i when δ_j is always absent. When the perturbation unit j is always present, the contribution of the perturbation unit i is given as follows:

φ_{i, w/ j} = Σ_{S ⊆ Ω\{i,j}} [|S|! (n - |S| - 2)! / (n - 1)!] · (v(S ∪ {i, j}) - v(S ∪ {j})).

In this game, because the perturbation unit j is always present, we can consider that there are only n - 1 players. When the perturbation unit j is always absent, the contribution of the perturbation unit i is given as follows:

φ_{i, w/o j} = Σ_{S ⊆ Ω\{i,j}} [|S|! (n - |S| - 2)! / (n - 1)!] · (v(S ∪ {i}) - v(S)).

In this game, because the perturbation unit j is always absent, we can consider that there are only n - 1 players. The interaction in Equation (11) is equal to the interaction in Equation (12), i.e. I_ij(δ) = I′_ij(δ).
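The equality of the two definitions can be checked numerically by brute force on a small random game. In the sketch below (the game size and the random reward table are illustrative assumptions), both forms enumerate all S ⊆ Ω\{i, j} with the weight |S|!(n - |S| - 2)!/(n - 1)!:

```python
import itertools, math, random

def _weighted_sum(i, j, n, term):
    """Sum term(S) over all S in Omega\\{i,j}, weighted by |S|!(n-|S|-2)!/(n-1)!."""
    total = 0.0
    others = [p for p in range(n) if p not in (i, j)]
    for r in range(len(others) + 1):
        for S in itertools.combinations(others, r):
            S = frozenset(S)
            w = math.factorial(len(S)) * math.factorial(n - len(S) - 2) / math.factorial(n - 1)
            total += w * term(S)
    return total

def interaction_joint(i, j, n, v):
    # Equation (11): joint contribution minus the two individual contributions
    return _weighted_sum(i, j, n, lambda S:
        (v(S | {i, j}) - v(S)) - (v(S | {i}) - v(S)) - (v(S | {j}) - v(S)))

def interaction_presence(i, j, n, v):
    # Equation (12): importance of i with j present minus importance with j absent
    return _weighted_sum(i, j, n, lambda S:
        (v(S | {i, j}) - v(S | {j})) - (v(S | {i}) - v(S)))

random.seed(0)
n = 5
table = {frozenset(S): random.random()
         for r in range(n + 1) for S in itertools.combinations(range(n), r)}
v = lambda S: table[frozenset(S)]
gaps = [abs(interaction_joint(i, j, n, v) - interaction_presence(i, j, n, v))
        for i in range(n) for j in range(n) if i != j]
```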

E PROOF OF PROPOSITION 1

To simplify the problem setting, we do not consider some tricks in adversarial attacking, such as gradient normalization and the clip operation. In multi-step attacking, the final perturbation generated after t steps is given as follows:

δ^t_multi := α Σ_{t'=0}^{t-1} ∇_x ℓ(h(x + δ^{t'}_multi), y),

where α represents the step size, and ℓ(h(x), y) is referred to as the classification loss. To simplify the notation, we use g(x) to denote ∇_x ℓ(h(x), y), i.e. g(x) := ∇_x ℓ(h(x), y). Furthermore, we define the update of the perturbation with the multi-step attack at each step t as follows:

Δx^t_multi := α · g(x + δ^{t-1}_multi). (13)

In this way, the perturbation can be written as follows:

δ^t_multi = Δx^1_multi + Δx^2_multi + ⋯ + Δx^t_multi. (14)

Lemma 1. Given the sample x ∈ R^n and the adversarial perturbation δ ∈ R^n, we use Ω = {1, 2, . . . , n} to denote the set of all perturbation units. The score function is denoted by v(S) = L(x + δ^{(S)}), where δ^{(S)} satisfies ∀i ∈ S, δ^{(S)}_i = δ_i; ∀i ∉ S, δ^{(S)}_i = 0. The Shapley interaction between perturbation units a, b can be written as

I_ab = δ_a H_ab(x) δ_b + R̃₂(δ),

where H_ab(x) = ∂²L(x)/(∂x_a ∂x_b) represents an element of the Hessian matrix, and R̃₂(δ) denotes terms with elements in δ of higher than the second order.

Proof. The Shapley interaction between perturbation units a, b is

I_ab(δ) = Σ_{S ⊆ Ω\{a,b}} [|S|! (n - |S| - 2)! / (n - 1)!] · [v(S ∪ {a, b}) - v(S ∪ {b}) - v(S ∪ {a}) + v(S)],

where v(S) = L(x + δ^{(S)}). Here, the classification loss can be approximated using the Taylor series as L(x + δ) = L(x) + gᵀ(x)δ + ½ δᵀ H(x) δ + R₂(δ). Thus, ∀S' ⊆ Ω,

v(S') = L(x) + Σ_{a'∈S'} g_{a'}(x) δ_{a'} + ½ Σ_{a',b'∈S'} δ_{a'} H_{a'b'}(x) δ_{b'} + R₂^{S'}(δ),

where R₂^{S'}(δ) denotes terms with elements in δ^{(S')} of higher than the second order. Substituting this expansion into the four terms v(S ∪ {a, b}), v(S ∪ {b}), v(S ∪ {a}), and v(S), all first-order terms and all second-order terms not involving both a and b cancel out, so the Shapley interaction I_ab is given as

I_ab(δ) = Σ_{S ⊆ Ω\{a,b}} [|S|! (n - |S| - 2)! / (n - 1)!] · δ_a H_ab(x) δ_b + Σ_{S ⊆ Ω\{a,b}} [|S|! (n - |S| - 2)! / (n - 1)!] · [R₂^{(S∪{a,b})}(δ) - R₂^{(S∪{a})}(δ) - R₂^{(S∪{b})}(δ) + R₂^{(S)}(δ)]
= Σ_{s=0}^{n-2} Σ_{S ⊆ Ω\{a,b}, |S|=s} [s! (n - s - 2)! / (n - 1)!] · δ_a H_ab(x) δ_b + R̃₂(δ)
= Σ_{s=0}^{n-2} [(n - 2)! / (s! (n - s - 2)!)] · [s! (n - s - 2)! / (n - 1)!] · δ_a H_ab(x) δ_b + R̃₂(δ)
= δ_a H_ab(x) δ_b + R̃₂(δ),

where R̃₂(δ) denotes terms with elements in δ of higher than the second order.

Lemma 2. The update of the perturbation with the multi-step attack at step t defined in Equation (13) can be written as

Δx^t_multi = α [I + αH(x)]^{t-1} g(x) + R̃₁^t,

where g(x) := ∇_x ℓ(h(x), y) represents the gradient, and H(x) := ∇²_x ℓ(h(x), y) represents the Hessian matrix. R̃₁^t denotes terms with elements in δ^{t-1}_multi of higher than the first order.

Proof. If t = 1, Δx^1_multi = α · g(x). Suppose ∀t' < t, Δx^{t'}_multi = α [I + αH(x)]^{t'-1} g(x) + R̃₁^{t'}. Then we have

Δx^t_multi = α · g(x + δ^{t-1}_multi) // according to Equation (13)
= α · g(x + Δx^1_multi + Δx^2_multi + ⋯ + Δx^{t-1}_multi) // according to Equation (14)
= α · g(x + α [I + (I + αH(x)) + (I + αH(x))² + ⋯ + (I + αH(x))^{t-2}] g(x) + Σ_{t'=1}^{t-1} R̃₁^{t'}),

where R̃₁^{t'} denotes terms with elements in δ^{t'-1}_multi of higher than the first order. Using the Taylor series, we get

Δx^t_multi = α · g(x) + α² H(x) T(x) + αH(x) Σ_{t'=1}^{t-1} R̃₁^{t'} + R̄₁^{t-1}, (15)

where R̄₁^{t-1} denotes terms with elements in δ^{t-1}_multi of higher than the first order. T(x) in Equation (15) is given as follows:

T(x) = [I + (I + αH(x)) + (I + αH(x))² + ⋯ + (I + αH(x))^{t-2}] g(x). (16)
Multiplying both sides of Equation (16) by (I + αH(x)), we get

(I + αH(x)) T(x) = [(I + αH(x)) + (I + αH(x))² + ⋯ + (I + αH(x))^{t-1}] g(x). (17)

Then, subtracting Equation (16) from Equation (17), we get

α H(x) T(x) = [(I + αH(x))^{t-1} - I] g(x). (18)

Substituting Equation (18) back into Equation (15), we have Δx^t_multi = α [I + αH(x)]^{t-1} g(x) + R̃₁^t. In this way, we have proved that ∀t ≥ 1, Δx^t_multi = α [I + αH(x)]^{t-1} g(x) + R̃₁^t.

Proposition 1. The adversarial perturbation generated by the multi-step attack via gradient descent is given as δ^m_multi = α Σ_{t=0}^{m-1} ∇_x ℓ(h(x + δ^t_multi), y), where δ^t_multi denotes the perturbation after the t-th step of updating, and m is referred to as the total number of steps. The adversarial perturbation generated by the single-step attack is given as δ_single = αm ∇_x ℓ(h(x), y). The expectation of interactions between perturbation units in δ^m_multi, E_{a,b}[I_ab(δ^m_multi)], is larger than E_{a,b}[I_ab(δ_single)], i.e. E_{a,b}[I_ab(δ^m_multi)] ≥ E_{a,b}[I_ab(δ_single)].

E.1 FAIRNESS OF COMPARISONS OF INTERACTIONS INSIDE DIFFERENT PERTURBATIONS

Proposition 1 is valid for different loss functions used to generate adversarial perturbations. In this section, we discuss the fairness of comparisons of interactions inside different perturbations. When we compare interactions inside different perturbations, the magnitudes of these perturbations should be similar, because the comparison of interactions between adversarial perturbations of different magnitudes is not fair. For fair comparisons, in Section 3.1, this paper controls the magnitude of the single-step attack by setting its step size to αm, where α and m denote the step size and the total number of steps of the multi-step attack, respectively. The equivalent step size αm makes the magnitude of perturbations generated by the single-step attack similar to that of perturbations generated by the multi-step attack, when we use the target score before the softmax layer to generate adversarial perturbations, such as ℓ̃(h(x), y) = max_{y'≠y} h_{y'}(x) - h_y(x). In this case, the magnitude of the gradient ∇_x ℓ̃(h(x), y) is relatively stable. In particular, this type of loss has been widely used. For example, one of the most widely used attacks (Carlini & Wagner, 2017) uses the score before the softmax layer for targeted attacking.

E.2 PROOF OF PROPOSITION 1

Proof. According to Lemma 2, the update of the perturbation with the multi-step attack at step t is given as follows:

Δx^t_multi = α [I + αH(x)]^{t-1} g(x) + R̃₁^t, (19)

where R̃₁^t denotes terms with elements in δ^{t-1}_multi of higher than the first order, and α represents the step size. To simplify the notation without causing ambiguity, we write g(x) and H(x) as g and H, respectively. In this way, according to Equation (14) and Equation (19), δ^m_multi can be written as follows:

δ^m_multi = α [I + (I + αH) + (I + αH)² + ⋯ + (I + αH)^{m-1}] g + Σ_{t=1}^m R̃₁^t
= α [mI + (α m(m-1)/2) H + . . .] g + Σ_{t=1}^m R̃₁^t, (20)

where m represents the total number of steps. According to Lemma 1, the Shapley interaction between perturbation units a, b in δ^m_multi is given as follows:

I_ab(δ^m_multi) = δ^m_multi,a H_ab δ^m_multi,b + R̃₂(δ^m_multi), (21)

where R̃₂(δ^m_multi) denotes terms with elements in δ^m_multi of higher than the second order. According to Equation (20) and Equation (21), we have

I_ab(δ^m_multi) = H_ab [αm g_a + (α² m(m-1)/2) Σ_{b'=1}^n (H_{ab'} g_{b'}) + ⋯ + Σ_{t=1}^m o(δ^t_multi,a)] · [αm g_b + (α² m(m-1)/2) Σ_{a'=1}^n (H_{a'b} g_{a'}) + ⋯ + Σ_{t=1}^m o(δ^t_multi,b)] + R̃₂(δ^m_multi),

where o(δ^t_multi,a) denotes terms of δ^t_multi,a of higher than the first order, which correspond to the term R̃₁^t in Equation (20). Expanding the product and sorting terms by their order w.r.t. elements in H, we obtain

I_ab(δ^m_multi) = α² m² g_a g_b H_ab // first-order terms w.r.t. elements in H
+ (α³ (m-1)m²/2) [g_b Σ_{b'=1}^n (H_{ab'} g_{b'}) + g_a Σ_{a'=1}^n (H_{a'b} g_{a'})] H_ab // second-order terms w.r.t. elements in H
+ (α⁴ (m-1)² m²/4) Σ_{b'=1}^n (H_{ab'} g_{b'}) Σ_{a'=1}^n (H_{a'b} g_{a'}) H_ab + . . . // absorbed into R₂^multi(H)
+ [Σ_{t=1}^m o(δ^t_multi,a)] H_ab δ^m_multi,b + [Σ_{t=1}^m o(δ^t_multi,b)] H_ab δ^m_multi,a + R̃₂(δ^m_multi) // absorbed into R̄₂(δ^m_multi)
= α² m² g_a g_b H_ab + (α³ (m-1)m²/2) g_a H_ab Σ_{a'=1}^n (H_{a'b} g_{a'}) + (α³ (m-1)m²/2) g_b H_ab Σ_{b'=1}^n (H_{ab'} g_{b'}) + R̄₂(δ^m_multi) + R₂^multi(H), (22)

where R₂^multi(H) represents terms with elements in H of higher than the second order, and R̄₂(δ^m_multi) represents terms with elements in δ^m_multi of higher than the second order. Let us consider the single-step attack. When we compare interactions inside different perturbations, the magnitudes of these perturbations should be similar, because the comparison of interactions between adversarial perturbations of different magnitudes is not fair. For fair comparisons, in Section 3.1, this paper controls the magnitude of the single-step attack as follows. The single-step attack only uses the gradient information at the original input x, which generates adversarial perturbations as δ_single = αm g. Therefore, according to Lemma 1, the interaction between perturbation units a, b of δ_single is given as follows:

I_ab(δ_single) = δ_single,a H_ab δ_single,b + R̃₂(δ_single) = m² α² g_a g_b H_ab + R̃₂(δ_single), (23)

where R̃₂(δ_single) denotes terms with elements in δ_single of higher than the second order. In this way, according to Equation (22) and Equation (23), the expectation of the difference between I_ab(δ^m_multi) and I_ab(δ_single) is given as follows:

E_{a,b}[I_ab(δ^m_multi) - I_ab(δ_single)]
= E_{a,b}[(α³(m-1)m²/2) g_a H_ab Σ_{a'=1}^n (H_{a'b} g_{a'}) + (α³(m-1)m²/2) g_b H_ab Σ_{b'=1}^n (H_{ab'} g_{b'}) + R̄₂(δ^m_multi) + R₂^multi(H) - R̃₂(δ_single)]
= (α³(m-1)m²/2) E_{a,b}[U_ab + U_ba] + E_{a,b}[R_ab],

where U_ab = g_a H_ab Σ_{a'=1}^n (H_{a'b} g_{a'}), U_ba = g_b H_ab Σ_{b'=1}^n (H_{ab'} g_{b'}), and R_ab = R̄₂(δ^m_multi) + R₂^multi(H) - R̃₂(δ_single).

Assumption 1: The magnitudes of elements in the Hessian matrix H(x) are so small that |H_ab(x)| ≪ 1, for 1 ≤ a, b ≤ n. Therefore, H^k(x) ≈ 0 if k > 2. We verify this assumption by directly measuring the value of H_ab(x). To compute the Hessian matrix, we replace the ReLU operation with the softplus operation σ(x) = (1/β) log(1 + e^{βx}). We train VGG-16, ResNet-32, and DenseNet-121 on the CIFAR-10 dataset (Krizhevsky et al., 2009), and use the cross-entropy loss as the classification loss. As Figure 4 (a) shows, the value of H_ab(x) is very small, i.e. |H_ab(x)| ≪ 1.

According to Assumption 1, we have R₂^multi(H) ≈ 0. Note that the magnitudes of δ^m_multi and δ_single are small, so R̄₂(δ^m_multi) ≈ 0 and R̃₂(δ_single) ≈ 0. In this way, we have E_{a,b}[R_ab] = E_{a,b}[R̄₂(δ^m_multi) - R̃₂(δ_single) + R₂^multi(H)] ≈ 0. Moreover, for the expectation of U_ab, we have

E_{a,b}[U_ab] = (1/(n(n-1))) Σ_b Σ_{a≠b} g_a H_ab Σ_{a'=1}^n (H_{a'b} g_{a'}) = (1/(n(n-1))) Σ_b [(A - B) · A],

where A = Σ_{a=1}^n g_a H_ab and B = g_b H_bb. Let us focus on the terms A and B. Note that A is the sum of n terms (n is large), whereas B is just a single term in A. Therefore, the sign of A - B is usually dominated by the term A. In this way, we get Prob[sign(A - B) = sign(A)] ≈ 1. Therefore, Prob[(A - B)A ≥ 0] ≈ 1. Similarly, E_{a,b}[U_ba] = E_{a,b}[U_ab]. Therefore,

E_{a,b}[I_ab(δ_multi) - I_ab(δ_single)] = (α³(m-1)m²/2) E_{a,b}[U_ab + U_ba] + E_{a,b}[R_ab] ≈ α³(m-1)m² E_{a,b}[U_ab] + 0 ≥ 0.

E.3 VERIFICATION OF PROPOSITION 1

We verify that perturbations generated by the multi-step attack tend to exhibit larger interactions than those generated by the single-step attack by measuring the value of E_b[I_ab]. As shown in Appendix F, we prove that E_b[I_ab] = v(Ω) - v(Ω\{a}) - v({a}) + v(∅). Because image data is high-dimensional, the cost of computing E_b[I_ab] is high. As Appendix J.1 demonstrates, given the input image, we can measure the interaction at the grid level, instead of the pixel level, to reduce the computational cost. Therefore, we divide the input image into 16×16 (L = 16) grids, and use Equation (39) to compute the interaction as

E_{(p',q')}[I_{(p,q),(p',q')}(δ)] = v(Λ) - v(Λ\{Λ_pq}) - v({Λ_pq}) + v(∅),

where (p, q) denotes the coordinate of a grid. The experiments were conducted with ImageNet validation images on ResNet-32 and DenseNet-121. For fair comparisons, the magnitude of perturbations generated by the single-step attack was controlled to be the same as that of perturbations generated by the multi-step attack. As Figure 5 (left) shows, perturbations generated by the multi-step attack tend to exhibit larger interactions than those generated by the single-step attack.
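The grid-level computation can be sketched as follows. This is a toy sketch, not the paper's implementation: `score` stands in for the model output underlying v, and the linear toy score, under which all interactions vanish exactly, is an illustrative assumption used as a sanity check:

```python
import numpy as np

def grid_interaction(x, delta, score, L=16):
    """Average grid-level interaction E_pq[ v(Lam) - v(Lam\\{pq}) - v({pq}) + v(empty) ].
    All perturbation units inside one grid are kept or zeroed jointly."""
    Hp, Wp = x.shape
    gh, gw = Hp // L, Wp // L
    def v(keep):  # keep: boolean L x L mask over grids
        mask = np.kron(keep.astype(x.dtype), np.ones((gh, gw), dtype=x.dtype))
        return score(x + mask * delta)
    full, empty = np.ones((L, L), bool), np.zeros((L, L), bool)
    vals = []
    for p in range(L):
        for q in range(L):
            solo = empty.copy(); solo[p, q] = True
            drop = full.copy(); drop[p, q] = False
            vals.append(v(full) - v(drop) - v(solo) + v(empty))
    return float(np.mean(vals))

# Sanity check with a linear score: contributions are additive across grids,
# so every grid-level interaction is zero.
x = np.arange(16, dtype=float).reshape(4, 4) / 16
delta = np.full((4, 4), 0.05)
score = lambda im: float(im.sum())
val = grid_interaction(x, delta, score, L=2)
```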

E.4 PERTURBATIONS GENERATED BY THE MULTI-STEP ATTACK TEND TO EXHIBIT LARGER INTERACTION THAN GAUSSIAN NOISE

Moreover, we compare the interaction inside perturbation units generated by the multi-step attack with that inside Gaussian noise perturbations. Similarly, for fair comparisons, the magnitude of the Gaussian noise is controlled to be similar to that of perturbations generated by the multi-step attack. As Figure 5 (right) shows, perturbations generated by the multi-step attack tend to exhibit larger interactions than Gaussian noise.

F EXPECTATION OF THE SHAPLEY INTERACTION

In Equation (3), the Shapley interaction between two perturbation units i, j is given as follows:

I_ij(δ) = φ(S_ij | Ω\{i, j} ∪ {S_ij}) - (φ(i | Ω\{j}) + φ(j | Ω\{i})),

where φ(S_ij | Ω\{i, j} ∪ {S_ij}) is the Shapley value of the singleton unit S_ij = {i, j}, when the perturbation units i, j form a coalition. φ(i | Ω\{j}) and φ(j | Ω\{i}) are the Shapley values of the perturbation units i, j, when these two perturbation units work individually. In this way, we can write the Shapley interaction in a closed form as follows:

I_ij(δ) = Σ_{S ⊆ Ω\{i,j}} [|S|! (n - |S| - 2)! / (n - 1)!] · [v(S ∪ {i, j}) - v(S ∪ {j}) - v(S ∪ {i}) + v(S)],

where ∀S ⊆ Ω, v(S) = max_{y'≠y} h^{(s)}_{y'}(x + δ^{(S)}) - h^{(s)}_y(x + δ^{(S)}). The expectation of the interaction is given as follows:

E_{i,j}[I_ij(δ)] = (1/(n-1)) E_i[v(Ω) - v(Ω\{i}) - v({i}) + v(∅)],

which is proved as follows.

Proof. As proved in Appendix D, the two definitions of the interaction in Equation (11) and Equation (12) are equivalent. Therefore, the interaction between players i and j is given as follows:

I_ij(δ) = Σ_{S ⊆ Ω\{i,j}} [|S|! (n - |S| - 2)! / (n - 1)!] · [(v(S ∪ {i, j}) - v(S ∪ {i})) - (v(S ∪ {j}) - v(S))] = φ_{j, w/ i} - φ_{j, w/o i}.

The expectation of the interaction can be written as follows:

E_{i,j}[I_ij(δ)] = (1/(n-1)) E_i[Σ_{j∈Ω\{i}} (φ_{j, w/ i} - φ_{j, w/o i})].

According to the efficiency property of Shapley values (please refer to Appendix A.1 for details):

Σ_{j∈Ω\{i}} φ_{j, w/ i} = v(Ω) - v({i}),
Σ_{j∈Ω\{i}} φ_{j, w/o i} = v(Ω\{i}) - v(∅).

In this way, E_{i,j}[I_ij(δ)] = (1/(n-1)) E_i[v(Ω) - v(Ω\{i}) - v({i}) + v(∅)].
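This identity can be verified by brute force on a small random game: the left-hand side averages the closed-form pairwise interaction over all ordered pairs i ≠ j, while the right-hand side needs only O(n) evaluations of v. The game size and the random reward table below are illustrative assumptions:

```python
import itertools, math, random

def avg_pairwise_interaction(n, v):
    """Average of I_ij over all ordered pairs i != j, via the closed form."""
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            others = [p for p in range(n) if p not in (i, j)]
            for r in range(len(others) + 1):
                for S in itertools.combinations(others, r):
                    S = frozenset(S)
                    w = math.factorial(len(S)) * math.factorial(n - len(S) - 2) / math.factorial(n - 1)
                    total += w * (v(S | {i, j}) - v(S | {j}) - v(S | {i}) + v(S))
    return total / (n * (n - 1))

def avg_interaction_linear(n, v):
    """1/(n-1) * E_i[ v(Omega) - v(Omega\\{i}) - v({i}) + v(empty) ]."""
    omega = frozenset(range(n))
    return sum(v(omega) - v(omega - {i}) - v(frozenset({i})) + v(frozenset())
               for i in range(n)) / (n * (n - 1))

random.seed(1)
n = 5
table = {frozenset(S): random.random()
         for r in range(n + 1) for S in itertools.combinations(range(n), r)}
v = lambda S: table[frozenset(S)]
gap = abs(avg_pairwise_interaction(n, v) - avg_interaction_linear(n, v))
```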

G DETAILS OF OBSERVING THE NEGATIVE CORRELATION BETWEEN THE TRANSFERABILITY AND THE INTERACTION

In Section 3.2, we directly measure the transfer utility and interactions of different adversarial perturbations. Here, we give more details of the experiments. We measure the transfer utility as

Transfer Utility = [max_{y'≠y} h^{(t)}_{y'}(x + δ) - h^{(t)}_y(x + δ)] - [max_{y'≠y} h^{(t)}_{y'}(x) - h^{(t)}_y(x)].

We measure the interaction as E_{i,j}[I_ij(δ)] = (1/(n-1)) E_i[v(Ω) - v(Ω\{i}) - v({i}) + v(∅)]. As Appendix J.1 demonstrates, to reduce the computational cost, given the input image, we can measure the interaction at the grid level, instead of the pixel level. Therefore, we divide the input image into 16×16 (L = 16) grids, and use Equation (39) to compute the interaction as

E_{(p,q),(p',q')}[I_{(p,q),(p',q')}(δ)] = (1/(L² - 1)) E_{(p,q)}[v(Λ) - v(Λ\{Λ_pq}) - v({Λ_pq}) + v(∅)],

where (p, q) denotes the coordinate of a grid. Using the validation set of the ImageNet dataset (Russakovsky et al., 2015), we generate adversarial perturbations on four types of DNNs, including ResNet-34/152 (RN-34/152) (He et al., 2016) and DenseNet-121/201 (DN-121/201) (Huang et al., 2017). We transfer adversarial perturbations generated on each ResNet to DenseNets. Similarly, we also transfer adversarial perturbations generated on each DenseNet to ResNets. Given an input image x, adversarial perturbations are generated using Equation (8), where γ ∈ R is a constant. In our experiments, we set γ = 0.6. For fair comparisons, we need to ensure that adversarial perturbations generated with different hyper-parameters c are comparable with each other. Thus, we select a constant τ and use ‖δ‖₂ = τ as the stopping criterion of all adversarial attacks. We set the number of steps to 1000. The threshold τ is set to ensure that attacks with different hyper-parameters are almost converged when the L₂ norm of the perturbation ‖δ‖₂ reaches τ.
Note that different attacking methods may successfully attack different sets of testing samples, so we select testing samples that can be successfully attacked by all attacking methods with different c_k values (i.e. those having reached the stopping criterion under all attacks). The interaction and the transfer utility reported in Figure 1 are measured on the selected samples for fair comparisons.
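The transfer utility above is just a difference of two logit margins evaluated on the target DNN. A minimal sketch follows; the logit vectors are illustrative stand-ins for the target model's output h^(t):

```python
import numpy as np

def margin(logits, y):
    """max_{y' != y} h_{y'} - h_y, computed on the scores before the softmax layer."""
    rival = np.max(np.delete(logits, y))
    return float(rival - logits[y])

def transfer_utility(logits_clean, logits_adv, y):
    """Margin on the adversarial input minus margin on the clean input."""
    return margin(logits_adv, y) - margin(logits_clean, y)

clean = np.array([2.0, 1.0, 0.0])   # stand-in target-DNN logits on x
adv = np.array([0.0, 3.0, 0.0])     # stand-in target-DNN logits on x + delta
utility = transfer_utility(clean, adv, y=0)  # (3 - 0) - (1 - 2) = 4.0
```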

H PROOF OF PROPOSITION 2

To simplify the problem setting, we do not consider some tricks in adversarial attacking, such as gradient normalization and the clip operation. In the VR Attack (Wu et al., 2018), the final perturbation generated after t steps is given as follows:

δ^t_vr := α Σ_{t'=0}^{t-1} ∇_x ℓ̂(h(x + δ^{t'}_vr), y), (27)

where ℓ̂(h(x), y) = E_{ξ∼N(0,σ²I)}[ℓ(h(x + ξ), y)], and α represents the step size. According to Equation (27), the gradient and the Hessian matrix of ℓ̂(h(x), y) are given as follows:

ĝ(x) = ∇_x ℓ̂(h(x), y) = E_{ξ∼N(0,σ²I)}[∇_x ℓ(h(x + ξ), y)],
Ĥ(x) = ∇²_x ℓ̂(h(x), y) = E_{ξ∼N(0,σ²I)}[∇²_x ℓ(h(x + ξ), y)]. (28)

Lemma 3. Given the Gaussian smoothed loss ℓ̂(h(x), y) = E_{ξ∼N(0,σ²I)}[ℓ(h(x + ξ), y)], where ℓ(h(x), y) is the original classification loss, ∀a ≠ b, ∀c ≠ a, we have

E_x[ĝ²_a(x) Ĥ²_ab(x) - g²_a(x) H²_ab(x)] ≤ 0,
E_x[ĝ_a(x) ĝ_b(x) Ĥ_ab(x) - g_a(x) g_b(x) H_ab(x)] = 0,
E_x[ĝ_a(x) ĝ_c(x) Ĥ_ab(x) Ĥ_cb(x) - g_a(x) g_c(x) H_ab(x) H_cb(x)] = 0.

Proof. According to Equation (28), we have

ĝ_a(x) = E_{ξ∼N(0,σ²I)}[g_a(x + ξ)] = E_{x'∼N(x,σ²I)}[g_a(x')],
Ĥ_ab(x) = E_{ξ∼N(0,σ²I)}[∂g_a(x + ξ)/∂x_b] = E_{x'∼N(x,σ²I)}[H_ab(x')].

This indicates that the gradient and the Hessian matrix in the VR Attack are both smoothed by the Gaussian noise. Because the Lipschitz constants of g_a(x) and H_ab(x) are usually limited to a certain range, we can ignore the tiny probability of large gradients and large elements in the Hessian matrix, and roughly assume that g_a(x) ∼ N(ĝ_a(x), σ²_{g_a}) and H_ab(x) ∼ N(Ĥ_ab(x), σ²_{H_ab}), where σ_{g_a}, σ_{H_ab} ∈ R are two constants denoting the standard deviations. Thus, g_a(x) and H_ab(x) can be written as follows:

g_a(x) = ĝ_a(x) + ε_{g_a}, ε_{g_a} ∼ N(0, σ²_{g_a}),
H_ab(x) = Ĥ_ab(x) + ε_{H_ab}, ε_{H_ab} ∼ N(0, σ²_{H_ab}). (29)

To simplify the notation without causing ambiguity, we write ĝ(x) and Ĥ(x) as ĝ and Ĥ, and g(x) and H(x) as g and H, respectively.
In this way, treating the noise terms ε_{g_a}, ε_{g_b}, ε_{H_ab} as zero-mean and independent of each other, we have

E_x[ĝ²_a Ĥ²_ab - g²_a H²_ab] = E_x[ĝ²_a Ĥ²_ab - E_{ε_{g_a}, ε_{H_ab}}[(ĝ_a + ε_{g_a})² (Ĥ_ab + ε_{H_ab})²]]
= -E_x[ĝ²_a σ²_{H_ab} + Ĥ²_ab σ²_{g_a} + σ²_{g_a} σ²_{H_ab}] ≤ 0,

because every cross term that is linear in a zero-mean noise vanishes in expectation. According to Equation (29), we have g_a = ĝ_a + ε_{g_a} and g_b = ĝ_b + ε_{g_b}. Thus, we have

E_x[ĝ_a ĝ_b Ĥ_ab - g_a g_b H_ab] = E_x[ĝ_a ĝ_b Ĥ_ab - E[(ĝ_a + ε_{g_a})(ĝ_b + ε_{g_b})(Ĥ_ab + ε_{H_ab})]]
= -E_x[E[ε_{g_a}]E[ε_{g_b}]E[ε_{H_ab}] + E[ε_{g_a}]E[ε_{g_b}]Ĥ_ab + E[ε_{g_b}]E[ε_{H_ab}]ĝ_a + E[ε_{H_ab}]E[ε_{g_a}]ĝ_b + E[ε_{g_a}]ĝ_b Ĥ_ab + E[ε_{g_b}]ĝ_a Ĥ_ab + E[ε_{H_ab}]ĝ_a ĝ_b]
= 0,

because every remaining term contains the expectation of at least one zero-mean noise. Moreover, according to Equation (29), for a ≠ b and c ≠ a, expanding ĝ_a ĝ_c Ĥ_ab Ĥ_cb - (ĝ_a + ε_{g_a})(ĝ_c + ε_{g_c})(Ĥ_ab + ε_{H_ab})(Ĥ_cb + ε_{H_cb}) and taking the expectation, every remaining term again contains the expectation of at least one zero-mean noise, so

E_x[ĝ_a ĝ_c Ĥ_ab Ĥ_cb - g_a g_c H_ab H_cb] = 0.

Just like the conclusion in Equation (22), we can write the interaction between δ^m_vr,a and δ^m_vr,b as follows.
I_ab(δ^m_vr) = α² m² ĝ_a ĝ_b Ĥ_ab + (α³ (m-1)m²/2) ĝ_a Ĥ_ab Σ_{a'=1}^n (Ĥ_{a'b} ĝ_{a'}) + (α³ (m-1)m²/2) ĝ_b Ĥ_ab Σ_{b'=1}^n (Ĥ_{ab'} ĝ_{b'}) + R̃₂^vr(δ^m_vr) + R₂^vr(Ĥ), (30)

where α denotes the step size, and m denotes the total number of steps. To enable fair comparisons, we use the same step size α and the same number of steps m as the multi-step attack, so that the magnitude of δ_vr matches the magnitude of δ_multi. R₂^vr(Ĥ) represents terms with elements in Ĥ of higher than the second order, and R̃₂^vr(δ^m_vr) represents terms with elements in δ^m_vr of higher than the second order. In this way, according to Equation (30) and Equation (22), the expectation of the difference between I_ab(δ^m_vr) and I_ab(δ^m_multi) is given as follows:

E_x E_{a,b}[I_ab(δ_vr) - I_ab(δ_multi)] = (α³ (m-1)m²/2) E_{a,b} E_x[ĝ²_a Ĥ²_ab - g²_a H²_ab + ĝ²_b Ĥ²_ab - g²_b H²_ab] + E_{a,b} E_x[R^vr_ab],

where

R^vr_ab = (α³ (m-1)m²/2) [Σ_{a'∈{1,...,n}\{a}} (ĝ_a ĝ_{a'} Ĥ_ab Ĥ_{a'b} - g_a g_{a'} H_ab H_{a'b}) + Σ_{b'∈{1,...,n}\{b}} (ĝ_b ĝ_{b'} Ĥ_ab Ĥ_{ab'} - g_b g_{b'} H_ab H_{ab'})] + α² m² (ĝ_a ĝ_b Ĥ_ab - g_a g_b H_ab) + R̃₂^vr(δ^m_vr) - R̄₂(δ^m_multi) + R₂^vr(H) - R₂^multi(H),

in which we denote the first and the second sum by V_ab and V_ba, respectively. The expectation of R^vr_ab is given as follows:

E_x E_{a,b}[R^vr_ab] = (α³ (m-1)m²/2) E_{a,b}[E_x[V_ab] + E_x[V_ba] + (2/(α(m-1))) E_x[ĝ_a ĝ_b Ĥ_ab - g_a g_b H_ab]] + E_x E_{a,b}[R̃₂^vr(δ^m_vr) - R̄₂(δ^m_multi) + R₂^vr(H) - R₂^multi(H)] ≈ 0.

According to Assumption 1, we have R₂^vr(H) ≈ 0 and R₂^multi(H) ≈ 0. Note that the magnitudes of δ^m_vr and δ^m_multi are small, so R̃₂^vr(δ^m_vr) ≈ 0 and R̄₂(δ^m_multi) ≈ 0. According to Lemma 3, we have E_x[ĝ_a ĝ_b Ĥ_ab - g_a g_b H_ab] = 0 and E_x[ĝ_a ĝ_{a'} Ĥ_ab Ĥ_{a'b} - g_a g_{a'} H_ab H_{a'b}] = 0. Therefore, we get E_x[V_ab] = 0 and E_x[V_ba] = 0. In this way, E_x E_{a,b}[R^vr_ab] ≈ 0. Furthermore, according to Lemma 3, we have E_x[ĝ²_a Ĥ²_ab] - E_x[g²_a H²_ab] ≤ 0.
Therefore,
$$
\begin{aligned}
\mathbb{E}_x\mathbb{E}_{a,b}\big[I_{ab}(\delta^m_{vr}) - I_{ab}(\delta^m_{multi})\big] &= \frac{\alpha^3(m-1)m^2}{2}\,\mathbb{E}_{a,b}\Big[\mathbb{E}_x\big[\hat g_a^2\hat H_{ab}^2 - g_a^2H_{ab}^2\big] + \mathbb{E}_x\big[\hat g_b^2\hat H_{ab}^2 - g_b^2H_{ab}^2\big]\Big] + \mathbb{E}_x\mathbb{E}_{a,b}\big[R^{vr}_{ab}\big]\\
&\approx \frac{\alpha^3(m-1)m^2}{2}\,\mathbb{E}_{a,b}\Big[\mathbb{E}_x\big[\hat g_a^2\hat H_{ab}^2 - g_a^2H_{ab}^2\big] + \mathbb{E}_x\big[\hat g_b^2\hat H_{ab}^2 - g_b^2H_{ab}^2\big]\Big] + 0 \le 0.
\end{aligned}
$$

I PROOF OF PROPOSITION 3

To simplify the problem setting, we do not consider some tricks in adversarial attacking, such as gradient normalization and the clip operation. Note that the original MI Attack and the multi-step attack cannot be directly compared, since the magnitudes of the generated perturbations cannot be fairly controlled; the value of interactions is sensitive to the magnitude of perturbations, and comparing perturbations with different magnitudes is not fair. Thus, we slightly revise the MI Attack as
$$
g^t_{mi} \overset{\text{def}}{=} \mu g^{t-1}_{mi} + (1-\mu)\nabla_x \ell\big(h(x+\delta^{t-1}_{mi}), y\big), \tag{31}
$$
where $t$ denotes the step and $\mu = (t-1)/t$. $\ell(h(x), y)$ is referred to as the classification loss. To simplify the notation, we use $g(x)$ to denote $\nabla_x\ell(h(x),y)$, i.e. $g(x) \overset{\text{def}}{=} \nabla_x\ell(h(x),y)$. In the MI attack, the final perturbation generated after $t$ steps is given as follows.
$$
\delta^t_{mi} \overset{\text{def}}{=} \alpha\sum_{t'=0}^{t-1} g^{t'}_{mi}. \tag{33}
$$
In this way, we get
$$
\begin{aligned}
\Delta x^t_{mi} &= \alpha\cdot\frac{t-1}{t}\Big[I + \alpha\frac{t-2}{2}H(x) + R^{t-1}_1(H(x))\Big]g(x)\\
&\quad + \frac{\alpha}{t}\Big[I + H(x)\Big(\alpha(t-1) + \alpha\frac{(t-2)(t-1)}{4}H(x)\Big) + \sum_{t'=1}^{t-1}R^{t'}_1(H(x))\Big]g(x) + \frac{1}{t}\Big(\sum_{t'=1}^{t-1}\tilde R^{t'}_1 + r^{t-1}_1\Big)\\
&= \alpha\Bigg[I + \alpha\frac{t-1}{2}H(x) + \underbrace{\frac{t-1}{t}R^{t-1}_1(H(x)) + \alpha\frac{(t-2)(t-1)}{4t}H^2(x) + \frac{1}{t}H(x)\sum_{t'=1}^{t-1}R^{t'}_1(H(x))}_{R^t_1(H(x))}\Bigg]g(x) + \underbrace{\frac{1}{t}\Big(\sum_{t'=1}^{t-1}\tilde R^{t'}_1 + r^{t-1}_1\Big)}_{\tilde R^t_1}\\
&= \alpha\Big[I + \alpha\frac{t-1}{2}H(x) + R^t_1(H(x))\Big]g(x) + \tilde R^t_1,
\end{aligned}
$$
where $R^t_1(H(x))$ denotes terms with elements of $H(x)$ of higher than the first order, and $\tilde R^t_1$ denotes terms with elements of $\delta^{t-1}_{mi}$ of higher than the first order. In this way, we have proved that $\forall t\ge 1$, $\Delta x^t_{mi} = \alpha\big[I + \alpha\frac{t-1}{2}H(x) + R^t_1(H(x))\big]g(x) + \tilde R^t_1$.

Proposition 3. The adversarial perturbation generated by the multi-step attack is denoted by $\delta^m_{multi} = \alpha\sum_{t=0}^{m-1}\nabla_x\ell(h(x+\delta^t_{multi}), y)$.
The adversarial perturbation generated by the multi-step attack incorporating the momentum is computed as $\delta^m_{mi} = \alpha\sum_{t=0}^{m-1} g^t_{mi}$. Perturbation units of $\delta^m_{mi}$ exhibit smaller interactions than those of $\delta^m_{multi}$, i.e. $\mathbb{E}_{ij}[I_{ij}(\delta^m_{mi})]\le\mathbb{E}_{ij}[I_{ij}(\delta^m_{multi})]$.

Proof. According to Lemma 4, the update of the perturbation with the MI attack at step $t$ is given as follows.
$$
\Delta x^t_{mi} = \alpha\Big[I + \alpha\frac{t-1}{2}H(x) + R^t_1(H(x))\Big]g(x) + \tilde R^t_1, \tag{35}
$$
where $R^t_1(H(x))$ denotes terms with elements of $H(x)$ of higher than the first order, and $\tilde R^t_1$ denotes terms with elements of $\delta^{t-1}_{mi}$ of higher than the first order. To simplify the notation without causing ambiguity, we write $g(x)$ and $H(x)$ as $g$ and $H$, respectively. In this way, according to Equation (33) and Equation (35), $\delta^m_{mi}$ can be written as follows.
$$
\delta^m_{mi} = \alpha\Big[mI + \frac{\alpha m(m-1)}{4}H + \sum_{t=1}^m R^t_1(H)\Big]g + \sum_{t=1}^m \tilde R^t_1, \tag{36}
$$
where $m$ represents the total number of steps. According to Lemma 1, the Shapley interaction between perturbation units $a, b$ of $\delta^m_{mi}$ is given as follows.
$$
I_{ab}(\delta^m_{mi}) = \delta^m_{mi,a}\, H_{ab}\, \delta^m_{mi,b} + \tilde R_2(\delta^m_{mi}), \tag{37}
$$
where $\tilde R_2(\delta^m_{mi})$ denotes terms with elements of $\delta^m_{mi}$ of higher than the second order. According to Equation (36) and Equation (37), we get
$$
\begin{aligned}
I_{ab}(\delta^m_{mi}) &= H_{ab}\Big[\alpha m g_a + \frac{\alpha^2 m(m-1)}{4}\sum_{b'=1}^n\big(H_{ab'}g_{b'}\big) + \cdots + \sum_{t=1}^m o(\delta^t_{mi,a})\Big]\Big[\alpha m g_b + \frac{\alpha^2 m(m-1)}{4}\sum_{a'=1}^n\big(H_{a'b}g_{a'}\big) + \cdots + \sum_{t=1}^m o(\delta^t_{mi,b})\Big] + \tilde R_2(\delta^m_{mi})\\
&= \underbrace{\alpha^2m^2\, g_ag_bH_{ab}}_{\text{first-order terms w.r.t. elements of } H} + \underbrace{\Big[\frac{\alpha^3(m-1)m^2}{4}g_b\sum_{b'=1}^n\big(H_{ab'}g_{b'}\big) + \frac{\alpha^3(m-1)m^2}{4}g_a\sum_{a'=1}^n\big(H_{a'b}g_{a'}\big)\Big]H_{ab}}_{\text{second-order terms w.r.t. elements of } H}\\
&\quad + \frac{\alpha^4(m-1)^2m^2}{16}\sum_{b'=1}^n\big(H_{ab'}g_{b'}\big)\sum_{a'=1}^n\big(H_{a'b}g_{a'}\big)H_{ab} + \cdots
\end{aligned}
$$
Here $\sum_{t=1}^m o(\delta^t_{mi,a})$ denotes terms of $\delta^t_{mi,a}$ of higher than the first order, which correspond to the term $\tilde R^t_1$ in Equation (36); likewise for $\sum_{t=1}^m o(\delta^t_{mi,b})$.
The terms with elements of $H$ of higher than the second order are grouped into $R^{mi}_2(H)$, and the terms $\big[\sum_{t=1}^m o(\delta^t_{mi,a})\big]H_{ab}\delta^m_{mi,b} + \big[\sum_{t=1}^m o(\delta^t_{mi,b})\big]H_{ab}\delta^m_{mi,a} + \tilde R_2(\delta^m_{mi})$ are grouped into $\bar R_2(\delta^m_{mi})$. In this way,
$$
I_{ab}(\delta^m_{mi}) = \alpha^2m^2\, g_ag_bH_{ab} + \frac{\alpha^3(m-1)m^2}{4}g_aH_{ab}\sum_{a'=1}^n\big(H_{a'b}g_{a'}\big) + \frac{\alpha^3(m-1)m^2}{4}g_bH_{ab}\sum_{b'=1}^n\big(H_{ab'}g_{b'}\big) + \bar R_2(\delta^m_{mi}) + R^{mi}_2(H), \tag{38}
$$
where $R^{mi}_2(H)$ denotes terms with elements of $H$ of higher than the second order, and $\bar R_2(\delta^m_{mi})$ denotes terms with elements of $\delta^m_{mi}$ of higher than the second order.

According to Equation (22) and Equation (38), the expectation of the difference between $I_{ab}(\delta^m_{mi})$ and $I_{ab}(\delta^m_{multi})$ is given as follows.
$$
\mathbb{E}_{a,b}\big[I_{ab}(\delta^m_{mi}) - I_{ab}(\delta^m_{multi})\big] = -\frac{\alpha^3(m-1)m^2}{4}\,\mathbb{E}_{a,b}\Big[\underbrace{g_aH_{ab}\sum_{a'=1}^n\big(H_{a'b}g_{a'}\big)}_{U_{ab}} + \underbrace{g_bH_{ab}\sum_{b'=1}^n\big(H_{ab'}g_{b'}\big)}_{U_{ba}}\Big] + \mathbb{E}_{a,b}\big[R^{mi}_{ab}\big],
$$
where $R^{mi}_{ab} = \bar R_2(\delta^m_{mi}) - \bar R_2(\delta^m_{multi}) + R^{mi}_2(H) - R^{multi}_2(H)$. According to Assumption 1, we have $R^{mi}_2(H)\approx 0$ and $R^{multi}_2(H)\approx 0$. Note that the magnitudes of $\delta^m_{mi}$ and $\delta^m_{multi}$ are small; hence $\bar R_2(\delta^m_{mi})\approx 0$ and $\bar R_2(\delta^m_{multi})\approx 0$. Therefore, $\mathbb{E}_{a,b}[R^{mi}_{ab}] = \mathbb{E}_{a,b}[\bar R_2(\delta^m_{mi}) - \bar R_2(\delta^m_{multi}) + R^{mi}_2(H) - R^{multi}_2(H)]\approx 0$. Moreover, similar to Equation (24) in the proof of Proposition 1, we have $\mathbb{E}_{a,b}[U_{ab}] = \mathbb{E}_{a,b}[U_{ba}]\ge 0$.

Published as a conference paper at ICLR 2021.

Therefore,
$$
\mathbb{E}_{a,b}\big[I_{ab}(\delta^m_{mi}) - I_{ab}(\delta^m_{multi})\big] = -\frac{\alpha^3(m-1)m^2}{4}\,\mathbb{E}_{a,b}\big[U_{ab} + U_{ba}\big] + \mathbb{E}_{a,b}\big[R^{mi}_{ab}\big] \approx -\frac{\alpha^3(m-1)m^2}{2}\,\mathbb{E}_{a,b}\big[U_{ab}\big] + 0 \le 0.
$$
Note that Proposition 3 only shows that the revised MI Attack usually decreases the interaction between perturbation units. The proof for all variants of the MI Attack remains a challenge.
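The revised momentum coefficient $\mu = (t-1)/t$ turns the momentum accumulator into a plain running average of the step gradients, which is what keeps the magnitude of $\delta^m_{mi}$ comparable to that of $\delta^m_{multi}$. A minimal numerical sketch of this property (the function name and the toy gradients below are ours, not from the paper):

```python
import numpy as np

def revised_mi_momentum(grads):
    """Revised MI accumulator: g_t = mu * g_{t-1} + (1 - mu) * grad_t
    with mu = (t - 1) / t, so after t steps g_t equals the running mean
    of the first t gradients."""
    g = np.zeros_like(grads[0])
    for t, grad in enumerate(grads, start=1):
        mu = (t - 1) / t
        g = mu * g + (1 - mu) * grad
    return g

# Toy check: the accumulator coincides with the plain gradient average,
# so alpha * sum_t g_t has the same scale as the multi-step perturbation.
grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 0.0])]
g = revised_mi_momentum(grads)
```

Because $g^t_{mi} = \frac{1}{t}\sum_{t'\le t}\nabla_x\ell$ under this schedule, the comparison with the multi-step attack in Proposition 3 is magnitude-fair by construction.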

J IMPLEMENTATION OF THE INTERACTION-REDUCED ATTACK (IR ATTACK)

J.1 GRID-LEVEL INTERACTIONS FOR IMAGE DATA

Although the computation of $\mathbb{E}_{i,j}[I_{ij}(\delta)]$ can be simplified using Equation (26), its computational cost is still high. Therefore, as Figure 6 shows, exploiting the local property of images (Chen et al., 2018a), we divide the entire image into $L\times L$ grids and compute interactions at the grid level, instead of the pixel level. Let $\Lambda = \{\Lambda_{11}, \Lambda_{12}, \ldots, \Lambda_{LL}\}$ denote the set of grids, and let $(p,q)$ denote the coordinate of a grid. In this way, the expectation of interactions between perturbation grids is given as follows.
$$
\mathbb{E}_{(p,q),(p',q')}\big[I_{(p,q),(p',q')}(\delta)\big] = \frac{1}{L^2-1}\,\mathbb{E}_{(p,q)}\big[v(\Lambda) - v(\Lambda\setminus\{\Lambda_{pq}\}) - v(\{\Lambda_{pq}\}) + v(\emptyset)\big].
$$

Figure 6: For the input image, we divide the image into grids and compute interactions at the grid level.
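The grid-level formula above can be sketched directly. In this sketch, `v(mask)` is a placeholder for the value function (e.g. the attacking utility when only the masked part of the perturbation is kept); the function name and the masking convention are our assumptions, not the paper's implementation:

```python
import numpy as np

def grid_interaction_expectation(v, delta, L):
    """Estimate E[I] over grid pairs via
    1/(L^2 - 1) * E_(p,q)[ v(all grids) - v(all grids minus grid pq)
                           - v(only grid pq) + v(empty) ].
    `v(mask)` returns the utility of the perturbation delta * mask."""
    h, w = delta.shape
    gh, gw = h // L, w // L
    full = np.ones_like(delta)
    v_full, v_empty = v(full), v(np.zeros_like(delta))
    acc = 0.0
    for p in range(L):
        for q in range(L):
            only = np.zeros_like(delta)
            only[p * gh:(p + 1) * gh, q * gw:(q + 1) * gw] = 1.0
            acc += v_full - v(full - only) - v(only) + v_empty
    # average over the L^2 grids, then apply the 1/(L^2 - 1) factor
    return acc / (L * L) / (L * L - 1)
```

For an additive utility without interactions, e.g. `v(mask) = (delta * mask).sum()`, the estimate is exactly zero, matching the intuition that additive games encode no interaction.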

J.2 SCALABILITY OF THE INTERACTION LOSS

In this section, we discuss two kinds of scalability of the interaction loss. • Is the computational cost of the interaction loss affordable when the number of players is large? We have proved in Equation (4) that the computational complexity of the expectation of the interaction is linear, which is scalable. In fact, we do not directly compute the interaction using Equation (3); instead, we compute the expectation of interactions with Equation (4). The computational cost of the IR Attack can be further reduced by calculating the grid-level interactions of images. We further conducted experiments to measure the time cost of generating perturbations with the IR Attack, running the attack for 100 steps on the ImageNet dataset. The time cost was measured using PyTorch 1.6 (Paszke et al., 2019) on Ubuntu 18.04, with an Intel(R) Core(TM) i7-9800X CPU @ 3.80GHz and a Titan RTX GPU. • Is the computational cost of the interaction loss affordable when we consider the continuous space of adversarial perturbations? It has been widely discussed (Ancona et al., 2019; Sundararajan & Najmi, 2019) that when applying the Shapley value, the feature space is regarded as binary. As Sundararajan & Najmi (2019) show, although Shapley-value-like attributions exist in a continuous space, only the Shapley value in the binary space is the unique attribution that satisfies the linearity axiom, the dummy axiom, the symmetry axiom, and the efficiency axiom. Thus, when we compute the interaction, the perturbation can be regarded as lying in a binary space, i.e., each perturbation unit is either added to the input or not, which enables scalability. K EVALUATION OF THE TRANSFERABILITY VIA LEAVE-ONE-OUT VALIDATION As Figure 7 shows, the highest transferability of the MI Attack is achieved at an intermediate step, rather than at the last step.
This phenomenon presents a challenge for fair comparisons of the transferability between different attacking methods. To enable fair comparisons of transferability between different methods, we estimate the adversarial perturbation with the highest transferability for each input image via leave-one-out (LOO) validation, as follows. Given a set of clean examples $\{(x_i, y_i)\}_{i=1}^N$, where $y_i \in \{1, 2, \ldots, C\}$, we use $x_i^t$ to denote the adversarial example at step $t$ w.r.t. the clean example $x_i$, where $t \in \{1, 2, \ldots, T\}$ and $T$ is the total number of steps. Given a target DNN $h(\cdot)$, where $h(\cdot)$ denotes the output before the softmax layer, and an input example $x$, we use $C(x) = \arg\max_k h_k(x)$, $k \in \{1, \ldots, C\}$, to denote the prediction on the example $x$.
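The LOO protocol above can be sketched as follows. Here `success[i, t]` records whether the adversarial example of image `i` at step `t` fools the target DNN; the step-selection and tie-breaking details are our assumptions, since the paper's description is truncated here:

```python
import numpy as np

def loo_best_step(success):
    """For each image i, pick the step that maximizes the transfer success
    rate over the OTHER N-1 images (leave-one-out), then report image i's
    success at that step.  Returns the resulting overall success rate."""
    N, T = success.shape
    col_sum = success.sum(axis=0)
    out = np.empty(N)
    for i in range(N):
        loo_rate = (col_sum - success[i]) / (N - 1)  # rate without image i
        t_star = int(np.argmax(loo_rate))            # best step for image i
        out[i] = success[i, t_star]
    return out.mean()
```

This avoids the bias of picking each image's own best step, while still evaluating at the intermediate step where transferability peaks.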

L ADDITIONAL RELATED WORK

Some studies paid attention to intermediate features to improve transferability. The Activation Attack (Inkawhich et al., 2019) forced the intermediate features of the input image to be similar to those of a target image, in order to generate highly transferable targeted examples. The Distribution Attack (Inkawhich et al., 2020) explicitly modeled the feature distribution of each class, and improved targeted transferability by driving the features of the perturbed input image into the distribution of a specific target class. The Intermediate Level Attack (Huang et al., 2019) improved the transferability of an adversarial example by maximizing the feature perturbation of a pre-specified layer. In comparison, we explain and improve the transferability based on game theory. Moreover, we discover the negative correlation between the transferability and interactions.

O ADDITIONAL EXPERIMENTS ON EFFECTS OF THE INTERACTION LOSS

We conducted additional experiments to test the effects of the interaction loss. We conducted attacks on two source DNNs (RN-34, DN-121), and transferred the adversarial perturbations to seven target DNNs (including VGG16, IncV3, IncV4, and IncResV2). We used the following two experimental settings to compare the transferability of adversarial perturbations generated with different $\lambda$ values. First, we re-drew the curves in Figure 3(a) by extending $\lambda$ from the range $[0, 1.2]$ to the range $[0, 2.0]$, in order to show the performance of different $\lambda$ values. We simply changed the $\lambda$ value in the objective function (i.e. Equation (5)) without any other revisions. This was the most direct way to test the effects of $\lambda$. Experimental results are shown in Figure 8. Besides the above setting, we also compared adversarial perturbations generated with different $\lambda$ values when we controlled each perturbation to have the same attacking utility, which was defined as
$$
\text{Attacking Utility} = \max_{y' \ne y} h_{y'}(x+\delta) - h_y(x+\delta),
$$
where $y$ denotes the label of the input image $x$. This setting also ensured the fairness of comparisons from a new perspective. Please see Figure 9 for experimental results. In sum, under both experimental settings, we found that a large $\lambda$ value usually yielded high adversarial transferability.
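The attacking utility defined above is a one-liner on the target DNN's pre-softmax output. A minimal sketch (the function name is ours):

```python
import numpy as np

def attacking_utility(logits, label):
    """Utility = max_{y' != y} h_{y'}(x + delta) - h_y(x + delta),
    computed from the pre-softmax output `logits` of the target DNN."""
    logits = np.asarray(logits, dtype=float)
    rival = np.delete(logits, label).max()  # best non-ground-truth logit
    return rival - logits[label]
```

A positive utility means some wrong class already outscores the ground-truth class, i.e. the attack succeeds on this example.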



We set p = 2 as setting 1, and p = 5 as setting 2. The performance of adversarial perturbations is not the key issue in this experiment; instead, we randomly set the p value to examine the trustworthiness of the negative correlation under various attacking conditions (even extreme ones). Previous studies usually set the number of steps to 10 or 20. Here, we set the number of steps to 100, together with the leave-one-out validation, for fair comparisons of different attacks. The TI Attack was designed for secured DNNs that were robustly trained via adversarial training. Thus, we applied the TI Attack to the secured models in Table 3.



Figure 2: Visualization of interactions between neighboring perturbation units generated with and without the interaction loss. The color in the visualization is computed as color[i] ∝ $\mathbb{E}_{j\in N_i}[I_{ij}(\delta)]$, where $N_i$ denotes the set of perturbation units adjacent to unit $i$. Here, we ignore interactions between non-adjacent units to simplify the visualization, because adjacent units usually encode much more significant interactions than other units. The interaction loss forces the perturbation to encode more negative interactions.

Figure 3: (a) The success rates of black-box attacks with the IR Attack using different values of λ. The success rates increased as the value of λ increased. (b) The transferability of adversarial perturbations generated by only using the interaction loss (without the classification loss). Such adversarial perturbations still exhibited moderate adversarial transferability. Points at the last epoch represent the transferability of noise perturbations as the baseline.

Figure 4: (a) Histograms of the value of the Hessian element $H_{ab}(x)$ for different values of $a, b$. (b) Histograms of the value of $\frac{g_bH_{bb}}{\sum_{a=1}^n g_aH_{ab}}$ for different values of $b$. Because the Hessian of a DNN with the ReLU activation is not well defined, we replace the ReLU activation with the Softplus activation $f(x) = \frac{1}{\beta}\log(1 + e^{\beta x})$. We train VGG-16, ResNet-32, and DenseNet-121 on the CIFAR-10 dataset (Krizhevsky et al., 2009), and use the cross-entropy loss as the classification loss.

Figure 5: Histograms of the value of E b [I ab ] w.r.t. different values of a

We verify this assumption by measuring the value of $\frac{g_bH_{bb}}{\sum_{a=1}^n g_aH_{ab}}$. If $\text{Prob}\big[\big|\frac{g_bH_{bb}}{\sum_{a=1}^n g_aH_{ab}}\big| \ll 1\big] \approx 1$, then we have $\text{Prob}[\text{sign}(A-B) = \text{sign}(A)] \approx 1$. As Figure 4(b) shows, the value of $\frac{g_bH_{bb}}{\sum_{a=1}^n g_aH_{ab}}$ is so small that $\big|\frac{g_bH_{bb}}{\sum_{a=1}^n g_aH_{ab}}\big| \ll 1$. To this end, we have $(A-B)B \ge 0$, and we get $\mathbb{E}_{a,b}[U_{ab}] \ge 0$ (24). Due to the symmetry of $a$ and $b$, we have $\mathbb{E}_{a,b}[U_{ba}] = \mathbb{E}_{a,b}[U_{ab}] \ge 0$.

), i.e.
$$
\min_\delta\; -\ell(h(x+\delta), y) + c\cdot\|\delta\|_p^p \quad \text{s.t.}\; x+\delta \in [0,1]^n,
$$
where $c \in \mathbb{R}$ is a scalar constant. In this way, we gradually change the value of $c$ as different hyper-parameters to generate different adversarial perturbations, i.e. $c_k = k\beta + c_0$, where $\beta \in \mathbb{R}$ is a constant. Moreover, to ensure that adversarial perturbations generated with different values of $c_k$ change smoothly, we use the perturbation generated with $c_{k-1}$ to initialize the perturbation for $c_k$, i.e. $\delta^{(c_k)}_{init} = \gamma\delta^{(c_{k-1})}$.
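The warm-started sweep over $c_k$ can be sketched as a simple loop. Here `attack_step(x, delta, c)` is a placeholder for one optimization step of the penalized objective; the function names and the toy step used below are our assumptions:

```python
import numpy as np

def sweep_c(attack_step, x, c0, beta, gamma, K, steps=100):
    """Generate perturbations for c_k = k*beta + c0, warm-starting each run
    from gamma * (the perturbation obtained with c_{k-1})."""
    deltas, delta = [], np.zeros_like(x)
    for k in range(K):
        c = k * beta + c0
        delta = gamma * delta          # warm start from the previous c value
        for _ in range(steps):
            delta = attack_step(x, delta, c)
        deltas.append(delta.copy())
    return deltas
```

The warm start makes consecutive perturbations change smoothly with $c_k$, which is what the sweep relies on.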

The adversarial perturbation generated by the multi-step attack is denoted by $\delta^m_{multi} = \alpha\sum_{t=0}^{m-1}\nabla_x \ell(h(x+\delta^t_{multi}), y)$. The adversarial perturbation generated by the VR Attack is denoted by $\delta^m_{vr} = \alpha\sum_{t=0}^{m-1}\nabla_x \hat\ell(h(x+\delta^t_{vr}), y)$, where $\hat\ell(h(x+\delta^t_{vr}), y) = \mathbb{E}_{\xi\sim\mathcal{N}(0,\sigma^2 I)}[\ell(h(x+\delta^t_{vr}+\xi), y)]$. Perturbation units of $\delta^m_{vr}$ tend to exhibit smaller interactions than those of $\delta^m_{multi}$, i.e. $\mathbb{E}_x\mathbb{E}_{a,b}[I_{ab}(\delta^m_{vr})] \le \mathbb{E}_x\mathbb{E}_{a,b}[I_{ab}(\delta^m_{multi})]$.
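The smoothed loss $\hat\ell$ above is typically handled by Monte-Carlo sampling of $\xi$. A minimal sketch of the smoothed-gradient estimator (the function name and the `grad_fn` placeholder are ours, not the VR Attack's exact implementation):

```python
import numpy as np

def vr_gradient(grad_fn, x, sigma, K, rng=None):
    """Monte-Carlo estimate of the gradient of the smoothed loss
    E_{xi ~ N(0, sigma^2 I)}[loss(x + xi)], using K Gaussian samples.
    `grad_fn(x)` returns the ordinary loss gradient at x."""
    rng = np.random.default_rng(0) if rng is None else rng
    g = np.zeros_like(x)
    for _ in range(K):
        xi = rng.normal(0.0, sigma, size=x.shape)
        g += grad_fn(x + xi)
    return g / K
```

With `sigma = 0` the estimator reduces to the ordinary gradient, and larger `sigma` trades gradient fidelity for variance reduction across inputs.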

Figure 7: The curve of transferability in different steps.


Figure 8: The success rates of black-box attacks with the IR Attack using different values of λ under the first experimental setting.

The success rates of $L_\infty$ and $L_2$ black-box attacks crafted on six source models, including AlexNet, VGG16, RN-34/152, and DN-121/201, against seven target models. The transferability of adversarial perturbations can be enhanced by penalizing interactions.
PGD $L_\infty$+IR: 84.0±0.5, 84.7±2.3, 88.5±0.9, 64.4±1.6, 56.9±3.1, 59.3±4.3, 49.2±1.1
PGD $L_\infty$+IR: 85.0±0.3, 84.8±0.4, 95.1±0.2, 70.3±1.7, 61.1±2.5, 62.1±2.0, 53.5±0.3

The success rates of L ∞ black-box attacks crafted on the ensemble model (RN-34+RN-152+DN-121) against nine target models.

Transferability against the secured models: the success rates of L ∞ black-box attacks crafted on RN-34 and DN-121 source models against three secured models.

The success rates of L ∞ black-box attacks crafted by different methods on four source models (RN-34/152, DN-121/201) against seven target models. Transferability of adversarial perturbations can be enhanced by penalizing interactions.

The average interaction inside adversarial perturbations generated by different attacks.



Table 6 shows the average computational cost of generating adversarial perturbations on an input image of size 224 × 224 by the IR Attack for 100 steps. It shows that the IR Attack is computationally applicable to high-dimensional data and deep neural networks.

Average computational cost of generating adversarial perturbations over an input image by the IR Attack for 100 steps on different source DNNs.

ACKNOWLEDGMENTS

All members in Shanghai Jiao Tong University, including Xin Wang, Jie Ren, Shuyun Lin, Xiangming Zhu, and Dr. Quanshi Zhang, are supported by the National Natural Science Foundation of China (61906120 and U19B2043) and Huawei Technologies. Dr. Yisen Wang is partially supported by the National Natural Science Foundation of China under Grant 62006153, and the CCF-Baidu Open Fund (OF2020002). Xin Wang is supported by the Wu Wen Jun Honorary Doctoral Scholarship, AI Institute, Shanghai Jiao Tong University.

annex

where $\alpha$ represents the step size. Furthermore, we define the update of the perturbation with the MI attack at each step $t$ as follows.
$$
\Delta x^t_{mi} \overset{\text{def}}{=} \alpha g^t_{mi}. \tag{32}
$$
In this way, the perturbation $\delta^t_{mi}$ can be written as the sum of the updates $\Delta x^{t'}_{mi}$ over the steps $t'\le t$.

Lemma 4. The update of the perturbation with the MI attack at step $t$ defined in Equation (32) can be written as
$$
\Delta x^t_{mi} = \alpha\Big[I + \alpha\frac{t-1}{2}H(x) + R^t_1(H(x))\Big]g(x) + \tilde R^t_1,
$$
where $R^t_1(H(x))$ denotes terms with elements of $H(x)$ of higher than the first order, and $\tilde R^t_1$ denotes terms with elements of $\delta^{t-1}_{mi}$ of higher than the first order.

Proof. According to Equation (31) and Equation (32), we have
$$
\Delta x^t_{mi} = \alpha\big[\mu g^{t-1}_{mi} + (1-\mu)\,g(x+\delta^{t-1}_{mi})\big], \quad \mu = \frac{t-1}{t}.
$$
Applying the Taylor series to the term $g(x+\delta^{t-1}_{mi})$, we get
$$
g(x+\delta^{t-1}_{mi}) = g(x) + H(x)\delta^{t-1}_{mi} + r^{t-1}_1, \tag{34}
$$
where $r^{t-1}_1$ denotes terms with elements of $\delta^{t-1}_{mi}$ of higher than the first order. According to Equation (32), we get $g^{t-1}_{mi} = \Delta x^{t-1}_{mi}/\alpha$; substituting this together with Equation (34) into the expression above yields the expansion of $\Delta x^t_{mi}$ given in Appendix I.

M ADDITIONAL EXPERIMENTS ON INTERACTION-REDUCED LOSS

M.1 INTERACTION REDUCTION ON OTHER ATTACKS

To further demonstrate the effectiveness of the interaction loss, we applied the interaction loss to other attacks besides the PGD Attack, including the MI Attack, the SGM Attack, and the VR Attack. More specifically, we added the interaction loss to the MI Attack (namely the MI+IR Attack), the SGM Attack (namely the SGM+IR Attack), and the VR Attack (namely the VR+IR Attack), respectively. For the MI Attack and the SGM Attack, we directly applied Equation (7) to these attacks, because these attacks are compatible with the interaction loss. For the VR Attack, the objective is to maximize the smoothed loss $\hat\ell(h(x+\delta),y) = \mathbb{E}_{\xi\sim\mathcal{N}(0,\sigma^2 I)}[\ell(h(x+\delta+\xi),y)]$. Therefore, the VR+IR Attack was implemented via sampling $\xi_1,\ldots,\xi_K$, where the interaction loss was computed by considering the input image as $x + \xi_k$, rather than $x$ in Equation (26). The VR Attack reported in Table 4 followed the original paper (Wu et al., 2018) and set K = 20. However, a crucial issue in applying the interaction loss to the VR Attack was its extremely high computational cost. Therefore, for the implementation of the VR+IR Attack, we set K = 5 and reduced the number of steps from 100 to 50. Just like the experiments in Table 1, we also used the LOO strategy for evaluation. Table 7, Table 8, and Table 9 compare the success rates of attacks with and without the interaction loss. The results demonstrate that the performance of the MI Attack, the SGM Attack, and the VR Attack can be further enhanced by directly adding the interaction loss to reduce interactions inside perturbations.

The above experiments were conducted on the ImageNet dataset (Russakovsky et al., 2015). To further demonstrate the broad applicability of such a negative correlation, we also conducted the targeted attack on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009) to test the transferability of perturbations generated with the interaction loss. Following Wu et al.
(2018), we chose three DNNs as the source DNN or the target DNN: LeNet (LeCun et al., 1998), RN-20 (He et al., 2016), and DN-121 (Huang et al., 2017). We conducted the targeted attack under the $L_\infty$ norm constraint, and chose the plane class as the target category. The norm constraint was set to 16/255, and the step size was set to 2/255. The transferability was computed based on the best adversarial perturbation during 50 steps via the leave-one-out (LOO) validation, which has been introduced in Appendix K. As Table 10 shows, the transferability could be enhanced by reducing interactions in the targeted attack on the CIFAR-10 dataset. In particular, when the source DNN was RN-20 and the target DNN was DN-121, the transferability improvement was about 30%, which is a considerable gain.

N EMPIRICAL VERIFICATION OF OTHER TRANSFERABILITY-BOOSTING ATTACKS

We have theoretically analyzed the MI Attack, the VR Attack, and the SGM Attack. However, for other methods of improving adversarial transferability, such as Diversity Input (DI) (Xie et al., 2019), which uses random data augmentation during attacking, it is difficult to mathematically prove that they essentially reduce interactions. Nevertheless, as Table 11 shows, we empirically demonstrated that two widely-used transferability-boosting attacks, DI and TI (Dong et al., 2019), also reduce interactions.

