A UNIFIED APPROACH TO INTERPRETING AND BOOSTING ADVERSARIAL TRANSFERABILITY

Abstract

In this paper, we use the interaction inside adversarial perturbations to explain and boost adversarial transferability. We discover and prove a negative correlation between adversarial transferability and the interaction inside adversarial perturbations. This negative correlation is further verified on different DNNs with various inputs. Moreover, it can be regarded as a unified perspective for understanding current transferability-boosting methods. To this end, we prove that some classic methods of enhancing transferability essentially decrease interactions inside adversarial perturbations. Based on this, we propose to directly penalize interactions during the attacking process, which significantly improves adversarial transferability. Our code is available online.

1. INTRODUCTION

Adversarial examples of deep neural networks (DNNs) have attracted increasing attention in recent years (Ma et al., 2018; Madry et al., 2018; Wang et al., 2019; Ilyas et al., 2019; Duan et al., 2020; Wu et al., 2020b; Ma et al., 2021). Goodfellow et al. (2014) observed that adversarial perturbations transfer across models, and used perturbations generated on a source DNN to attack other target DNNs. Although many methods have been proposed to enhance the transferability of adversarial perturbations (Dong et al., 2018; Wu et al., 2018; 2020a), why these methods improve transferability remains unclear.

This paper considers the interaction inside adversarial perturbations as a new perspective for interpreting adversarial transferability. Interactions inside adversarial perturbations are defined using the Shapley interaction index proposed in game theory (Michel & Marc, 1999; Shapley, 1953). Given an input sample $x \in \mathbb{R}^n$, the adversarial attack aims to fool the DNN by adding an imperceptible perturbation $\delta \in \mathbb{R}^n$ to $x$. Each unit in the perturbation map is termed a perturbation unit. Let $\phi_i$ denote the importance of the $i$-th perturbation unit $\delta_i$ to the attack. $\phi_i$ is implemented as the Shapley value, which will be explained later. The interaction between perturbation units $\delta_i$ and $\delta_j$ is defined as the change in the $i$-th unit's importance $\phi_i$ when the $j$-th unit is perturbed, relative to the case when the $j$-th unit is not perturbed. If the perturbation $\delta_j$ on the $j$-th unit increases the importance $\phi_i$ of the $i$-th unit, then there is a positive interaction between $\delta_i$ and $\delta_j$. If the perturbation $\delta_j$ decreases the importance $\phi_i$, there is a negative interaction.

In this paper, we discover and partially prove a clear negative correlation between the transferability and the interaction between adversarial perturbation units, i.e., adversarial perturbations with lower transferability tend to exhibit larger interactions between perturbation units. We verify this correlation through both a theoretical proof and comparative studies. Furthermore, based on this correlation, we propose to penalize interactions during attacking to improve the transferability.
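The verbal definition of the interaction above can be made concrete with a small sketch. The Python below computes exact Shapley values by enumerating subsets, then takes the interaction between units i and j as the Shapley value of i when j is always present minus the Shapley value of i when j is always absent. Everything named here is an illustrative assumption and is not taken from the paper's released code: `toy_gain` is a hypothetical stand-in for the attacking utility of a subset of perturbation units, and `weights` is made-up data. Exact enumeration is exponential in the number of units, so this is only feasible for a handful of units; a realistic attack would presumably need a sampling-based approximation.

```python
from itertools import combinations
from math import factorial


def shapley_value(value_fn, players, i):
    """Exact Shapley value of player i in the game (players, value_fn)."""
    others = [p for p in players if p != i]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        # Standard Shapley weight |S|! (n - |S| - 1)! / n!
        weight = factorial(size) * factorial(n - size - 1) / factorial(n)
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += weight * (value_fn(s | {i}) - value_fn(s))
    return total


def interaction(value_fn, players, i, j):
    """Change in the importance of unit i when unit j is present vs. absent."""
    rest = [p for p in players if p != j]
    # Game in which unit j is always perturbed (always in the coalition).
    phi_with_j = shapley_value(lambda s: value_fn(s | {j}), rest, i)
    # Game in which unit j is never perturbed.
    phi_without_j = shapley_value(value_fn, rest, i)
    return phi_with_j - phi_without_j


# Hypothetical toy utility: squared sum of per-unit weights (assumption).
weights = {0: 1.0, 1: 2.0, 2: -1.0}
players = list(weights)


def toy_gain(subset):
    return sum(weights[k] for k in subset) ** 2


i01 = interaction(toy_gain, players, 0, 1)  # positive interaction
i02 = interaction(toy_gain, players, 0, 2)  # negative interaction
print(i01, i02)
```

For this toy utility the interaction between units i and j works out to twice the product of their weights, so units 0 and 1 interact positively and units 0 and 2 interact negatively, matching the sign convention described in the text.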

