ILA-DA: IMPROVING TRANSFERABILITY OF INTERMEDIATE LEVEL ATTACK WITH DATA AUGMENTATION

Abstract

Adversarial attacks aim to generate deceptive inputs that fool a machine learning model. In deep learning, an adversarial input crafted for a specific neural network can often trick other neural networks as well. This intriguing property is known as the black-box transferability of adversarial examples. To improve black-box transferability, a previously proposed method called Intermediate Level Attack (ILA) fine-tunes an adversarial example by maximizing its perturbation on an intermediate layer of the source model. Meanwhile, it has been shown that simple image transformations can also enhance attack transferability. Based on these two observations, we propose ILA-DA, which employs three novel augmentation techniques to enhance ILA. Specifically, we propose (1) an automated way to apply effective image transformations, (2) an efficient reverse adversarial update technique, and (3) an attack interpolation method to create more transferable adversarial examples. Extensive experiments show that ILA-DA outperforms ILA and other state-of-the-art attacks by a large margin. On ImageNet, we attain an average attack success rate of 84.5%, which is 19.5% better than ILA and 4.7% better than the previous state-of-the-art across nine undefended models. For defended models, ILA-DA also leads existing attacks and provides further gains when incorporated into more advanced attack methods. The code is available at https://github.com/argenycw/ILA.

1. INTRODUCTION

Recent studies (Szegedy et al., 2013; Goodfellow et al., 2015) showed that deep neural network (DNN) models are vulnerable to adversarial attacks, where perturbations are added to clean data to fool the models into making erroneous classifications. Such adversarial perturbations are usually crafted to be almost imperceptible to humans, yet they cause apparent fluctuations in the model output. The effectiveness of adversarial attacks on deep learning models raises concerns in multiple fields, especially for security-sensitive applications. Besides being effective against the victim model, adversarial attacks are found to be capable of transferring across models (Papernot et al., 2016). One explanation for this phenomenon is the overlapping decision boundaries shared by different models (Liu et al., 2017; Dong et al., 2018). Such behavior not only aggravates concerns about the reliability and robustness of deep learning models, but also enables various black-box attacks that leverage this transferring behavior, such as directly generating attacks from a source (or surrogate) model (Zhou et al., 2018), or using transferred gradients as a prior to reduce the number of model queries (Guo et al., 2019). Intermediate Level Attack (ILA) is a method proposed by Huang et al. (2019) to fine-tune an existing adversarial attack, used as a reference, thereby raising its attack transferability across different models. Formulated to maximize the intermediate feature map discrepancy in the source model, ILA achieves remarkable black-box transferability, outperforming various directly generated attacks (Zhou et al., 2018; Xie et al., 2019a).
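The ILA objective described above can be sketched as follows. This is a toy illustration of the projection-style loss, not the authors' implementation: the function name `ila_projection_loss` is our own, and the plain Python lists stand in for real intermediate-layer feature maps obtained from a network.

```python
# A sketch of the ILA objective (Huang et al., 2019): given the intermediate
# feature maps of the clean image, the reference attack, and a candidate
# fine-tuned attack, ILA maximizes the projection of the candidate's feature
# perturbation onto the reference attack's perturbation direction.

def ila_projection_loss(feat_clean, feat_ref, feat_cand):
    """Dot product between the reference and candidate feature perturbations.

    Maximizing this over the candidate drives its intermediate-layer
    perturbation to be both large and aligned with the reference direction.
    """
    d_ref = [r - c for r, c in zip(feat_ref, feat_clean)]    # reference perturbation
    d_cand = [k - c for k, c in zip(feat_cand, feat_clean)]  # candidate perturbation
    return sum(a * b for a, b in zip(d_ref, d_cand))

# A candidate aligned with the reference scores higher than an orthogonal one.
aligned = ila_projection_loss([0.0, 0.0], [1.0, 0.0], [2.0, 0.0])
orthogonal = ila_projection_loss([0.0, 0.0], [1.0, 0.0], [0.0, 2.0])
```

In the actual attack, this loss is maximized by gradient ascent on the input image while keeping the perturbation within the allowed budget.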
On the other hand, many transfer-based attacks have empirically shown that simple image transformations, including padding (Xie et al., 2019a), can further improve attack transferability.

Our main contributions can be summarized as follows:

• We propose ILA-DA, which applies three novel augmentation techniques, namely automated data augmentation, reverse adversarial update, and attack interpolation, to considerably strengthen the current ILA attack.

• We demonstrate that ILA-DA outperforms various ILA attacks and other state-of-the-art approaches. On ImageNet, we attain an average attack success rate of 84.5%, which is 19.5% better than ILA and 4.7% better than the previous state-of-the-art across nine undefended models.

• We show that ILA-DA with simple I-FGSM attack references can exceed state-of-the-art attacks on six defended models. We also find that incorporating ILA-DA into existing attacks can further increase their attack transferability.

Figure 1: Visualization of the generated images: clean image, I-FGSM, and I-FGSM + ILA-DA (ours), with the perturbation budget ϵ = 16/255 (≈ 0.063).

White-box and black-box attacks are two common threat models in adversarial attack research. The white-box setting assumes that the attacker has access to the victim model's internal state, including its gradients, parameters, training dataset, etc. The black-box setting, on the other hand, only allows model queries without access to any further model information. While more variations exist, such as the grey-box and no-box settings, they are generally not considered in this work. Typical white-box attacks exploit the gradient of the model to generate adversarial examples. The Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) generates attacks by adding the signed gradient of the loss with respect to the image back to the image, yielding a higher loss and possibly an incorrect prediction. The Iterative Fast Gradient Sign Method (I-FGSM, also known as BIM) (Kurakin et al., 2016) applies the FGSM update iteratively with a smaller step size.
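The FGSM and I-FGSM updates described above can be illustrated with a minimal sketch. This is a toy example, not the paper's implementation: the quadratic loss with an analytic gradient stands in for backpropagation through a real network, and the names `ifgsm` and `sign` are our own.

```python
# Toy I-FGSM (Kurakin et al., 2016): repeatedly step in the direction of the
# signed loss gradient, then clip back into the L_inf ball of radius eps
# around the original input. Here loss(x) = (w*x - y)**2, so the gradient
# d(loss)/dx = 2*w*(w*x - y) is available in closed form.

def sign(v):
    return (v > 0) - (v < 0)

def ifgsm(x, w, y, eps=0.1, alpha=0.02, steps=10):
    """Iteratively ascend the loss, projecting back into the eps-ball."""
    x_adv = x
    for _ in range(steps):
        grad = 2 * w * (w * x_adv - y)       # analytic gradient of the toy loss
        x_adv = x_adv + alpha * sign(grad)   # FGSM step on the gradient sign
        x_adv = max(x - eps, min(x + eps, x_adv))  # clip to the L_inf budget
    return x_adv

x, w, y = 0.5, 3.0, 1.0
loss = lambda v: (w * v - y) ** 2
x_adv = ifgsm(x, w, y)
```

Setting `steps=1` with `alpha=eps` recovers single-step FGSM; the iterative variant typically finds a higher-loss point inside the same budget.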


