IDENTIFICATION OF THE ADVERSARY FROM A SINGLE ADVERSARIAL EXAMPLE

Abstract

Deep neural networks have been shown to be vulnerable to adversarial examples. Although many defence methods have been proposed to enhance robustness, we are still a long way from an attack-free, trustworthy machine learning system. In this paper, instead of enhancing robustness, we take the investigator's perspective and propose a new framework for tracing the first compromised model in the manner of a forensic investigation. Specifically, we focus on the following setting: a machine learning service provider supplies models to a set of customers, one of whom conducts adversarial attacks to fool the system. The investigator's objective is therefore to identify the first compromised model by collecting and analyzing evidence from the available adversarial examples alone. To make this tracing viable, we design a random-mask watermarking mechanism that differentiates adversarial examples generated from different models. First, we propose a tracing approach for the data-limited case, where the original example is also available. Then, we design a data-free approach that identifies the adversary without access to the original example. Finally, the effectiveness of the proposed framework is evaluated by extensive experiments across different model architectures, adversarial attacks, and datasets.

1. INTRODUCTION

It has been shown recently that machine learning algorithms, especially deep neural networks, are vulnerable to adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015). That is, given a victim neural network and a correctly classified example, an adversarial attack computes a small perturbation such that the example is misclassified once the perturbation is added. To enhance robustness against such attacks, many defence strategies have been proposed (Madry et al., 2018; Zhang et al., 2019; Cheng et al., 2020a). However, they suffer from poor scalability, limited generalization to unseen attacks, and trade-offs with accuracy on clean data, which makes robust models hard to deploy in practice. In this paper, we therefore turn our focus to the aftermath of adversarial attacks and take a forensic-investigation view: identifying the first compromised model used to generate the attack. We show that, given only a single adversarial example, we can trace the source model on which the adversary based the attack. As shown in Figure 1, we consider the following setting: a Machine Learning as a Service (MLaaS) provider supplies models to a set of customers. To support time-sensitive applications such as autopilot systems, the models are deployed locally to each customer. The model architecture and weights are encrypted and hidden from the customers for intellectual-property (IP) protection and ease of maintenance; in other words, each customer can access only the input and output of the provided model, not its internal configuration. The service provider, on the other hand, has full access to every detail of its models, including the training procedure, model architecture, and hyperparameters.
However, there may exist a malicious user who aims to fool the system by conducting adversarial attacks and profiting from the generated adversarial examples. Since the models are trained on the same dataset for the same objective, adversarial examples generated by the adversary transfer to the other users' models with very high probability (100% if the models are identical). It is therefore critical for the interested party to investigate and trace the malicious user by identifying the compromised model. Taking the autopilot system of a self-driving car as an example, a malicious user could craft an adversarial attack on a road sign by querying his own vehicle's model and then create an adversarial sticker that fools other vehicles using the same detection system. Given only adversarial examples as evidence, tracing is possible only if adversarial examples generated by different models are unique, so that we can find the source model and ultimately trace the malicious user. To achieve this goal, we design a random-mask watermarking strategy that embeds a watermark into the generated adversarial examples without sacrificing model performance. The strategy is also efficient and scalable, requiring only a few iterations of fine-tuning. When the original example is available, we propose a high-accuracy tracing method that compares the adversarial perturbation against each model's mask pattern and examines the adversarial example's output distribution across the different models. Because it is not always practical to have the original example as a reference, in the second part we further consider the most challenging yet practical setting, where only the adversarial example is available to the investigator.
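To make the idea concrete, below is a minimal numpy sketch of one *possible* realization of random-mask watermarking and data-limited tracing, not the exact scheme developed in this paper: each model is assigned a secret random binary mask, masked pixels are assumed to be ignored at inference (so gradient-based attacks place no perturbation there), and the tracer picks the model whose mask region contains the least perturbation energy. The function name, the 30% mask density, and the toy image size are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8                                  # toy image size
n_models = 5

# Each customer model is assigned a secret random binary mask
# (~30% of pixels); assume masked pixels are ignored at inference,
# so gradient-based attacks place no perturbation there.
masks = [rng.random((H, W)) < 0.3 for _ in range(n_models)]

def trace_data_limited(x, x_adv, masks):
    """With the original example available, score each model by the
    perturbation energy inside its mask; the compromised model
    should have the least energy there."""
    delta = np.abs(x_adv - x)
    scores = [delta[m].sum() for m in masks]
    return int(np.argmin(scores))

# Simulate an attack crafted through model 2: the perturbation
# vanishes exactly on model 2's masked pixels.
x = rng.random((H, W))
delta = rng.random((H, W)) * 0.03
delta[masks[2]] = 0.0
x_adv = x + delta

print(trace_data_limited(x, x_adv, masks))  # → 2
```

In this toy setup the compromised model's score is exactly zero, so the argmin recovers it; with real models the perturbation only concentrates away from the mask statistically, which is why the paper's method also compares the adversarial example's output distribution across models.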
Observing that a model's probability predictions on the same adversarial example change significantly when a different watermark is applied, we derive an effective rule for finding the compromised model. Specifically, based on the property that adversarial examples are not robust to noise, we redesign the tracing metric around the change in the predicted probabilities under different watermarks, a change we expect the compromised model to minimize. Comprehensive experiments are conducted on multiple adversarial attacks and datasets. With only a single adversarial example available, the two proposed methods successfully trace the suspect model with over 74% accuracy on average in both the data-limited and data-free cases. The tracing accuracy increases significantly to around 97% when two adversarial examples are available.
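One simplified way to picture the data-free rule is the following toy sketch. It assumes a hypothetical mask-and-ignore watermark (each model zeroes its secret mask pixels before predicting) and linear softmax stand-ins for the deployed classifiers; under those assumptions, the compromised model's probabilities barely move when its own watermark is applied, because the perturbation crafted through it already avoids those pixels. All names and constants here are illustrative, and this is not the paper's actual metric.

```python
import numpy as np

rng = np.random.default_rng(1)
D, n_models, n_classes = 64, 5, 3          # toy input size / fleet size

masks = [rng.random(D) < 0.3 for _ in range(n_models)]
# Stand-ins for the deployed classifiers: linear softmax heads.
weights = [rng.normal(scale=0.3, size=(n_classes, D)) for _ in range(n_models)]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(x, mask, w):
    # Assumed watermark mechanism: zero the secret mask pixels first.
    return softmax(w @ np.where(mask, 0.0, x))

def trace_data_free(x_adv, masks, weights):
    """Score each model by how far its probabilities move when its own
    watermark is applied vs. not applied; the compromised model, whose
    perturbation already avoids its own mask, should change least."""
    scores = []
    for m, w in zip(masks, weights):
        plain = softmax(w @ x_adv)          # no watermark applied
        scores.append(np.abs(predict(x_adv, m, w) - plain).sum())
    return int(np.argmin(scores))

# Simulate: a faint clean signal plus a perturbation crafted against
# model 2, i.e. zero wherever model 2's mask would discard it.
x = rng.normal(scale=1e-3, size=D)
delta = rng.normal(scale=0.5, size=D)
delta[masks[2]] = 0.0
x_adv = x + delta

print(trace_data_free(x_adv, masks, weights))  # → 2
```

The linear toy only makes the stated rule concrete; the paper's metric operates on the real models' predicted probabilities under different watermarks.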

Our contributions are summarized below:

• To the best of our knowledge, we are the first to propose a scalable framework that traces the compromised model using only a single sample and its corresponding adversarial example.

• In the absence of the samples used to generate the adversarial examples, we further exploit the prediction differences of each model under different watermarks to identify the adversary without any requirement on the original sample.

• Extensive experiments demonstrate the effectiveness of the proposed framework in tracing the compromised model that malicious users exploit to conduct various black-box adversarial attacks across different network architectures and datasets. We show that the adversary can be traced with high accuracy in different scenarios and that the proposed framework offers good scalability and efficiency.

2. RELATED WORK

Adversarial Attack Since the discovery of adversarial examples (Szegedy et al., 2014), many attack methods have been proposed. Roughly speaking, based on the level of information accessible to the attacker, adversarial attacks can be divided into white-box and black-box settings. In the white-box setting, the adversary has complete knowledge of the targeted model, including its architecture and parameters, so the adversarial objective can be solved by gradient computation through back-propagation (Goodfellow et al., 2015; Kurakin et al., 2017; Madry et al., 2018; Carlini & Wagner, 2017). The black-box setting, in contrast, has drawn much attention recently: the attacker can only query the model and has no direct access to any internal information. Based on whether the model feedback includes probability outputs, the attacks could be soft-



Figure 1: Illustration of the threat model in owner-customer distribution setting where the attacker conducts adversarial attacks on the assigned model 2 and uses the generated adversarial examples to attack other users' models.
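As a concrete instance of the white-box, gradient-based attacks cited above, the sketch below applies an FGSM-style signed-gradient step (Goodfellow et al., 2015) to a toy linear softmax classifier. The analytic gradient stands in for back-propagation, and the model, label choice, and epsilon are illustrative assumptions rather than a faithful attack implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# White-box toy target: a linear softmax classifier whose weights W
# are fully known to the attacker, as assumed in the white-box setting.
W = rng.normal(size=(3, 16))
x = rng.normal(size=16)
y = int(np.argmax(W @ x))                  # current (assumed correct) label

# FGSM-style step: move in the direction of the sign of the loss
# gradient. For cross-entropy on a linear model,
# dL/dx = W.T @ (softmax(Wx) - onehot(y)), so no autodiff is needed.
p = softmax(W @ x)
grad = W.T @ (p - np.eye(3)[y])
eps = 0.5                                  # illustrative, not tuned
x_adv = x + eps * np.sign(grad)

print(np.argmax(W @ x_adv), "vs original label", y)
```

PGD (Madry et al., 2018) iterates this step with projection onto the epsilon-ball; black-box attacks must instead estimate the gradient, or its sign, from model queries.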

