IDENTIFICATION OF THE ADVERSARY FROM A SINGLE ADVERSARIAL EXAMPLE

Abstract

Deep neural networks have been shown to be vulnerable to adversarial examples. Although many defence methods have been proposed to enhance robustness, we are still far from an attack-free way to build a trustworthy machine learning system. In this paper, instead of enhancing robustness, we take the investigator's perspective and propose a new framework to trace the first compromised model in a forensic manner. Specifically, we focus on the following setting: a machine learning service provider supplies models to a set of customers, one of whom conducts adversarial attacks to fool the system. The investigator's objective is therefore to identify the first compromised model by collecting and analyzing evidence from the available adversarial examples alone. To make tracing viable, we design a random-mask watermarking mechanism that differentiates adversarial examples generated from different models. We first propose a tracing approach for the data-limited case, where the original example is also available, and then design a data-free approach that identifies the adversary without access to the original example. Finally, we evaluate the effectiveness of the proposed framework through extensive experiments with different model architectures, adversarial attacks, and datasets.

1. INTRODUCTION

It has recently been shown that machine learning algorithms, especially deep neural networks, are vulnerable to adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015). That is, given a victim neural network and a correctly classified example, an adversarial attack computes a small perturbation such that the perturbed example is misclassified. To enhance robustness against such attacks, many defence strategies have been proposed (Madry et al., 2018; Zhang et al., 2019; Cheng et al., 2020a). However, they scale and generalize poorly to unseen attacks and trade off test accuracy on clean data, making robust models hard to deploy in practice. In this paper, we therefore turn our focus to the aftermath of adversarial attacks, conducting a forensic investigation to identify the first compromised model, i.e., the model used to generate the attack. We show that, given only a single adversarial example, we can trace the source model on which the adversary based the attack. As shown in Figure 1, we consider the following setting: a Machine Learning as a Service (MLaaS) provider supplies models to a set of customers. To support time-sensitive applications such as autopilot systems, the models are deployed locally to each customer. The model architecture and weights are encrypted and hidden from the customers for intellectual property (IP) protection and ease of maintenance. In other



Figure 1: Illustration of the threat model in the owner-customer distribution setting, where the attacker conducts adversarial attacks on its assigned model (model 2) and uses the generated adversarial examples to attack other users' models.
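To make the attack setting above concrete, the following is a minimal sketch of one gradient-based attack, the Fast Gradient Sign Method (FGSM) of Goodfellow et al. (2015), applied to a toy logistic-regression "model". The weights, input, and perturbation budget `eps` are illustrative values chosen for this sketch, not part of the paper's setup.

```python
import numpy as np

def fgsm_attack(x, y, w, b, eps):
    """FGSM on a logistic-regression model p(y=1|x) = sigmoid(w.x + b).

    The gradient of the cross-entropy loss w.r.t. the input x is
    (sigmoid(w.x + b) - y) * w; FGSM takes one step of size eps in the
    sign direction of that gradient to maximize the loss."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # model's predicted probability
    grad_x = (p - y) * w                    # dLoss/dx for the logistic loss
    return x + eps * np.sign(grad_x)        # perturbed (adversarial) input

# Toy victim model: predict class 1 iff w.x + b > 0 (illustrative weights).
w = np.array([1.0, -2.0])
b = 0.0
x = np.array([0.5, 0.1])  # clean input, correctly classified as y = 1
y = 1.0

x_adv = fgsm_attack(x, y, w, b, eps=0.4)
```

Here the clean input satisfies `w @ x + b = 0.3 > 0` (correct), while the perturbed input gives `w @ x_adv + b = -0.9 < 0` (misclassified), even though each coordinate moved by at most `eps`. This small-perturbation, label-flipping behaviour is exactly what the tracing framework must attribute to a source model.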

