JUST HOW TOXIC IS DATA POISONING? A BENCHMARK FOR BACKDOOR AND DATA POISONING ATTACKS

Abstract

Data poisoning and backdoor attacks manipulate training data in order to cause models to fail during inference. A recent survey of industry practitioners found that data poisoning is the number one concern among threats ranging from model stealing to adversarial attacks. However, we find that the impressive performance reported for data poisoning attacks is, in large part, an artifact of inconsistent experimental design. Moreover, we find that existing poisoning methods have been tested in contrived scenarios, and many fail in more realistic settings. In order to promote fair comparison in future work, we develop standardized benchmarks for data poisoning and backdoor attacks.

1. INTRODUCTION

Data poisoning is a security threat to machine learning systems in which an attacker controls the behavior of a system by manipulating its training data. This class of threats is particularly germane to deep learning systems because they require large amounts of data to train and are therefore often trained (or pre-trained) on large datasets scraped from the web. For example, the Open Images and the Amazon Products datasets contain approximately 9 million and 233 million samples, respectively, scraped from a wide range of potentially insecure, and in many cases unknown, sources (Kuznetsova et al., 2020; Ni, 2018). At this scale, it is often infeasible to properly vet content. Furthermore, many practitioners create datasets by harvesting system inputs (e.g., emails received, files uploaded) or scraping user-created content (e.g., profiles, text messages, advertisements) without any mechanism to bar malicious actors from contributing data. The dependence of industrial AI systems on datasets that are not manually inspected has led to fear that corrupted training data could produce faulty models (Jiang et al., 2017). In fact, a recent survey of 28 industry organizations found that these companies are significantly more afraid of data poisoning than of other threats from adversarial machine learning (Kumar et al., 2020).

A spectrum of poisoning attacks exists in the literature. Backdoor data poisoning causes a model to misclassify test-time samples that contain a trigger: a visual feature in images or a particular character sequence in the natural language setting (Chen et al., 2017; Dai et al., 2019; Saha et al., 2019; Turner et al., 2018). For example, an attacker might tamper with training images so that a vision system fails to identify any person wearing a shirt with the trigger symbol printed on it. In this threat model, the attacker modifies data both at train time (by placing poisons) and at inference time (by inserting the trigger).
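The train-time/inference-time structure of a backdoor attack can be made concrete with a minimal sketch. The `stamp_trigger` helper and the toy image below are our own illustration, not an implementation from any of the cited attacks; real triggers are crafted far more carefully.

```python
# Illustrative sketch: a backdoor attacker stamps the same small trigger
# patch onto images twice -- once when poisoning the training set, and
# again on the victim's input at inference time. Images here are simple
# row-major lists of grayscale intensities in [0, 1].

def stamp_trigger(image, patch, row, col):
    """Return a copy of `image` with `patch` pasted at position (row, col)."""
    stamped = [r[:] for r in image]  # copy each row so the original is untouched
    for i, patch_row in enumerate(patch):
        for j, value in enumerate(patch_row):
            stamped[row + i][col + j] = value
    return stamped

# A 4x4 "image" and a 2x2 white trigger patch in the lower-right corner.
image = [[0.1] * 4 for _ in range(4)]
trigger = [[1.0, 1.0], [1.0, 1.0]]

poisoned_train_example = stamp_trigger(image, trigger, 2, 2)  # train time
triggered_test_input = stamp_trigger(image, trigger, 2, 2)    # inference time
```

Anything carrying the patch at inference time is then misclassified by the backdoored model, which is what distinguishes this threat model from triggerless poisoning.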
Triggerless poisoning attacks, on the other hand, do not require modification at inference time (Biggio et al., 2012; Huang et al., 2020; Muñoz-González et al., 2017; Shafahi et al., 2018; Zhu et al., 2019; Aghakhani et al., 2020b; Geiping et al., 2020). A variety of innovative backdoor and triggerless poisoning attacks, as well as defenses, have emerged in recent years, but inconsistent and perfunctory experimentation has rendered performance evaluations and comparisons misleading.

In this paper, we develop a framework for benchmarking and evaluating a wide range of poisoning attacks on image classifiers. Specifically, we provide a way to compare attack strategies and shed light on the differences between them. Our goal is to address the following weaknesses in the current literature. First, we observe that the reported success of poisoning attacks is often dependent on specific (and sometimes unrealistic) choices of network architecture and training protocol, making it difficult to assess the viability of attacks in real-world scenarios. Second, we find that the percentage of training data an attacker can modify, the standard budget measure in the poisoning literature, is not a useful metric for comparison: even with a fixed percentage of the dataset poisoned, the success rate of an attack can depend strongly on the dataset size, which has not been standardized across experiments to date. Third, we find that some attacks that claim to be "clean label," meaning that poisoned data still appears natural and properly labeled upon human inspection, are not.

Our proposed benchmarks measure the effectiveness of attacks in standardized scenarios using modern network architectures. We benchmark from-scratch training scenarios as well as white-box and black-box transfer learning settings. We also constrain poisoned images to be clean in the sense of small perturbations.
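As a toy illustration of the second weakness (the numbers below are our own, not from any of the cited experiments), a fixed poisoning percentage corresponds to very different absolute attacker capabilities depending on dataset size:

```python
# A fixed "percent poisoned" budget does not pin down the attacker's power:
# the absolute number of poisoned examples scales with the dataset size,
# which varies widely across published experiments.

def poison_count(dataset_size, budget_fraction):
    """Number of training examples a given fractional budget allows."""
    return round(dataset_size * budget_fraction)

# The same 1% budget at two dataset scales:
small = poison_count(50_000, 0.01)      # 500 poisons on a CIFAR-10-sized set
large = poison_count(9_000_000, 0.01)   # 90,000 poisons on a web-scale set
```

An attacker who may craft 90,000 coordinated poisons is plainly in a different position than one limited to 500, even though both satisfy the same "1% budget," which is why our benchmark fixes dataset sizes rather than percentages alone.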
Furthermore, our benchmarks are publicly available as a proving ground for existing and future data poisoning attacks.

The data poisoning literature contains attacks in a variety of settings including image classification, facial recognition, and text classification (Shafahi et al., 2018; Chen et al., 2017; Dai et al., 2019). Attacks on the fairness of models, on speech recognition, and on recommendation engines have also been developed (Solans et al., 2020; Aghakhani et al., 2020a; Li et al., 2016; Fang et al., 2018; Hu et al., 2019; Fang et al., 2020). While we acknowledge the merits of studying poisoning in a range of modalities, our benchmark focuses on image classification since it is by far the most common setting in the existing literature.

2. A SYNOPSIS OF TRIGGERLESS AND BACKDOOR DATA POISONING

Early poisoning attacks targeted support vector machines and simple neural networks (Biggio et al., 2012; Koh & Liang, 2017). As poisoning gained popularity, various strategies for triggerless attacks on deep architectures emerged (Muñoz-González et al., 2017; Shafahi et al., 2018; Zhu et al., 2019; Huang et al., 2020; Aghakhani et al., 2020b; Geiping et al., 2020). The earliest backdoor attacks placed visible triggers in the poisoned data and in some cases changed labels, and were thus not clean-label (Chen et al., 2017; Gu et al., 2017; Liu et al., 2017). However, methods that produce poisoned examples which do not visibly contain a trigger have also shown positive results (Chen et al., 2017; Turner et al., 2018; Saha et al., 2019). Poisoning attacks have in turn precipitated several defense strategies, though sanitization-based defenses may be overwhelmed by some attacks (Koh et al., 2018; Liu et al., 2018; Chacon et al., 2019; Peri et al., 2019).

We focus on attacks that achieve targeted misclassification. That is, under both the triggerless and backdoor threat models, the end goal of an attacker is to cause a target sample to be misclassified as another specified class. Other objectives, such as decreasing overall test accuracy, have been studied, but less work exists on this topic with respect to neural networks (Xiao et al., 2015; Liu et al., 2019). In both triggerless and backdoor data poisoning, the clean images modified by an attacker, called base images, come from a single class, the base class. This class is often chosen to be precisely the class into which the attacker wants the target image or class to be misclassified.

There are two major differences between the triggerless and backdoor threat models in the literature. First and foremost, backdoor attacks alter their targets during inference by adding a trigger. In the works we consider, triggers take the form of small patches added to an image (Turner et al., 2018; Saha et al., 2019).
Second, the backdoor attacks we consider cause a victim to misclassify any image containing the trigger rather than a particular sample. Triggerless attacks instead cause the victim to misclassify an individual image called the target image (Shafahi et al., 2018; Zhu et al., 2019; Aghakhani et al., 2020b; Geiping et al., 2020). This second distinction between the two threat models is not essential; for example, triggerless attacks could be designed to cause the victim to misclassify a collection of images rather than a single target. To be consistent with the literature at large, we focus on triggerless attacks that target individual samples and backdoor attacks that target whole classes of images.

We focus on the clean-label backdoor attack and the hidden trigger backdoor attack, in which poisons are crafted with optimization procedures and do not contain noticeable patches (Saha et al., 2019; Turner et al., 2018). For triggerless attacks, we focus on the feature collision and convex polytope methods, the most highly cited attacks of the last two years to have appeared at prominent ML conferences (Shafahi et al., 2018; Zhu et al., 2019). We include the recent triggerless methods
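The feature collision objective of Shafahi et al. (2018) optimizes a poison to match the target's feature representation while a β-weighted penalty keeps it close to the base image. The sketch below illustrates the objective only: the 1-D linear "feature extractor" `f`, the hand-coded gradient, and all constants are toy assumptions of ours, not the paper's setup, where `f` is a deep network's penultimate layer.

```python
# Minimal sketch of the feature-collision objective:
#   minimize ||f(x) - f(t)||^2 + beta * ||x - b||^2
# where b is the base image, t is the target, and f is a fixed feature
# extractor. Here f is a toy 1-D linear map so the gradient is exact.

def f(x):
    """Toy stand-in for a fixed feature extractor."""
    return 3.0 * x

def feature_collision(b, t, beta=0.1, lr=0.01, steps=500):
    """Gradient descent on the feature-collision objective from base b."""
    x = b
    for _ in range(steps):
        # d/dx [(3x - 3t)^2] = 18(x - t);  d/dx [beta (x - b)^2] = 2 beta (x - b)
        grad = 2 * 3.0 * (f(x) - f(t)) + 2 * beta * (x - b)
        x -= lr * grad
    return x

base, target = 0.0, 1.0
poison = feature_collision(base, target)
# The poison's features collide with the target's, while the beta term
# anchors the poison itself near the base image.
```

Because the poison sits in the target's feature neighborhood but carries the base label, a model fine-tuned on it tends to classify the target into the base class, with no trigger needed at inference time.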

