WITCHES' BREW: INDUSTRIAL SCALE DATA POISONING VIA GRADIENT MATCHING

Abstract

Data poisoning attacks modify training data to maliciously control a model trained on such data. In this work, we focus on targeted poisoning attacks, which cause a reclassification of an unmodified test image and as such breach model integrity. We consider a particularly malicious poisoning attack that is both "from scratch" and "clean label", meaning we analyze an attack that successfully works against new, randomly initialized models, and is nearly imperceptible to humans, all while perturbing only a small fraction of the training data. Previous poisoning attacks against deep neural networks in this setting have been limited in scope and success, working only in simplified settings or being prohibitively expensive for large datasets. The central mechanism of the new attack is matching the gradient direction of malicious examples. We analyze why this works, supplement with practical considerations, and show its threat to real-world practitioners, finding that it is the first poisoning method to cause targeted misclassification in modern deep networks trained from scratch on a full-sized, poisoned ImageNet dataset. Finally, we demonstrate the limitations of existing defensive strategies against such an attack, concluding that data poisoning is a credible threat, even for large-scale deep learning systems.

1. INTRODUCTION

Machine learning models have quickly become the backbone of many applications, from photo processing on mobile devices and ad placement to security and surveillance (LeCun et al., 2015). These applications often rely on large training datasets that aggregate samples of unknown origin, and the security implications of this are not yet fully understood (Papernot, 2018). Data is often sourced in ways that let malicious outsiders contribute to the dataset, such as scraping images from the web, farming data from website users, or using large academic datasets scraped from social media (Taigman et al., 2014). Data poisoning is a security threat in which an attacker makes imperceptible changes to data that can then be disseminated through social media, user devices, or public datasets without being caught by human supervision. The goal of a poisoning attack is to modify the final model to achieve a malicious goal. In this work we focus on targeted attacks. We show that efficient poisoned data causing targeted misclassification can be created even in the setting of deep neural networks trained on large image classification tasks, such as ImageNet (Russakovsky et al., 2015). Previous work on targeted data poisoning has often focused on either linear classification tasks (Biggio et al., 2012; Xiao et al., 2015; Koh et al., 2018) or on poisoning of transfer learning and fine-tuning (Shafahi et al., 2018; Koh & Liang, 2017), rather than a full end-to-end training pipeline. Attacks on deep neural networks, especially ones trained from scratch, have proven difficult (Muñoz-González et al., 2017; Shafahi et al., 2018). Only recently were targeted attacks against neural networks trained from scratch shown to be possible (Huang et al., 2020) for CIFAR-10, albeit with costs that render scaling to larger datasets, such as ImageNet, prohibitively expensive.
We formulate targeted data poisoning as a gradient matching problem and analyze the resulting novel attack algorithm, which scales to unprecedented dataset sizes and effectiveness. Crucially, the new poisoning objective is orders of magnitude more efficient than a previous formulation based on meta-learning (Huang et al., 2020), and it succeeds more often. We conduct an experimental evaluation showing that poisoned datasets created by this method robustly compromise trained models and significantly outperform other attacks on CIFAR-10 on the benchmark of Schwarzschild et al. (2020). We then demonstrate reliably successful attacks on common ImageNet models in realistic training scenarios. For example, the attack successfully compromises a ResNet-34 by manipulating only 0.1% of the data points, with perturbations of less than 8 pixel values in ℓ∞-norm. We close by discussing previous defense strategies and showing that strong differential privacy (Abadi et al., 2016) is the only existing defense that can partially mitigate the effects of the attack.
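To make the gradient-matching idea concrete, the following is a minimal toy sketch (not the paper's implementation) using a logistic-regression model in NumPy: the attacker measures how well the mean training gradient of the clean-label poison examples aligns, in cosine similarity, with the adversarial gradient that would push the target toward the wrong label, and keeps all perturbations within an ℓ∞ ball of radius ε around the clean images. All names and the toy data here are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(w, x, y):
    # Gradient of the binary cross-entropy loss of logistic regression
    # w.r.t. its weights: dL/dw = (sigmoid(w.x) - y) * x
    return (sigmoid(w @ x) - y) * x

def matching_loss(w, target, adv_label, poisons, poison_labels):
    # Gradient-matching objective: one minus the cosine similarity between
    # the adversarial gradient on the target (evaluated with the attacker's
    # intended wrong label) and the mean training gradient over the poisons.
    # Minimizing this aligns the two gradient directions, so that training
    # on the poisons also lowers the adversarial loss on the target.
    g_adv = grad_loss(w, target, adv_label)
    g_poi = np.mean([grad_loss(w, p, y) for p, y in zip(poisons, poison_labels)],
                    axis=0)
    denom = np.linalg.norm(g_adv) * np.linalg.norm(g_poi) + 1e-12
    return 1.0 - g_adv @ g_poi / denom

def project_linf(poisons, clean, eps):
    # Constrain each poison to the l_inf ball of radius eps around its
    # clean original (e.g. eps = 8/255 for an 8-pixel-value budget).
    return np.clip(poisons, clean - eps, clean + eps)

# Toy usage: random weights, one target, three clean-label poison points.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
target = rng.normal(size=5)           # the unmodified target example
clean = rng.normal(size=(3, 5))       # clean base images the attacker perturbs
labels = np.ones(3)                   # poisons keep their correct labels
poisons = project_linf(clean + rng.normal(size=(3, 5)), clean, 8 / 255)

loss = matching_loss(w, target, adv_label=1.0,
                     poisons=poisons, poison_labels=labels)
print(round(float(loss), 4))          # value in [0, 2]; smaller = better aligned
```

In the real attack this objective is minimized over the poison pixels by gradient descent on a deep network, with `project_linf` applied after each step to enforce the threat model's perturbation budget.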

2. RELATED WORK

The task of data poisoning is closely related to the problem of adversarial attacks at test time, also referred to as evasion attacks (Szegedy et al., 2013; Madry et al., 2017), where the attacker alters a target test image to fool an already-trained model. That attack is applicable in scenarios where the attacker has control over the target image, but not over the training data. In this work we are specifically interested in targeted data poisoning attacks - attacks which aim to cause a specific target



Figure 1: The poisoning pipeline. Poisoned images (Labrador retriever class) are inserted into a dataset and cause a newly trained victim model to misclassify a target (otter) image. We show successful poisons for a threat model where 0.1% of the training data is changed within an ℓ∞ bound of ε = 8. Further visualizations of poisoned data can be found in the appendix.

