HIDDEN POISON: MACHINE UNLEARNING ENABLES CAMOUFLAGED POISONING ATTACKS

Abstract

We introduce camouflaged data poisoning attacks, a new attack vector that arises in the context of machine unlearning and other settings where model retraining may be induced. An adversary first adds a few carefully crafted points to the training dataset such that their impact on the model's predictions is minimal. The adversary subsequently triggers a request to remove a subset of the introduced points, at which point the attack is unleashed and the model's predictions are negatively affected. In particular, we consider clean-label targeted attacks (in which the goal is to cause the model to misclassify a specific test point) on datasets including CIFAR-10, Imagenette, and Imagewoof. This attack is realized by constructing camouflage datapoints that mask the effect of a poisoned dataset.

1. INTRODUCTION

Machine Learning (ML) research traditionally assumes a static pipeline: data is gathered, and a model is trained once and subsequently deployed. This paradigm has been challenged by practical deployments, which are more dynamic in nature. After initial deployment, more data may be collected, necessitating additional training. Or, as in the machine unlearning setting (Cao & Yang, 2015), we may need to produce a model as if certain points had never been in the training set to begin with. While such dynamic settings clearly increase the applicability of ML models, they also make them more vulnerable: they open models up to new methods of attack by malicious actors aiming to sabotage the model.

In this work, we introduce a new type of data poisoning attack on models that unlearn training datapoints, which we call camouflaged data poisoning attacks. The attack takes place in two phases. In the first phase, before the model is trained, the attacker adds a set of carefully designed points to the training data, consisting of a poison set and a camouflage set. The model's behavior should be similar whether it is trained on the training data alone or on its augmentation with both the poison and camouflage sets. In the second phase, after the model is trained, the attacker triggers an unlearning request to delete the camouflage set. That is, the model must be updated to behave as though it were trained on only the training set plus the poison set. At this point, the attack is fully realized, and the model's performance suffers in some way. While such an attack could harm the model by several metrics, in this paper we focus on targeted poisoning attacks, that is, poisoning attacks whose goal is to cause the model to misclassify one particular test point. Our contributions are the following:

1. We introduce camouflaged data poisoning attacks, demonstrating a new attack vector in dynamic settings including machine unlearning.
2. We realize these attacks in the targeted poisoning setting, giving an algorithm based on the gradient-matching approach of Geiping et al. (2021). In order to make the model's behavior comparable to that of a model trained without the poison set, we construct the camouflage set by generating a new set of points that undoes the impact of the poison set, an idea which may be of broader interest to the data poisoning community.
3. We demonstrate the efficacy of these attacks on a variety of models (SVMs and neural networks) and datasets (CIFAR-10 (Krizhevsky, 2009), Imagenette (Howard, 2019), and Imagewoof (Howard, 2019)).

1.1 PRELIMINARIES

Machine Unlearning. A significant amount of legislation concerning the "right to be forgotten" has recently been introduced by governments around the world, including the European Union's General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and Canada's proposed Consumer Privacy Protection Act (CPPA). Such legislation requires organizations to delete information they have collected about a user upon request. A natural question is whether this further obligates organizations to remove that information from downstream machine learning models trained on the data; current guidance (Information Commissioner's Office, 2020) and precedents (Federal Trade Commission, 2021) indicate that this may be the case. This goal has sparked a recent line of work on machine unlearning (Cao & Yang, 2015).
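The gradient-matching objective underlying contribution 2 (Geiping et al., 2021) aligns the training gradient of the crafted points with the adversarial gradient on the target. The following is a minimal sketch using a logistic-regression surrogate model; the function names and the surrogate choice are ours, not the paper's.

```python
import numpy as np

def grad_logistic(w, X, y):
    """Gradient of the mean logistic loss w.r.t. weights w, labels y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def matching_loss(w, X_poison, y_poison, x_target, y_adv):
    """Negative cosine similarity between the poison-batch gradient and the
    adversarial gradient on the target point (a gradient-matching objective
    in the style of Geiping et al., 2021). Minimizing this over the poison
    perturbations steers training toward misclassifying the target as y_adv."""
    g_poison = grad_logistic(w, X_poison, y_poison)
    g_target = grad_logistic(w, x_target[None, :], np.array([y_adv]))
    denom = np.linalg.norm(g_poison) * np.linalg.norm(g_target) + 1e-12
    return -(g_poison @ g_target) / denom
```

Under this view, the camouflage set can be crafted with an analogous objective so that the combined gradient of poisons plus camouflages approximately cancels, leaving the trained model's behavior close to the clean one until the camouflages are unlearned.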



A naive solution is to remove the requested points from the training set and retrain the model from scratch.
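This naive, exact form of unlearning can be sketched as follows; the centroid classifier here is a deliberately trivial stand-in for an arbitrary training procedure.

```python
import numpy as np

def train(X, y):
    """Toy training procedure: per-class mean (centroid) classifier,
    used only to illustrate retraining-based unlearning."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def unlearn_exact(X, y, forget_idx):
    """Exact unlearning by retraining from scratch: drop the requested rows
    and retrain. The result is identical to a model that never saw them."""
    keep = np.setdiff1d(np.arange(len(y)), forget_idx)
    return train(X[keep], y[keep])
```

The drawback, of course, is that retraining from scratch must be repeated for every deletion request, which motivates the more efficient unlearning schemes studied in this line of work.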



Figure 1: An illustration of a successful camouflaged targeted data poisoning attack. In Step 1, the adversary adds poison and camouflage sets of points to the (clean) training data. In Step 2, the model is trained on the augmented training dataset. It should behave similarly to a model trained on only the clean data; in particular, it should correctly classify the targeted point. In Step 3, the adversary triggers an unlearning request to delete the camouflage set from the trained model. In Step 4, the resulting model misclassifies the targeted point.
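The four steps of Figure 1 can be sketched as the following driver. Every helper passed in (`craft_poisons`, `craft_camouflages`, `train`, `unlearn`) is a hypothetical placeholder for the paper's gradient-matching and unlearning procedures, not a real implementation.

```python
def camouflaged_attack(clean_data, target, y_adv, craft_poisons,
                       craft_camouflages, train, unlearn):
    """Driver for the attack timeline of Figure 1; all helpers are stubs."""
    # Step 1: the adversary crafts poison and camouflage sets.
    poisons = craft_poisons(clean_data, target, y_adv)
    camouflages = craft_camouflages(clean_data, poisons)
    # Step 2: the victim trains on clean + poison + camouflage data; the
    # camouflages mask the poisons, so the target is classified correctly.
    model = train(clean_data + poisons + camouflages)
    # Step 3: the adversary requests unlearning of the camouflage set only.
    model = unlearn(model, camouflages)
    # Step 4: with the camouflages removed, the poisons take effect and the
    # target point is misclassified by the returned model.
    return model
```

The key asymmetry exploited here is that the unlearning request names only the camouflage set, leaving the poison set behind in the effective training data.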

Figure 2: Some representative images from Imagewoof. In each pair, the left image is from the training dataset, while the right image has been adversarially manipulated. The top and bottom rows contain images from the poison and camouflage sets, respectively. In all cases, the manipulated images are clean-label and nearly indistinguishable from the originals.

