HIDDEN POISON: MACHINE UNLEARNING ENABLES CAMOUFLAGED POISONING ATTACKS

Abstract

We introduce camouflaged data poisoning attacks, a new attack vector that arises in the context of machine unlearning and other settings where model retraining may be induced. An adversary first adds a few carefully crafted points to the training dataset such that the impact on the model's predictions is minimal. The adversary subsequently triggers a request to remove a subset of the introduced points, at which point the attack is unleashed and the model's predictions are negatively affected. In particular, we consider clean-label targeted attacks (in which the goal is to cause the model to misclassify a specific test point) on datasets including CIFAR-10, Imagenette, and Imagewoof. The attack is realized by constructing camouflage datapoints that mask the effect of a poisoned dataset.

1. INTRODUCTION

Machine Learning (ML) research traditionally assumes a static pipeline: data is gathered, a model is trained once, and the model is subsequently deployed. This paradigm has been challenged by practical deployments, which are more dynamic in nature. After initial deployment, more data may be collected, necessitating additional training. Or, as in the machine unlearning setting (Cao & Yang, 2015), we may need to produce a model as if certain points were never in the training set to begin with.1

While such dynamic settings clearly increase the applicability of ML models, they also make them more vulnerable. Specifically, they open models up to new methods of attack by malicious actors aiming to sabotage the model. In this work, we introduce a new type of data poisoning attack on models that unlearn training datapoints, which we call camouflaged data poisoning attacks. The attack takes place in two phases. In the first phase, before the model is trained, the attacker adds a set of carefully designed points to the training data, consisting of a poison set and a camouflage set. The model's behavior should be similar whether it is trained on the training data alone or on its augmentation with both the poison and camouflage sets. In the second phase, after the model is trained, the attacker triggers an unlearning request to delete the camouflage set. That is, the model must be updated to behave as though it were trained only on the training set plus the poison set. At this point, the attack is fully realized, and the model's performance suffers in some way. While such an attack could harm the model by several metrics, in this paper we focus on targeted poisoning attacks, that is, poisoning attacks whose goal is to cause the model to misclassify one particular test point.

Our contributions are the following:

1. We introduce camouflaged data poisoning attacks, demonstrating a new attack vector in dynamic settings including machine unlearning.

2. We realize these attacks in the targeted poisoning setting, giving an algorithm based on the gradient-matching approach of Geiping et al. (2021). To make the model behave as if the poison set were absent, we construct the camouflage set by generating a new set of points that undoes the impact of the poison set, an idea which may be of broader interest to the data poisoning community.

3. We demonstrate the efficacy of these attacks on a variety of models (SVMs and neural networks) and datasets (CIFAR-10 (Krizhevsky, 2009), Imagenette (Howard, 2019), and Imagewoof (Howard, 2019)).
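To make the gradient-matching idea concrete, the toy NumPy sketch below illustrates it on a logistic-regression surrogate: a poison perturbation, constrained to a small box (the clean-label constraint), is optimized so that the training gradient it induces aligns with the adversarial gradient at the target point. This is a simplified illustration under our own assumptions, not the paper's implementation; in particular, representing the camouflage objective as matching the *negated* adversarial gradient (noted in a comment) is our simplification of "undoing the poison's impact", and a real attack would backpropagate through the model rather than use finite differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad(w, x, y):
    # Gradient of the logistic loss on example (x, y) with respect to w.
    return (sigmoid(w @ x) - y) * x

def cosine_mismatch(g1, g2):
    # 1 - cosine similarity: small when g1 points in the same direction as g2.
    return 1.0 - (g1 @ g2) / (np.linalg.norm(g1) * np.linalg.norm(g2) + 1e-12)

# Toy surrogate model and data (dimensions and values are illustrative).
rng = np.random.default_rng(0)
w = rng.normal(size=5)          # surrogate model weights
x_target = rng.normal(size=5)   # target test point
y_adv = 1.0                     # adversarial label the attacker wants for it
x_poison = rng.normal(size=5)   # clean base point the attacker may perturb
y_p = 0.0                       # its true (clean) label

# The attacker wants training on the poison point to push w in this direction;
# a camouflage point would analogously be matched against -g_adv to cancel it.
g_adv = loss_grad(w, x_target, y_adv)

init_mismatch = cosine_mismatch(loss_grad(w, x_poison, y_p), g_adv)

# Optimize the perturbation delta by descending the matching loss, using
# finite differences for clarity (a real attack would backpropagate instead).
delta = np.zeros(5)
eps, lr = 1e-5, 0.05
for _ in range(500):
    base = cosine_mismatch(loss_grad(w, x_poison + delta, y_p), g_adv)
    grad_delta = np.zeros(5)
    for i in range(5):
        d = delta.copy()
        d[i] += eps
        grad_delta[i] = (
            cosine_mismatch(loss_grad(w, x_poison + d, y_p), g_adv) - base
        ) / eps
    delta -= lr * grad_delta
    delta = np.clip(delta, -0.5, 0.5)   # small-perturbation (clean-label) box

final_mismatch = cosine_mismatch(loss_grad(w, x_poison + delta, y_p), g_adv)
```

After optimization, `final_mismatch` is smaller than `init_mismatch`: the poisoned point's training gradient has been rotated toward the adversarial gradient while the perturbation stays inside the box constraint.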



1 A naive solution is to remove said points from the training set and re-train the model from scratch.

