IS ADVERSARIAL TRAINING REALLY A SILVER BULLET FOR MITIGATING DATA POISONING?

Abstract

Indiscriminate data poisoning can decrease the clean test accuracy of a deep learning model by slightly perturbing its training samples. There is a consensus that such poisons can hardly harm adversarially-trained (AT) models when the adversarial training budget is no less than the poison budget, i.e., ϵ_adv ≥ ϵ_poi. This consensus, however, is challenged in this paper based on our new attack strategy that induces entangled features (EntF). The existence of entangled features makes the poisoned data less useful for training a model, regardless of whether AT is applied. We demonstrate that for attacking a CIFAR-10 AT model under a reasonable setting with ϵ_adv = ϵ_poi = 8/255, our EntF yields an accuracy drop of 13.31%, which is 7× larger than that of existing methods and equivalent to discarding 83% of the training data. We further show the generalizability of EntF to more challenging settings, e.g., higher AT budgets, partial poisoning, unseen model architectures, and stronger (ensemble or adaptive) defenses. We finally provide new insights into the distinct roles of non-robust vs. robust features in poisoning standard vs. AT models and demonstrate the possibility of using a hybrid attack to poison standard and AT models simultaneously. Our code is available at https://github.com/WenRuiUSTC/EntF.

1. INTRODUCTION

Indiscriminate data poisoning aims to degrade the overall prediction performance of a machine learning model at test time by manipulating its training data. Understanding indiscriminate data poisoning has become increasingly important as web scraping becomes a common approach to obtaining large-scale data for training advanced models (Brown et al., 2020; Dosovitskiy et al., 2021). Although slightly perturbing training samples has been shown to effectively poison deep learning models, there is a consensus that such poisons can hardly harm an adversarially-trained model when the perturbation budget in adversarial training, ϵ_adv, is no less than the poison budget, ϵ_poi, i.e., ϵ_adv ≥ ϵ_poi (Fowl et al., 2021a;b; Huang et al., 2021; Tao et al., 2021; Wang et al., 2021; Fu et al., 2022; Tao et al., 2022). In particular, Tao et al. (2021) have proved that in this setting, adversarial training can serve as a principled defense against existing poisoning methods.

However, in this paper, we challenge this consensus by rethinking data poisoning from a fundamentally new perspective. Specifically, we introduce a new poisoning approach that entangles the features of training samples from different classes. In this way, the entangled samples hardly contribute to model training regardless of whether adversarial training is applied, causing substantial performance degradation. Our attack strategy differs from existing methods, which commonly inject perturbations as shortcuts, as pointed out by Yu et al. (2022): the model then wrongly learns the shortcuts rather than the clean features, leading to low test accuracy on clean samples (Segura et al., 2022a; Evtimov et al., 2021; Yu et al., 2022). Figure 1 illustrates the working mechanism of our new poisoning approach, with a comparison to a reverse operation that instead aims to eliminate entangled features.
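The entangling idea can be sketched in miniature: rather than planting a shortcut, the attacker searches, within the L_inf poison budget ϵ_poi, for a perturbation that drags a sample's features toward the feature centroid of a different class. The snippet below is a toy illustration under our own simplifying assumptions, not the paper's actual EntF implementation: the linear feature extractor f(x) = W x, the function name `entangle_perturbation`, and the PGD step-size heuristic are all hypothetical stand-ins for a trained network's feature space.

```python
import numpy as np

def entangle_perturbation(x, W, centroid, eps=8 / 255, steps=40):
    """Toy sketch (NOT the paper's EntF): PGD-style search for an
    L_inf-bounded perturbation that pulls the linear feature
    representation f(x) = W @ x toward the feature centroid of a
    DIFFERENT class, entangling samples from distinct classes."""
    alpha = 2.5 * eps / steps              # common PGD step-size heuristic
    delta = np.zeros_like(x)
    for _ in range(steps):
        feat = W @ (x + delta)
        # gradient of ||f(x + delta) - centroid||^2 w.r.t. delta
        grad = 2.0 * W.T @ (feat - centroid)
        delta -= alpha * np.sign(grad)     # descend to SHRINK the distance
        delta = np.clip(delta, -eps, eps)  # stay within the poison budget
    return delta

# Demo: the perturbed sample's features move toward another class's centroid.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))      # stand-in linear feature extractor
x = rng.normal(size=64)            # a "training sample"
centroid = rng.normal(size=16)     # stand-in for another class's feature mean
delta = entangle_perturbation(x, W, centroid)
```

After the loop, the feature distance ||f(x + δ) − centroid|| is strictly smaller than ||f(x) − centroid||, while every coordinate of δ stays within ±ϵ_poi; this is exactly why adversarial training gains nothing, as the damage comes from uninformative (entangled) features rather than from learnable shortcuts.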
Our new approach is also inspired by the conventional, noisy label-based poisoning approach (Biggio et al., 2012; 2011; Muñoz-González et al., 2017), where entangled labels are introduced by directly flipping labels (e.g., assigning a "dog" (or "cat") label to both the "dog" and "cat" images) under a strong assumption that

