IS ADVERSARIAL TRAINING REALLY A SILVER BULLET FOR MITIGATING DATA POISONING?

Abstract

Indiscriminate data poisoning can decrease the clean test accuracy of a deep learning model by slightly perturbing its training samples. There is a consensus that such poisons can hardly harm adversarially-trained (AT) models when the adversarial training budget is no less than the poison budget, i.e., ϵ_adv ≥ ϵ_poi. This consensus, however, is challenged in this paper based on our new attack strategy that induces entangled features (EntF). The existence of entangled features makes the poisoned data less useful for training a model, regardless of whether AT is applied. We demonstrate that for attacking a CIFAR-10 AT model under a reasonable setting with ϵ_adv = ϵ_poi = 8/255, our EntF yields an accuracy drop of 13.31%, which is 7× better than existing methods and equivalent to discarding 83% of the training data. We further show the generalizability of EntF to more challenging settings, e.g., higher AT budgets, partial poisoning, unseen model architectures, and stronger (ensemble or adaptive) defenses. We finally provide new insights into the distinct roles of non-robust vs. robust features in poisoning standard vs. AT models and demonstrate the possibility of using a hybrid attack to poison standard and AT models simultaneously. Our code is available at https://github.com/WenRuiUSTC/EntF.

1. INTRODUCTION

Indiscriminate data poisoning aims to degrade the overall prediction performance of a machine learning model at test time by manipulating its training data. Understanding indiscriminate data poisoning has become increasingly important as web scraping becomes a common approach to obtaining large-scale data for training advanced models (Brown et al., 2020; Dosovitskiy et al., 2021). Although slightly perturbing training samples has been shown to effectively poison deep learning models, there is a consensus that such poisons can hardly harm an adversarially-trained model when the perturbation budget in adversarial training, ϵ_adv, is no less than the poison budget, ϵ_poi, i.e., ϵ_adv ≥ ϵ_poi (Fowl et al., 2021a; b; Huang et al., 2021; Tao et al., 2021; Wang et al., 2021; Fu et al., 2022; Tao et al., 2022). In particular, Tao et al. (2021) proved that in this setting, adversarial training can serve as a principled defense against existing poisoning methods.

In this paper, however, we challenge this consensus by rethinking data poisoning from a fundamentally new perspective. Specifically, we introduce a new poisoning approach that entangles the features of training samples from different classes. In this way, the entangled samples hardly contribute to model training, whether or not adversarial training is applied, causing substantial performance degradation.

Our attack strategy differs from existing methods, which, as pointed out by Yu et al. (2022), commonly inject perturbations as shortcuts. Such shortcuts cause the model to wrongly learn the injected perturbations rather than the clean features, leading to low test accuracy on clean samples (Segura et al., 2022a; Evtimov et al., 2021; Yu et al., 2022). Figure 1 illustrates the working mechanism of our new poisoning approach, with a comparison to a reverse operation that instead aims to eliminate entangled features.
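The shortcut mechanism described above can be illustrated with a deliberately simplified toy experiment (not from the paper; all names and hyperparameters here are illustrative). A linear classifier is trained on data where a small, perfectly class-correlated "shortcut" coordinate has been injected; the model latches onto the shortcut instead of the genuine (noisy) feature, so its accuracy on clean test data drops:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, d=10, shortcut=False):
    # Two-class Gaussian data. The genuine class signal lives on
    # coordinate 0 and overlaps between classes; the optional "shortcut"
    # overwrites the last coordinate with a noise-free, perfectly
    # class-correlated value (mimicking a shortcut perturbation).
    y = rng.integers(0, 2, n)
    s = 2.0 * y - 1.0                      # labels mapped to {-1, +1}
    X = rng.normal(0.0, 1.0, (n, d))
    X[:, 0] += 1.5 * s                     # noisy but genuine feature
    if shortcut:
        X[:, -1] = 2.0 * s                 # noise-free shortcut feature
    return X, y

def train_logreg(X, y, lr=0.1, steps=2000):
    # Plain gradient descent on the logistic loss.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

X_test, y_test = make_data(4000)                      # clean test set

w_clean = train_logreg(*make_data(2000))              # clean training set
w_poison = train_logreg(*make_data(2000, shortcut=True))

acc_clean = np.mean((X_test @ w_clean > 0) == y_test)
acc_poison = np.mean((X_test @ w_poison > 0) == y_test)
```

The shortcut-trained model assigns large weight to the shortcut coordinate, which is pure noise at test time, so `acc_poison` falls well below `acc_clean`. This is the behavior that shortcut-based poisons exploit, and that adversarial training can undo by refusing to rely on small, brittle features.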
Our new approach is also inspired by the conventional, noisy-label-based poisoning approach (Biggio et al., 2012; 2011; Muñoz-González et al., 2017), where entangled labels are introduced by directly flipping labels (e.g., assigning a "dog" (or "cat") label to both the "dog" and "cat" images) under the strong assumption that the labeling process of the target model can be manipulated. However, due to the imperceptibility constraint in the common clean-label setting, we instead propose to introduce entangled features represented in the latent space.

Our work mainly makes three contributions:

• We demonstrate that, contrary to the consensus view, indiscriminate data poisoning can actually decrease the clean test accuracy of adversarially-trained (AT) models to a substantial extent. Specifically, we propose EntF, a new poisoning approach that is based on inducing entangled features in the latent space of a pre-trained reference model.

• We conduct extensive experiments to demonstrate the effectiveness of EntF against AT in the reasonable setting with ϵ_adv = ϵ_poi, and also its generalizability to a variety of more challenging settings, such as AT with higher budgets, partial poisoning, unseen model architectures, and stronger (ensemble or adaptive) defenses.

• We further highlight the distinct roles of non-robust vs. robust features in compromising standard vs. AT models, and also propose hybrid attacks that are effective even when the defender is free to adjust their AT budget ϵ_adv.
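The core idea of inducing entangled features can be sketched as PGD on a feature-space objective: perturb a sample, within an L_inf budget, so that its features under a fixed reference model move toward the feature centroid of another class (the "pull" variant shown in Figure 1). The sketch below is a minimal stand-in, not the paper's implementation: the reference model is replaced by a fixed random linear map so the gradient can be written in closed form, and all names and hyperparameters are illustrative (only the budget ϵ_poi = 8/255 is taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

D_IN, D_FEAT = 32, 16
# Stand-in for the pre-trained reference model's feature extractor.
W = rng.normal(0.0, 1.0 / np.sqrt(D_IN), (D_FEAT, D_IN))

def features(x):
    return W @ x

def entf_pull(x, target_centroid, eps=8 / 255, alpha=2 / 255, steps=20):
    # PGD on a feature-space loss: minimize the distance between the
    # perturbed sample's features and the centroid of ANOTHER class,
    # under an L_inf budget eps (cf. eps_poi = 8/255).
    delta = np.zeros_like(x)
    for _ in range(steps):
        diff = features(x + delta) - target_centroid
        grad = 2.0 * W.T @ diff            # d/d(delta) ||f(x+delta) - c||^2
        delta = np.clip(delta - alpha * np.sign(grad), -eps, eps)
    return x + delta

x = rng.normal(0.0, 1.0, D_IN)             # a "training sample"
c_other = rng.normal(0.0, 1.0, D_FEAT)     # centroid of a different class

x_poi = entf_pull(x, c_other)
d_before = np.linalg.norm(features(x) - c_other)
d_after = np.linalg.norm(features(x_poi) - c_other)
```

Under this sketch, `d_after < d_before`: the poisoned sample's features are pulled toward the other class, which is the mechanism that makes features from different classes entangle. A "push" variant would instead maximize the distance to the sample's own class centroid.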

2. RELATED WORK

Data poisoning. Data poisoning aims to compromise a model's performance at test time by manipulating its training data. Related work on poisoning DNNs has mainly investigated targeted, backdoor, and indiscriminate poisoning. Different from backdoor (Gu et al., 2017; Liu et al., 2018; Salem et al., 2022) and targeted poisoning (Muñoz-González et al., 2017; Shafahi et al., 2018; Geiping et al., 2021), which aim to degrade the model on specific (targeted) test samples, indiscriminate poisoning targets arbitrary clean test samples. Traditional indiscriminate poisoning is based on injecting noisy labels (Biggio et al., 2012; 2011; Muñoz-González et al., 2017); however, such poisons can be easily detected (Shafahi et al., 2018; Song et al., 2022). Recent methods instead pursue "clean-label" poisons by adding imperceptible perturbations. These methods mainly use the error-minimization (Huang et al., 2021; Tao et al., 2021; Fu et al., 2022) or error-maximization loss (Fowl et al., 2021b), with a pre-trained (Fowl et al., 2021b; Wang et al., 2021; Tao et al., 2021) or trained-from-scratch (Huang et al., 2021; Fu et al., 2022) reference model. However, these methods are known to be vulnerable to adversarial training (AT). In particular, two concurrent methods, ADVIN (Wang et al., 2021) and REM (Fu et al., 2022), also attempt to poison AT models, but under easy settings with ϵ_poi ≥ 2ϵ_adv. Generating poisons using a feature-space loss has also been explored (Shafahi et al., 2018; Zhu et al., 2019; Geiping et al., 2021), but without considering AT and only in the field of targeted poisoning.

Adversarial training. Adversarial training (AT) is so far recognized as the only promising solution for providing robustness against (test-time) adversarial examples (Athalye et al., 2018; Tramèr et al., 2020). It was also recently proved to be a principled defense against indiscriminate poisoning (Tao et al., 2021).
The general idea of AT is to augment the training data with adversarial examples generated in each training step. The single-step approach, FGSM, was initially used by the seminal work of Goodfellow et al. (2015) but has been found to be ineffective against multi-step attacks (Tramèr et al., 2017; Kurakin et al., 2017). To address this limitation, Madry et al. (2018) proposed PGD-based AT, which uses multi-step optimization to further enhance robustness. Other state-of-the-art methods focus on improving this PGD-based AT by,
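The PGD-based AT loop described above can be sketched at toy scale (not the paper's code; model, data, and hyperparameters are all illustrative). The inner loop maximizes the loss over an L_inf ball around each training sample via multi-step sign-gradient ascent; the outer loop minimizes the loss on the resulting adversarial examples:

```python
import numpy as np

rng = np.random.default_rng(2)

def pgd_attack(w, X, y, eps=0.3, alpha=0.1, steps=10):
    # Inner maximization: find an L_inf-bounded perturbation that
    # increases the logistic loss (multi-step sign-gradient ascent,
    # with projection onto the eps-ball via clipping).
    delta = np.zeros_like(X)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X + delta) @ w))
        grad = (p - y)[:, None] * w[None, :]   # d(loss)/d(input)
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)
    return X + delta

def adv_train(X, y, eps=0.3, lr=0.1, epochs=200):
    # Outer minimization: at every step, regenerate adversarial
    # examples for the current model and descend on their loss.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        X_adv = pgd_attack(w, X, y, eps=eps)
        p = 1.0 / (1.0 + np.exp(-X_adv @ w))
        w -= lr * X_adv.T @ (p - y) / len(y)
    return w

# Toy two-class data with a strong signal on coordinate 0.
y = rng.integers(0, 2, 500)
X = rng.normal(0.0, 0.5, (500, 5))
X[:, 0] += 2.0 * (2 * y - 1)

w = adv_train(X, y)
X_adv = pgd_attack(w, X, y)
robust_acc = np.mean((X_adv @ w > 0) == y)
```

Because the signal (±2.0) dominates the attack budget (0.3), the adversarially-trained model stays accurate even on PGD-perturbed inputs. The same mechanism, refusing to fit features smaller than ϵ_adv, is what neutralizes shortcut-style poisons when ϵ_adv ≥ ϵ_poi.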



Figure 1: The t-SNE feature visualizations for (a) clean CIFAR-10 vs. poisoned CIFAR-10 achieved by our (b) EntF-pull and (c) EntF-push, which aim to induce entangled features. As a comparison, (d) uses a reverse objective of EntF-push and instead increases the model accuracy. Different from our EntF, existing methods lead to well-separable features (see Appendix A). All t-SNE visualizations in this paper are obtained from the same clean reference model with ϵ_ref = 4.

