INCOMPATIBILITY CLUSTERING AS A DEFENSE AGAINST BACKDOOR POISONING ATTACKS

Abstract

We propose a novel clustering mechanism based on an incompatibility property between subsets of data that emerges during model training. This mechanism partitions the dataset into subsets that generalize only to themselves, i.e., training on one subset does not improve performance on the other subsets. Leveraging the interaction between the dataset and the training process, our clustering mechanism partitions datasets into clusters that are defined by, and therefore meaningful to, the objective of the training process. We apply our clustering mechanism to defend against data poisoning attacks, in which the attacker injects malicious poisoned data into the training dataset to affect the trained model's output. Our evaluation focuses on backdoor attacks against deep neural networks trained to perform image classification using the GTSRB and CIFAR-10 datasets. Our results show that (1) these attacks produce poisoned datasets in which the poisoned and clean data are incompatible and (2) our technique successfully identifies (and removes) the poisoned data. In an end-to-end evaluation, our defense reduces the attack success rate to below 1% on 134 out of 165 scenarios, with only a 2% drop in clean accuracy on CIFAR-10 and a negligible drop in clean accuracy on GTSRB.

1. OVERVIEW

Clustering, which aims to partition a dataset to surface some underlying structure, has long been a fundamental technique in machine learning and data analysis, with applications such as outlier detection and data mining across diverse fields including personalized medicine, document analysis, and networked systems (Xu & Tian, 2015). However, traditional approaches scale poorly to large, high-dimensional datasets, which limits the utility of such techniques for the increasingly large, minimally curated datasets that have become the norm in deep learning (Zhou et al., 2022). We present a new clustering technique that partitions the dataset according to an incompatibility property that emerges during model training. In particular, it partitions the dataset into subsets that generalize only to themselves (relative to the final trained model) and not to the other subsets. Directly leveraging the interaction between the dataset and the training process enables our technique to partition large, high-dimensional datasets commonly used to train neural networks (or, more generally, any class of learned models) into clusters defined by, and therefore meaningful to, the task the network is intended to perform.

Our technique operates as follows. Given a dataset, it iteratively samples a subset of the dataset, trains a model on the subset, then selects the elements within the larger dataset that the model scores most highly. These selected elements comprise a smaller dataset for the next refinement step. By decreasing the size of the selected subset in each iteration, this process produces a final subset that (ideally) contains only data that is compatible with itself. The identified subset is then removed from the starting dataset, and the process is repeated to obtain a collection of refined subsets and corresponding trained models. Finally, a majority vote using models trained on the resulting refined subsets merges compatible subsets to produce the final partitioning.
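As a concrete (and deliberately simplified) illustration, one refinement pass can be sketched as follows. The function names, the shrink schedule, and the caller-supplied train/score functions are illustrative assumptions for exposition, not the paper's implementation:

```python
import random

def refine_subset(dataset, train, score, rounds=5, shrink=0.8):
    """One refinement pass: repeatedly train on a random sample of the
    current candidate set, then keep the elements of the *full* dataset
    that the trained model scores most highly, shrinking the candidate
    set each round."""
    candidates = list(dataset)
    for _ in range(rounds):
        # Train on a subsample of the current candidates.
        sample = random.sample(candidates, max(1, len(candidates) // 2))
        model = train(sample)
        # Select the highest-scoring elements of the whole dataset,
        # decreasing the selection size each iteration.
        target = max(1, int(len(candidates) * shrink))
        candidates = sorted(dataset, key=lambda x: score(model, x),
                            reverse=True)[:target]
    return candidates
```

With a toy 1-D "model" (the mean of the sample) and a score equal to negative distance to that mean, the pass converges onto one of two well-separated value clusters, mirroring how the full mechanism isolates a self-compatible subset before removing it and repeating.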
Our evaluation focuses on backdoor data poisoning attacks (Chen et al., 2017; Adi et al., 2018) against deep neural networks (DNNs), in which the attacker injects a small amount of poisoned data into the training dataset to install a backdoor that can be used to control the network's behavior during deployment. For example, Gu et al. (2019) install a backdoor in a traffic sign classifier, which causes the network to misclassify stop signs as speed limit signs when a (physical) sticker is applied. Prior work has found that directly applying classical outlier detection techniques to the dataset fails to separate the poisoned and clean data (Tran et al., 2018). A key insight is that training on poisoned data should not improve accuracy on clean data (and vice versa). Hence, the poisoned data satisfies the incompatibility property and can be separated using our clustering technique.

This paper makes the following contributions:

Clustering with Incompatibility and Self-Expansion. Section 2 defines incompatibility between subsets of data based on their interaction with the training algorithm, specifically that training on one subset does not improve performance on the other. We prove that for datasets containing incompatible subpopulations, the subset which is most compatible with itself (as measured by the self-expansion error) must be drawn from a single subpopulation. Hence, iteratively identifying such subsets based on this property provably separates the dataset along the original subpopulations. We present a tractable clustering mechanism that approximates this optimization objective.

Formal Characterization and Defense for Data Poisoning Attacks. Section 3 presents a formal characterization of data poisoning attacks based on incompatibility, namely, that the poisoned data is incompatible with the clean data. We provide theoretical evidence to support this characterization by showing that a backdoor attack against linear classifiers provably satisfies the incompatibility property. We present a defense that leverages the clustering mechanism to partition the dataset into incompatible clean and poisoned subsets.

Experimental Evaluation. Section 4 presents an empirical evaluation of the ability of the incompatibility property presented in Section 2 and the techniques developed in Section 3 to identify and remove poisoned data to defend against backdoor data poisoning attacks. We focus on three different backdoor attacks (two dirty-label attacks using pixel-based patches and image-based patches, respectively, and a clean-label attack using adversarial perturbations) from the data poisoning literature (Gu et al., 2019; Gao et al., 2019; Turner et al., 2018). The results (1) indicate that the considered attacks produce poisoned datasets that satisfy the incompatibility property and (2) demonstrate the effectiveness of our clustering mechanism in identifying poisoned data within a larger dataset containing both clean and poisoned data. For attacks against the GTSRB and CIFAR-10 datasets, our defense mechanism reduces the attack success rate to below 1% on 134 out of the 165 scenarios, with a less than 2% drop in clean accuracy. Compared to three previous defenses, our method successfully defends against the standard dirty-label backdoor attack in 22 out of 24 scenarios, versus only 5 for the next best defense. We have open sourced our implementation at https://github.com/charlesjin/compatibility_clustering/.
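The incompatibility property that the defense relies on, that training on one subset should not improve accuracy on the other, can be probed directly by measuring cross-subset generalization. The sketch below uses a toy nearest-centroid learner on 1-D features; the function names and the learner are hypothetical illustrations, not the paper's implementation:

```python
import statistics

def train_centroid(labeled):
    """Toy learner: one per-class centroid over 1-D features."""
    by_label = {}
    for x, y in labeled:
        by_label.setdefault(y, []).append(x)
    return {y: statistics.mean(xs) for y, xs in by_label.items()}

def accuracy(model, labeled):
    """Fraction of examples whose nearest centroid matches the label."""
    hits = sum(1 for x, y in labeled
               if min(model, key=lambda c: abs(x - model[c])) == y)
    return hits / len(labeled)

def cross_generalization(train_set, eval_set):
    """Accuracy on eval_set of a model trained only on train_set.
    For incompatible subsets this stays low in both directions."""
    return accuracy(train_centroid(train_set), eval_set)
```

On a toy dirty-label setup (class-0-looking inputs relabeled as class 1), a model trained on the clean subset scores poorly on the poisoned subset even though each subset supports high accuracy on itself, which is exactly the signature the clustering mechanism exploits.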

2. INCOMPATIBILITY CLUSTERING

This section presents our clustering technique for datasets with subpopulations of incompatible data. In our formulation, two subsets are incompatible when training on one subset does not improve performance on the other subset. Because we cannot measure performance on, e.g., holdout data with ground truth labels, we introduce a "self-expansion" property: given a set of data points, we first subsample the data to obtain a smaller training set, then measure how well the learning algorithm generalizes to the full set of data. Our main theoretical result is that a set which achieves optimal expansion is homogeneous, i.e., composed entirely of data from a single subpopulation. Finally, we propose the Inverse Self-Paced Learning algorithm, which uses an approximation of the self-expansion objective to partition the dataset.
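The self-expansion measurement described above can be sketched as follows, with caller-supplied train and error functions; the function name, the subsample fraction, and the averaging over trials are illustrative assumptions rather than the paper's exact estimator:

```python
import random

def self_expansion_error(subset, train, error, frac=0.5, trials=10, seed=0):
    """Estimate how well models trained on random subsamples of `subset`
    generalize back to all of `subset` (lower = more self-compatible)."""
    rng = random.Random(seed)
    errs = []
    for _ in range(trials):
        # Subsample a smaller training set from the candidate subset.
        sample = rng.sample(subset, max(1, int(len(subset) * frac)))
        model = train(sample)
        # Measure generalization back to the full subset.
        errs.append(sum(error(model, x) for x in subset) / len(subset))
    return sum(errs) / len(errs)
```

Under a mean-based toy learner with squared-distance error, a homogeneous subset achieves a self-expansion error of zero, while a mixture of two distant subpopulations cannot, which is the gap the homogeneity result exploits.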

2.1. SETTING

We present our theoretical results in the context of a basic binary classification setting. Let X be the input space, Y = {-1, +1} be the label space, and L(·, ·) be the 0-1 loss function over Y × Y, i.e.,

L(y, y') = 0 if y = y', and L(y, y') = 1 otherwise.