INCOMPATIBILITY CLUSTERING AS A DEFENSE AGAINST BACKDOOR POISONING ATTACKS

Abstract

We propose a novel clustering mechanism based on an incompatibility property between subsets of data that emerges during model training. This mechanism partitions the dataset into subsets that generalize only to themselves, i.e., training on one subset does not improve performance on the other subsets. Leveraging the interaction between the dataset and the training process, our clustering mechanism partitions datasets into clusters that are defined by, and therefore meaningful to, the objective of the training process. We apply our clustering mechanism to defend against data poisoning attacks, in which the attacker injects malicious poisoned data into the training dataset to affect the trained model's output. Our evaluation focuses on backdoor attacks against deep neural networks trained to perform image classification on the GTSRB and CIFAR-10 datasets. Our results show that (1) these attacks produce poisoned datasets in which the poisoned and clean data are incompatible and (2) our technique successfully identifies (and removes) the poisoned data. In an end-to-end evaluation, our defense reduces the attack success rate to below 1% in 134 of 165 scenarios, with only a 2% drop in clean accuracy on CIFAR-10 and a negligible drop in clean accuracy on GTSRB.

1. OVERVIEW

Clustering, which aims to partition a dataset to surface some underlying structure, has long been a fundamental technique in machine learning and data analysis, with applications such as outlier detection and data mining across diverse fields including personalized medicine, document analysis, and networked systems (Xu & Tian, 2015). However, traditional approaches scale poorly to large, high-dimensional datasets, which limits their utility for the increasingly large, minimally curated datasets that have become the norm in deep learning (Zhou et al., 2022).

We present a new clustering technique that partitions the dataset according to an incompatibility property that emerges during model training. In particular, it partitions the dataset into subsets that generalize only to themselves (relative to the final trained model) and not to the other subsets. Directly leveraging the interaction between the dataset and the training process enables our technique to partition large, high-dimensional datasets commonly used to train neural networks (or, more generally, any class of learned models) into clusters defined by, and therefore meaningful to, the task the network is intended to perform.

Our technique operates as follows. Given a dataset, it iteratively samples a subset of the dataset, trains a model on the subset, and then selects the elements within the larger dataset that the model scores most highly. These selected elements comprise a smaller dataset for the next refinement step. By decreasing the size of the selected subset in each iteration, this process produces a final subset that (ideally) contains only data that is compatible with itself. The identified subset is then removed from the starting dataset, and the process is repeated to obtain a collection of refined subsets and corresponding trained models. Finally, a majority vote using models trained on the resulting refined subsets merges compatible subsets to produce the final partitioning.
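The refinement-and-voting procedure above can be sketched in a few dozen lines. This is a minimal illustration under simplifying assumptions, not the paper's implementation: `train` and `score` here use a nearest-centroid classifier in place of neural-network training and confidence scoring, the merge step is one plausible reading of the majority vote, and all function names (`refine`, `incompatibility_clustering`) are hypothetical.

```python
import numpy as np

# Toy stand-ins for the training and scoring steps. The paper trains a
# deep network and scores elements by model output; a nearest-centroid
# classifier keeps this sketch self-contained.

def train(X, y):
    """Fit one centroid per class label."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def score(model, X, y):
    """Higher score = element better explained by the model
    (negative distance to its own class centroid)."""
    out = np.full(len(X), -np.inf)
    for i, (x, c) in enumerate(zip(X, y)):
        if c in model:
            out[i] = -np.linalg.norm(x - model[c])
    return out

def refine(X, y, pool, sizes, rng):
    """One refinement run: repeatedly train on a sample of the current
    pool and keep only the top-scoring elements, shrinking each step."""
    for k in sizes:
        sample = rng.choice(pool, size=min(len(pool), k), replace=False)
        model = train(X[sample], y[sample])
        s = score(model, X[pool], y[pool])
        pool = pool[np.argsort(s)[-k:]]  # keep the k highest-scoring elements
    return pool

def incompatibility_clustering(X, y, sizes, n_rounds, seed=0):
    """Repeatedly refine a subset, remove it from the dataset, and finally
    merge compatible subsets via a majority vote over the trained models."""
    rng = np.random.default_rng(seed)
    remaining = np.arange(len(X))
    subsets, models = [], []
    for _ in range(n_rounds):
        if len(remaining) < sizes[-1]:
            break
        subset = refine(X, y, remaining, sizes, rng)
        subsets.append(subset)
        models.append(train(X[subset], y[subset]))
        remaining = np.setdiff1d(remaining, subset)
    # Majority vote: each element votes for the model that scores it
    # highest; subsets whose majority model agrees share a cluster label.
    labels = []
    for subset in subsets:
        votes = np.argmax([score(m, X[subset], y[subset]) for m in models], axis=0)
        labels.append(int(np.bincount(votes).argmax()))
    return subsets, labels
```

The decreasing `sizes` schedule mirrors the paper's shrinking selection: each pass keeps fewer, better-fitting elements, so the final subset (ideally) contains only mutually compatible data.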

