INCOMPATIBILITY CLUSTERING AS A DEFENSE AGAINST BACKDOOR POISONING ATTACKS

Abstract

We propose a novel clustering mechanism based on an incompatibility property between subsets of data that emerges during model training. This mechanism partitions the dataset into subsets that generalize only to themselves, i.e., training on one subset does not improve performance on the other subsets. By leveraging the interaction between the dataset and the training process, our clustering mechanism partitions datasets into clusters that are defined by, and therefore meaningful to, the objective of the training process. We apply our clustering mechanism to defend against data poisoning attacks, in which the attacker injects malicious poisoned data into the training dataset to affect the trained model's output. Our evaluation focuses on backdoor attacks against deep neural networks trained to perform image classification using the GTSRB and CIFAR-10 datasets. Our results show that (1) these attacks produce poisoned datasets in which the poisoned and clean data are incompatible and (2) our technique successfully identifies (and removes) the poisoned data. In an end-to-end evaluation, our defense reduces the attack success rate to below 1% on 134 out of 165 scenarios, with only a 2% drop in clean accuracy on CIFAR-10 and a negligible drop in clean accuracy on GTSRB. We have open sourced our implementation at https://github.com/charlesjin/compatibility_clustering/.

1. OVERVIEW

Clustering, which aims to partition a dataset to surface some underlying structure, has long been a fundamental technique in machine learning and data analysis, with applications such as outlier detection and data mining across diverse fields including personalized medicine, document analysis, and networked systems (Xu & Tian, 2015). However, traditional approaches scale poorly to large, high-dimensional datasets, which limits the utility of such techniques for the increasingly large, minimally curated datasets that have become the norm in deep learning (Zhou et al., 2022).

We present a new clustering technique that partitions the dataset according to an incompatibility property that emerges during model training. In particular, it partitions the dataset into subsets that generalize only to themselves (relative to the final trained model) and not to the other subsets. Directly leveraging the interaction between the dataset and the training process enables our technique to partition large, high-dimensional datasets commonly used to train neural networks (or, more generally, any class of learned models) into clusters defined by, and therefore meaningful to, the task the network is intended to perform.

Our technique operates as follows. Given a dataset, it iteratively samples a subset of the dataset, trains a model on the subset, then selects the elements within the larger dataset that the model scores most highly. These selected elements comprise a smaller dataset for the next refinement step. By decreasing the size of the selected subset in each iteration, this process produces a final subset that (ideally) contains only data which is compatible with itself. The identified subset is then removed from the starting dataset, and the process is repeated to obtain a collection of refined subsets and corresponding trained models. Finally, a majority vote using models trained on the resulting refined subsets merges compatible subsets to produce the final partitioning.

Our evaluation focuses on backdoor data poisoning attacks (Chen et al., 2017; Adi et al., 2018) against deep neural networks (DNNs), in which the attacker injects a small amount of poisoned data into the training dataset to install a backdoor that can be used to control the network's behavior during deployment. For example, Gu et al. (2019) install a backdoor in a traffic sign classifier that causes the network to misclassify stop signs as speed limit signs when a (physical) sticker is applied. Prior work has found that directly applying classical outlier detection techniques to the dataset fails to separate the poisoned and clean data (Tran et al., 2018). A key insight is that training on poisoned data should not improve accuracy on clean data (and vice versa). Hence, the poisoned data satisfies the incompatibility property and can be separated using our clustering technique. This paper makes the following contributions:

Clustering with Incompatibility and Self-Expansion. Section 2 defines incompatibility between subsets of data based on their interaction with the training algorithm: specifically, that training on one subset does not improve performance on the other. We prove that for datasets containing incompatible subpopulations, the subset which is most compatible with itself (as measured by the self-expansion error) must be drawn from a single subpopulation. Hence, iteratively identifying such subsets based on this property provably separates the dataset along the original subpopulations. We present a tractable clustering mechanism that approximates this optimization objective.

Formal Characterization and Defense for Data Poisoning Attacks. Section 3 presents a formal characterization of data poisoning attacks based on incompatibility, namely, that the poisoned data is incompatible with the clean data. We provide theoretical evidence to support this characterization by showing that a backdoor attack against linear classifiers provably satisfies the incompatibility property. We present a defense that leverages the clustering mechanism to partition the dataset into incompatible clean and poisoned subsets.

Experimental Evaluation. Section 4 presents an empirical evaluation of the ability of the incompatibility property presented in Section 2 and the techniques developed in Section 3 to identify and remove poisoned data to defend against backdoor data poisoning attacks. We focus on three different backdoor attacks (two dirty-label attacks using pixel-based patches and image-based patches, respectively, and a clean-label attack using adversarial perturbations) from the data poisoning literature (Gu et al., 2019; Gao et al., 2019; Turner et al., 2018). The results (1) indicate that the considered attacks produce poisoned datasets that satisfy the incompatibility property and (2) demonstrate the effectiveness of our clustering mechanism in identifying poisoned data within a larger dataset containing both clean and poisoned data. For attacks against the GTSRB and CIFAR-10 datasets, our defense mechanism reduces the attack success rate to below 1% on 134 out of the 165 scenarios, with a less than 2% drop in clean accuracy. Compared to three previous defenses, our method successfully defends against the standard dirty-label backdoor attack in 22 out of 24 scenarios, versus only 5 for the next best defense.

2. CLUSTERING INCOMPATIBLE DATA

This section presents our clustering technique for datasets with subpopulations of incompatible data. In our formulation, two subsets are incompatible when training on one subset does not improve performance on the other subset. Because we cannot measure performance on, e.g., holdout data with ground truth labels, we introduce a "self-expansion" property: given a set of data points, we first subsample the data to obtain a smaller training set, then measure how well the learning algorithm generalizes to the full set of data. Our main theoretical result is that a set which achieves optimal expansion is homogeneous, i.e., composed entirely of data from a single subpopulation. Finally, we propose the Inverse Self-Paced Learning algorithm, which uses an approximation of the self-expansion objective to partition the dataset.

2.1. PRELIMINARIES

We present our theoretical results in the context of a basic binary classification setting. Let X be the input space, Y = {-1, +1} be the label space, and L(·, ·) be the 0-1 loss function over Y × Y, i.e., L(y_1, y_2) = (1 - y_1 · y_2)/2. Given a target distribution D supported on X × Y and a parametric family of functions f_θ, a standard objective in this setting is to find θ minimizing the population risk R(θ) := ∫_{X×Y} L(f_θ(x), y) dP_D(x, y). Given n data points D = {(x_1, y_1), ..., (x_n, y_n)}, the empirical risk is defined as R_emp(θ; D) := n^{-1} Σ_{i=1}^n L(f_θ(x_i), y_i).

We fix a learning algorithm A(·) which takes as input a training set and returns parameters θ. For example, if f_θ is a family of DNNs, then A might train a DNN via stochastic gradient descent.
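These definitions translate directly into code. The following minimal Python sketch (our own illustration; the function names and the toy threshold classifier are not from the paper) implements the 0-1 loss and the empirical risk:

```python
def zero_one_loss(y1, y2):
    # 0-1 loss over Y = {-1, +1}: L(y1, y2) = (1 - y1*y2) / 2
    return (1 - y1 * y2) // 2

def empirical_risk(model, data):
    # R_emp(theta; D) = (1/n) * sum_i L(f_theta(x_i), y_i)
    return sum(zero_one_loss(model(x), y) for x, y in data) / len(data)

# Toy usage: a fixed threshold classifier on 1-D inputs.
model = lambda x: 1 if x > 0 else -1
data = [(0.5, 1), (-1.2, -1), (2.0, -1), (-0.3, 1)]  # last two are mislabeled
print(empirical_risk(model, data))  # 0.5
```

Here the "model" plays the role of f_θ for some fixed parameters; a learning algorithm A would produce such a function from a training set.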

2.2. SELF-EXPANSION AND COMPATIBILITY

To measure generalization without holdout data and independent of any ground truth labels, we propose the following self-expansion property of sets:

Definition 2.1 (Self-expansion of sets). Let S and T be sets (S nonempty), and let α ∈ (0, 1] denote the subsampling rate. We define the α-expansion error of S given T as

    ϵ(S|T; α) := E_{S′∼S∪T}[R_emp(A(S′); S)]    (1)

where the expectation is over both the randomness in A and S′, a random variable that samples a uniformly random α fraction per class (rounded up) from S ∪ T. When T = ∅ we also write ϵ(S; α).

Here α is a hyperparameter that is typically selected a priori on the basis of some domain knowledge (for example, if α = 1 and A trains a deep neural network, then the network can memorize the training set). A typical value of α is 1/4 or 1/2; in our ablation studies, we find that our particular application is robust to a wide range of α.

The self-expansion property measures the ability of the learning algorithm A to generalize from a subset of S ∪ T to the empirical distribution of S. Intuitively, when T = ∅, a smaller expansion error means that the set S is both "easier" and "more homogeneous" with respect to the learning algorithm, as a random subset is sufficient to learn the contents of S. When T is not empty, we expect that the self-expansion error will be lower when T contains data similar to S (for instance, if both S and T are drawn from the same distribution D). This observation leads directly to our formal definition of compatibility between sets:

Definition 2.2 (Compatibility of sets). A set T is α-compatible with set S if

    ϵ(S|T; α) ≤ ϵ(S; α).    (2)

Conversely, T is α-incompatible with S if the opposite holds, i.e.,

    ϵ(S; α) ≤ ϵ(S|T; α).    (3)

Furthermore, T is completely α-compatible with S if every subset of T is α-compatible with every subset of S, and similarly for complete incompatibility. In other words, T is (formally) compatible with S if T improves the ability of A to generalize to S.
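The expectation in Definition 2.1 can be estimated by Monte Carlo sampling. The sketch below is our own illustration: a nearest-class-centroid learner on 1-D inputs stands in for A, and all names are ours. It estimates ϵ(S|T; α) and exhibits an incompatible T (points with flipped labels in a far-away region) that increases the expansion error of S:

```python
import random
from math import ceil

def zero_one(y1, y2):
    return (1 - y1 * y2) // 2

def centroid_learner(train):
    # Trivial stand-in for the learning algorithm A on 1-D inputs:
    # classify by the nearer class centroid.
    means = {}
    for label in (-1, 1):
        xs = [x for x, y in train if y == label]
        means[label] = sum(xs) / len(xs) if xs else 0.0
    return lambda x: 1 if abs(x - means[1]) < abs(x - means[-1]) else -1

def sample_per_class(data, alpha, rng):
    # Draw a uniformly random alpha fraction of each class (rounded up).
    out = []
    for label in (-1, 1):
        cls = [d for d in data if d[1] == label]
        out += rng.sample(cls, ceil(alpha * len(cls)))
    return out

def expansion_error(S, T, alpha, trials=200, seed=0):
    # Monte Carlo estimate of eps(S | T; alpha): train on a random
    # subsample of S ∪ T, evaluate the empirical risk on all of S.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        model = centroid_learner(sample_per_class(S + T, alpha, rng))
        total += sum(zero_one(model(x), y) for x, y in S) / len(S)
    return total / trials

# S: well-separated data; T: points with flipped labels in a far-away
# region, so T should be incompatible with S (Definition 2.2).
S = [(x, 1) for x in (9, 10, 11, 12)] + [(x, -1) for x in (-9, -10, -11, -12)]
T = [(-100, 1), (-101, 1), (100, -1), (101, -1)]
print(expansion_error(S, [], 0.5))  # 0.0: S expands well to itself
print(expansion_error(S, T, 0.5))   # > 0: adding T hurts expansion
```

With this construction, any subsample containing a point of T drags a class centroid far from its cluster, so ϵ(S|T; α) > ϵ(S; α), i.e., T fails the compatibility test of Definition 2.2.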

2.3. SEPARATION OF INCOMPATIBLE DATA

Our next result allows us to separate subpopulations of data satisfying the incompatibility property (proof in Appendix A). The main idea is that, given any set, we can always achieve better self-expansion by removing incompatible data (if it exists).

Theorem 2.3 (Sets minimizing expansion error are homogeneous). Let S = A ∪ B be a set with A and B nonempty and mutually completely α-incompatible. Define ϵ* := min_{S′⊆S} ϵ(S′; α), and let S* be the collection of subsets of S that achieve ϵ*. Then for any smallest subset S*_min ∈ S*, either S*_min ⊆ A or S*_min ⊆ B. Furthermore, if at least one of the incompatibilities is strict, then for all S* ∈ S* we have either S* ⊆ A or S* ⊆ B.

Theorem 2.3 suggests an iterative procedure for separating subsets of data that satisfy the incompatibility property. Namely, at each step i, identify S*_min in the current dataset D_i, then repeat the procedure with D_{i+1} = D_i \ S*_min. The procedure terminates when D_{i+1} = ∅. Assuming the subpopulations in the dataset satisfy the incompatibility property, this process partitions the training set into subsets that are guaranteed to contain data from a single subpopulation. Furthermore, if the subpopulations satisfy strict incompatibility, we can instead take the largest S* ∈ S* at each step. The next section develops this insight into a practical clustering mechanism for incompatible subpopulations.

Algorithm 1 Inverse Self-Paced Learning
Input: training set S, total iterations N, annealing schedule 1 ≥ β_0 ≥ ... ≥ β_N = β_min > 0, expansion α ≤ 1, momentum η ∈ [0, 1], learning algorithm A, initial parameters θ_0
Output: S_N ⊆ S such that |S_N| = β_min|S|
1: S_0 ← S
2: L ← 0
3: for t = 1 to N do
4:   S′ ← SAMPLE(S_{t-1}, α)
5:   θ_t ← A(S′, θ_{t-1})
6:   L ← ηL + (1 − η) R_emp(θ_t; S)
7:   S_t ← TRIM(L, β_t)
8: end for

2.4. INVERSE SELF-PACED LEARNING

A major question raised by Theorem 2.3 is how to identify the set S*. In general, even computing a single self-expansion error of a given set S is intractable, as it involves taking an expectation over all subsets of size α|S|. The Inverse Self-Paced Learning (ISPL) algorithm presented below addresses this intractability. Rather than optimizing over all possible subsets of the training data, we instead seek to minimize the expansion error over subsets of a fixed size β|S|. The new optimization objective is defined as:

    S*_β := argmin_{S′ ⊆ S : |S′| = β|S|} ϵ(S′; α)

We alternate between optimizing the parameters θ_t and the training subset S_t: given S_{t-1}, we update θ_t using a single subset of S_{t-1} of size α|S_{t-1}|. Then we use θ_t to compute the loss for each element in S, and set S_t to be the β fraction of the samples with the lowest losses. To encourage stability of the learning algorithm, the losses are smoothed by a momentum term η. To encourage more global exploration in the initial stages, we anneal the parameter β from an initial value β_0 down to the target value β_min. Algorithm 1 presents the full algorithm. SAMPLE takes a set S and samples an α fraction of each class uniformly at random; TRIM takes losses L and returns the β fraction with the lowest loss.

To conclude, we show for certain choices of parameters that Algorithm 1 converges to a local optimum of the following objective over the training set S:

    F(θ_t, S_t; β_t) := Σ_{i ∈ S_t} L(f_{θ_t}(x_i), y_i) + max(0, β_t|S| − |S_t|)

where S_t is a subset of S and β_t is decreasing in t (proof in Appendix A).

Proposition 2.4. Set α = 1 and η = 0 in Algorithm 1, and assume that A returns the empirical risk minimizer. Then for each round of the algorithm, F(θ_t, S_t; β_t) is decreasing in t, and furthermore, |F(θ_t, S_t; β_t) − F(θ_{t+1}, S_{t+1}; β_{t+1})| → 0 as t → ∞.
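The ISPL loop of Algorithm 1 can be sketched compactly. The following is our own toy instantiation, not the paper's implementation: a nearest-centroid learner on 1-D inputs replaces A, per-example 0-1 losses replace the DNN losses, and the dataset, schedule, and seed are illustrative choices. It nonetheless follows Algorithm 1 step by step (SAMPLE, train, momentum-smoothed losses over the full set, TRIM):

```python
import random
from math import ceil

def zero_one(y1, y2):
    return (1 - y1 * y2) // 2

def centroid_learner(train):
    # Stand-in for A: nearest-class-centroid classifier on 1-D inputs.
    means = {}
    for label in (-1, 1):
        xs = [x for x, y in train if y == label]
        means[label] = sum(xs) / len(xs) if xs else 0.0
    return lambda x: 1 if abs(x - means[1]) < abs(x - means[-1]) else -1

def ispl(S, betas, alpha=0.5, eta=0.5, seed=0):
    # Inverse Self-Paced Learning (Algorithm 1) with a toy learner.
    rng = random.Random(seed)
    current = list(S)                # S_{t-1}
    smoothed = [0.0] * len(S)        # momentum-smoothed per-example losses L
    for beta in betas:
        # SAMPLE: alpha fraction of each class of S_{t-1}, uniformly at random
        subset = []
        for label in (-1, 1):
            cls = [d for d in current if d[1] == label]
            if cls:
                subset += rng.sample(cls, ceil(alpha * len(cls)))
        model = centroid_learner(subset)
        # Line 6: smooth the losses over the FULL training set S
        for i, (x, y) in enumerate(S):
            smoothed[i] = eta * smoothed[i] + (1 - eta) * zero_one(model(x), y)
        # TRIM: keep the beta fraction of S with the lowest smoothed loss
        k = ceil(beta * len(S))
        keep = sorted(range(len(S)), key=lambda i: smoothed[i])[:k]
        current = [S[i] for i in keep]
    return current

# Clean clusters around ±10, plus an incompatible cluster with flipped labels.
clean = [(x, 1) for x in (8, 9, 10, 11, 12)] + [(-x, -1) for x in (8, 9, 10, 11, 12)]
poison = [(9.5, -1), (10.5, -1), (-9.5, 1), (-10.5, 1)]
kept = ispl(clean + poison, betas=[0.95, 0.9, 0.8, 0.7], alpha=0.5)
# The annealed trimming should discard the flipped-label points,
# leaving only the clean cluster.
```

The flipped-label points consistently incur high loss under centroids dominated by clean data, so annealing β down to 0.7 (10 of 14 points) trims them away.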

3. DEFENDING AGAINST BACKDOOR ATTACKS

We present a novel characterization of poisoning attacks based on the formal definition of compatibility introduced in Section 2. In our threat model, the attacker aims to affect the trained model's behavior in ways that do not emerge from clean data alone, so we characterize the poisoned data by its incompatibility with clean data. We prove this observation holds for a backdoor attack in the context of linear classifiers. As Theorem 2.3 yields a method of partitioning the dataset into clusters of clean and poisoned data (when the clean and poisoned data are incompatible), we present a boosting algorithm to identify which clusters contain clean data (and hence remove the poisoned data).

3.1. THREAT MODEL

The attacker's goal is to install a backdoor in the trained network that 1) changes the poisoned network's predictions on poisoned data while 2) preserving the accuracy of the poisoned network on clean data. For the backdoor to be meaningful, the behavior of a poisoned model on poisoned data should also be inconsistent with a model trained on clean data. The attacker constructs the poisoned set P by applying a training-time perturbation (trigger) τ_train to copies of clean inputs (typically with altered labels), then injects P into the clean training set C; the model is trained on C ∪ P.

Let θ be the parameters of the network under attack. Given m fresh test samples T drawn i.i.d. from D, the attacker seeks to maximize the targeted misclassification rate (TMR):

    TMR(θ; T) := (1/m) Σ_{i=1}^m L(f_θ(x_i), ¬y_i) · L(f_θ(τ_test(x_i)), y_i),

i.e., the attack succeeds if it can flip the label of a correctly classified instance by applying the trigger τ_test(·) at test time. This metric captures both the "hidden" nature of the backdoor (the first term rewards the attacker only if clean inputs are correctly classified) as well as control upon application of the trigger (the second term rewards the attacker only if poisoned inputs are misclassified).
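The TMR is straightforward to compute given a model and a trigger. The sketch below is our own toy example (the linear model, trigger, and test points are illustrative, not from the paper); note that with the 0-1 loss over {-1, +1}, L(f(x), ¬y) = 1 exactly when f(x) = y:

```python
def zero_one(y1, y2):
    # 0-1 loss over Y = {-1, +1}
    return (1 - y1 * y2) // 2

def tmr(model, trigger, test_set):
    # TMR(theta; T): a test point contributes 1 only if it is classified
    # correctly when clean (first factor) AND misclassified once the
    # trigger is applied (second factor).
    total = 0
    for x, y in test_set:
        correct_when_clean = zero_one(model(x), -y)          # 1 iff f(x) == y
        flipped_by_trigger = zero_one(model(trigger(x)), y)  # 1 iff f(tau(x)) != y
        total += correct_when_clean * flipped_by_trigger
    return total / len(test_set)

# Hypothetical backdoored linear model on 2-D inputs: the second
# coordinate acts as the trigger channel.
model = lambda x: 1 if x[0] - 2 * x[1] > 0 else -1
trigger = lambda x: (x[0], 10 if x[0] > 0 else -10)
test_set = [((10, 0), 1), ((-10, 0), -1)]
print(tmr(model, trigger, test_set))  # 1.0
```

A TMR of 1.0 means every test point is correctly classified when clean and flipped when triggered: the worst case for the defender.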

3.1.1. DEFENSE CAPABILITIES AND OBJECTIVE

The defender is given only the poisoned dataset C ∪ P. In particular, τ_train, τ_test, A, and C are not available to the defender. The defender's objective is to return a sanitized dataset D̂ ⊆ C ∪ P such that the learned parameters θ̂ := A(D̂) are as close as possible to the ground-truth parameters θ* := A(C). Note that θ* is guaranteed to exhibit a low targeted misclassification rate: the threat model requires that for fresh (x, y) drawn from D, f_{θ*}(τ_test(x)) = y with high probability.

3.2. CHARACTERIZING BACKDOOR POISONING ATTACKS USING INCOMPATIBILITY

The following property formalizes the intuition that increasing the amount of clean data in the training set should not improve performance on poisoned data (and vice versa):

Definition 3.1 (Incompatibility Property). Let C and P be the clean and poisoned data produced by an attacker according to the threat model defined in Section 3.1. We say that the attacker satisfies the incompatibility property if C and P are mutually completely incompatible. Furthermore, if at least one of the incompatibilities between C and P is strict, then the attacker satisfies the strict incompatibility property.

Note that incompatibility is defined with respect to the learning algorithm A as well as the subsampling rate α. To motivate this definition, we exhibit a backdoor attack against linear classifiers which falls under the threat model in Section 3.1 and provably satisfies the incompatibility property:

Theorem 3.2. Let the input space be X = R^N (N > 1). We use the parametric family of functions f_{w,b} consisting of linear separators given by f_{w,b}(x) = sign(⟨w, x⟩ + b) for w ∈ R^N, b ∈ R, and A selects the maximum margin classifier. Then there exists a distribution D and perturbation functions τ_train and τ_test such that if the adversary poisons 1/4 of each class at random, then with high probability over the initial training samples:
1. Given the clean dataset, A returns parameters with 0 empirical (and population) risk.
2. Given the poisoned dataset, A returns parameters with a targeted misclassification rate of 1.
3. The clean and poisoned data are mutually completely incompatible.

The proof is in Appendix A. The idea is to construct a dataset such that there exist dimensions in the original, clean feature space that are uncorrelated with the true labels, i.e., consist of pure noise. The attacker makes use of these extra dimensions by replacing the noise with a backdoor signal.
Although this result uses linear classifiers, the setting is analogous to many existing attacks against DNNs; for example, many backdoor attacks against image classifiers insert synthetic patches on the border of the image, which is a location that does not typically affect the original classification task.
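The Theorem 3.2 construction is easy to reproduce end to end. The demo below is our own sketch: it substitutes a simple perceptron for the theorem's max-margin learner (the data remains linearly separable, so the perceptron converges), and the specific points are illustrative. The noise coordinate is overwritten with the trigger in the poisoned samples, and the learned classifier acquires the backdoor:

```python
def perceptron(data, epochs=100):
    # Simple perceptron standing in for the max-margin learner of
    # Theorem 3.2; it converges because the poisoned data is still
    # linearly separable (e.g., by w = (1, -2), b = 0).
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x0, x1), y in data:
            if y * (w[0] * x0 + w[1] * x1 + b) <= 0:
                w[0] += y * x0
                w[1] += y * x1
                b += y
                mistakes += 1
        if mistakes == 0:
            break  # converged: all points correctly classified
    return lambda x: 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

# Clean data: two classes separated along the first coordinate; the
# second coordinate is label-uncorrelated "noise" (here fixed at 0).
clean = [((10, 0), 1), ((-10, 0), -1)]
# Poison: flip the label and overwrite the noise coordinate with the trigger.
poison = [((10, 10), -1), ((-10, -10), 1)]

f = perceptron(clean + poison)
print(f((10, 0)), f((-10, 0)))     # 1 -1: clean inputs still correct
print(f((10, 10)), f((-10, -10)))  # -1 1: triggered inputs flip
```

Training on the poisoned dataset yields a separator that leans on the trigger coordinate, so clean accuracy is preserved while every triggered input flips, exactly the TMR = 1 behavior of the theorem.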

Algorithm 2 Boosting Homogeneous Sets

Input: homogeneous sets S_1, ..., S_N; number of samples n = Σ_{i=1}^N |S_i|; learning algorithm A
Output: votes V_1, ..., V_N
1: C_1, ..., C_N ← 0
2: for i = 1 to N do
3:   θ_i ← A(S_i)
4:   for k = 1 to N do
5:     if LOSS_{0,1}(θ_i; S_k) < |S_k|/2 then
6:       C_k ← C_k + |S_i|
7:     end if
8:   end for
9: end for
10: for i = 1 to N do
11:   V_i ← C_i > n/2
12: end for

3.3. IDENTIFICATION USING BOOSTING

We conclude this section by showing how to identify the clean data given the output of our clustering mechanism, which is a collection of homogeneous (i.e., entirely clean or entirely poisoned) subsets. One idea is to measure the compatibility between each pair of subsets, then take the largest mutually compatible (sub)collection of subsets (under the assumption that the clean data is always compatible with itself). However, in general this requires training on each subset multiple times in sequence. We instead propose a simplified boosting algorithm for identifying the clean data which involves training on each subset exactly once. The main idea is to fit a weak learner to each cluster, then use each weak learner to vote on the other clusters. The following sufficient conditions guarantee the success of our boosting algorithm:

Property 3.3 (Compatibility property of disjoint sets). Let C′ and C″ be any two disjoint sets of clean data, and let P′ be a set of poisoned data. Then

    R_emp(A(C′); C″) < 1/2 ≤ R_emp(A(C′); P′).

The first inequality says that training and testing on disjoint sets of clean data is an improvement over random guessing. The second inequality says that training on clean data then testing on poisoned data is at best equivalent to random guessing; this property is natural in that the threat model already guarantees R_emp(A(C); P) is large (i.e., close to 1) for random poisoned data P. Algorithm 2 presents our approach for boosting from homogeneous sets.
The subroutine LOSS_{0,1} takes a set of parameters and a set S, and returns the total (unnormalized) 0-1 loss over S. Each subset's vote is weighted by its size. We have the following correctness guarantee (proof in Appendix A):

Theorem 3.4 (Identification of clean samples). Let S_1, ..., S_N be a collection of homogeneous subsets, each containing either entirely clean or entirely poisoned data. If the clean and poisoned data satisfy Property 3.3, then Algorithm 2 votes V_i = True if S_i is clean (and V_i = False otherwise).
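Algorithm 2 can be sketched directly. The following is our own toy instantiation (the nearest-centroid weak learner and the three subsets are illustrative, not from the paper): two mutually consistent "clean" subsets out-vote a flipped-label "poisoned" subset, exactly as Theorem 3.4 predicts:

```python
def zero_one(y1, y2):
    return (1 - y1 * y2) // 2

def centroid_learner(train):
    # Weak learner standing in for A: nearest class centroid on 1-D inputs.
    means = {}
    for label in (-1, 1):
        xs = [x for x, y in train if y == label]
        means[label] = sum(xs) / len(xs) if xs else 0.0
    return lambda x: 1 if abs(x - means[1]) < abs(x - means[-1]) else -1

def loss01_total(model, subset):
    # LOSS_{0,1}: total (unnormalized) 0-1 loss of the model on the subset.
    return sum(zero_one(model(x), y) for x, y in subset)

def boost_votes(subsets, learner):
    # Algorithm 2: train one weak learner per subset; each learner votes
    # for every subset it fits better than chance, weighted by its size.
    n = sum(len(s) for s in subsets)
    counts = [0] * len(subsets)
    for s_i in subsets:
        model = learner(s_i)
        for k, s_k in enumerate(subsets):
            if loss01_total(model, s_k) < len(s_k) / 2:
                counts[k] += len(s_i)
    return [c > n / 2 for c in counts]

# Two "clean" clusters that agree with each other, and one "poisoned"
# cluster with flipped labels in the same region.
S1 = [(9, 1), (11, 1), (-9, -1), (-11, -1)]
S2 = [(8, 1), (12, 1), (-8, -1), (-12, -1)]
S3 = [(9.5, -1), (10.5, -1), (-9.5, 1), (-10.5, 1)]
votes = boost_votes([S1, S2, S3], centroid_learner)
print(votes)  # [True, True, False]
```

The learner fit to S3 has inverted centroids, so it fails to vote for S1 and S2; the clean learners vote for each other, and the weighted majority isolates S3 as poisoned.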

4. EXPERIMENTAL EVALUATION

We report empirical evaluations of the incompatibility property (Section 4.1) and the proposed defense (Section 4.2). We use the CIFAR-10 dataset (Krizhevsky & Hinton, 2009) for general image recognition, containing 10 classes with 5000 training and 1000 test images each, and the GTSRB dataset (Stallkamp et al., 2012) for traffic sign recognition, containing 43 classes with 26640 training and 12630 test images in total. The GTSRB training set is highly imbalanced, with classes ranging from 150 to 1500 instances. Our experiments consider three types of backdoor attacks. Dirty label backdoor (DLBD) attacks use small patches ranging from a single pixel to a 3x3 checkerboard pattern, following the implementation of Tran et al. (2018). WATERMARK attacks use 8x8 images such as a copyright sign or a peace sign (see Appendix B for the full list). The clean label backdoor (CLBD) attack uses adversarial perturbations of source-class images (Turner et al., 2018).

4.1. INCOMPATIBILITY BETWEEN CLEAN AND POISONED DATA

We evaluate whether the attacks satisfy the incompatibility property by measuring the gap between ϵ(C; α) and ϵ(C|P; α) for clean data C and poisoned data P generated by the three attacks (DLBD, WATERMARK, CLBD) on CIFAR-10. Figure 2 presents selected plots for the DLBD (e.g., relabelling airplanes as birds) and GAN-based CLBD (e.g., perturbing airplanes toward other classes) scenarios. Along the x-axis, we increase the ratio of poisoned to clean data in the source class from 0 to 1. Our experimental procedure is as follows: we first sample 1/8 of each class in the clean dataset to create C. For each poison ratio, we sample P′ ∼ P according to the poison ratio, train on P′ ∪ C with α = 1/2, then take the mean over 5 samples of P′ as the self-expansion error. The entire process is repeated for 5 samples of C for each poisoned dataset in our evaluations (8 total for DLBD, 5 total for GAN-based CLBD). To better present the gap ϵ(C|P; α) − ϵ(C; α), we report each metric as an absolute difference in accuracy from the unpoisoned case (ratio = 0). A trendline connects the medians.

[Table 1 caption: Columns display different datasets and poisoning strategies (e.g., poisoning CIFAR-10 using the CLBD attack): CIFAR-10 (DLBD 1-to-1, all-to-1, all-to-all; WATERMARK; CLBD), GTSRB (DLBD; WATERMARK), and totals. Rows display different amounts of poison (e.g., ϵ = 5 refers to 5% of the source class being poisoned). An entry of the form "7 / 8" indicates that ISPL+B successfully defended 7 of the 8 scenarios in the corresponding setting. Please refer to Section 4.2 for a detailed description.]

Our first observation is that the key quantity, ϵ(C|P; α) − ϵ(C; α), is negative for all poisoned datasets, which empirically supports our insight that clean and poisoned data are incompatible (or even strictly incompatible), despite the qualitative differences between the attack mechanisms. The trend in the top plot also indicates that the gap between ϵ(C|P; α) and ϵ(C; α) increases with the amount of poisoned data.
The next two plots show how the gap breaks down over the source and non-source classes. For the DLBD attack, the source class shows a strong negative trend in accuracy as the amount of poison increases, while accuracy on the non-source classes stays largely constant. This fact is consistent with the hypothesis that the poisoned data is impairing the accuracy of the classifier on clean data from the source class (e.g., clean airplanes are getting misclassified as birds). For the CLBD attack, even though the total clean accuracy decreases, the source class accuracy actually increases as the amount of poison increases. We attribute this phenomenon to the classifier learning the perturbed poisoned training instances in the source class, which yields improved accuracy on the clean instances as well. The decrease in clean accuracy is therefore due to a drop in accuracy among the non-source classes. Because the CLBD attack produces perturbations of the source class toward other classes (thus making the other classes more similar to, e.g., airplanes), this behavior is consistent with the classifier misclassifying clean instances of non-source classes as the source class. These observations are corroborated by the increase in accuracy on poisoned data in both cases, suggesting that the classifier's mistakes on clean data are correlated with increased accuracy on similar-looking poisoned data. Appendix D contains further discussion and results.

4.2. PERFORMANCE OF PROPOSED DEFENSE

We next evaluate the performance of our proposed defense on the full range of attacks. Most settings use one-to-one attacks (e.g., place a trigger on dogs and mislabel them as cats; an ϵ percentage of the source class is poisoned), which is the most common setting in the literature (Gu et al., 2019). For the CIFAR-10 DLBD scenario, we additionally use all-to-one attacks (e.g., place a trigger on all non-cats and mislabel them as cats; an ϵ/9 percentage of each non-target class is poisoned) and all-to-all attacks (i.e., place a trigger on any image and mislabel it according to a cyclic permutation on the classes; an ϵ percentage of every class is poisoned). Table 1 presents a summary of our results. Each entry is of the form "success / total", where total indicates the number of different scenarios for the attack setting (i.e., different source class, target class, and trigger combinations), and a run is successful when the defense achieves a TMR below 1%. We note that the clean accuracy after running ISPL+B is around 91% on CIFAR-10 and 94% on GTSRB, compared to around 93% and 94%, respectively, when training on the full, unpoisoned datasets. These results indicate that our approach succeeds in defending against various backdoor poisoning attacks, particularly for the standard settings of ϵ = 5, 10 in the literature (Tran et al., 2018), at a small cost in clean accuracy. Appendix D contains the unabridged results.

Comparison with Existing Approaches. Figure 3 compares the performance of our defense with three existing defenses, namely Spectral Signatures (Tran et al., 2018), Iterative Trimmed Loss Minimization (Shen & Sanghavi, 2019), and Activation Clustering (Chen et al., 2018), on the CIFAR-10 DLBD (1-to-1) scenario. ISPL+B outperforms all three baselines in terms of reducing the TMR. Table 2 in Appendix D contains the detailed results of this experiment.

5. RELATED WORK

Many prior works propose methods for defending against backdoor attacks on neural networks. One common strategy, which can be viewed as a type of deep clustering (Zhou et al., 2022), is to first fit a deep network to the poisoned distribution, then partition on the learned representations. The Activation Clustering defense (Chen et al., 2018) runs k-means clustering (k = 2) on the activation patterns, then discards the smaller cluster. Tran et al. (2018) propose a defense based on the assumption that clean and poisoned data are separated by their top eigenvalue in the spectrum of the activations. TRIM (Jagielski et al., 2021) and ITLM (Shen & Sanghavi, 2019) iteratively train on a subset of the data created by removing samples with high loss. However, each iteration removes the same small number of samples; both direct comparisons with ITLM and our ablation studies suggest that the dynamic resizing in ISPL is crucial to identifying the poison. Shan et al. (2022) are motivated by a similar intuition to our compatibility property; however, their formalization differs significantly from ours, and they present not a defense but a forensics technique, which can identify the remaining poison only given access to a known poisoned input. Finally, several works provide certified guarantees against data poisoning attacks (Steinhardt et al., 2017; Jia et al., 2022); in exchange for the stronger guarantee, these works defend only a small fraction of the dataset, e.g., the state of the art certifies 16% of CIFAR-10 against 50 poisoned images (Wang et al., 2022a). We refer the reader to Wang et al. (2022b) for a comprehensive survey of data poisoning defenses. Appendix E contains a discussion of other related works.

6. CONCLUSION

Backdoor data poisoning attacks on deep neural networks are an emerging class of threats in the growing landscape of deployed machine learning applications. We present a new, state-of-the-art defense against backdoor attacks by leveraging a novel clustering mechanism based on an incompatibility property of the clean and poisoned data. Because our technique works with any class of trained models, we anticipate that our tools may be applied to a wide range of datasets and training objectives beyond data poisoning. More broadly, this work develops an underexplored perspective based on the interaction between the data and training process. 

A DEFERRED PROOFS

Proof of Theorem 3.2. First we give a construction for dimension N = 2. We define the training distribution as the mixture D = (D_+ + D_-)/2, where D_+ and D_- are both distributed arbitrarily within a ball of radius 1 (e.g., as a truncated Gaussian); D_+ is centered at (10, 0) with labels +1 and D_- is centered at (-10, 0) with labels -1. The adversary chooses the perturbation function τ which flips the label during training and sets the last entry to 10 if the original label was +1, and to -10 if the original label was -1. Note that τ is bounded in the sense that as N → ∞, the (expected) magnitude of a data point can grow as O(√N) while the magnitude of the perturbation is O(1). Figure 4 depicts the construction.

Clearly, one sample from each clean class is sufficient to learn a maximum margin classifier (MMC) achieving zero population risk, i.e., the MMC will be close to w = (1, 0), b = 0. Next, we show that the attacker succeeds under the threat model in Section 3.1. Recall that the attacker selects 1/4 of the positive samples and 1/4 of the negative samples to poison. Given a poisoned dataset containing at least one example of each class (both clean and poisoned), the classifier given by w = (1, -2), b = 0 achieves perfect accuracy on both the clean data and the poisoned data, and hence the MMC also has a targeted misclassification rate of 1.

We now show that this setup satisfies the incompatibility property defined in Section 3.2. We begin by showing that the poisoned data is not compatible with the clean data. Let C be any set of clean data, and let P be any set of poisoned data. We have two cases. First, if C contains samples of both classes, then by definition, any subset C′ of C in the measurement of the self-expansion error will also include at least one sample of each clean class. Hence, the MMC learned from C′ will be close to w = (1, 0), b = 0, which achieves perfect accuracy on all clean data.
On the other hand, if C contains only samples of the positive class, the MMC is w = (0, 0), b = +1, and again all samples in C are correctly classified. The same argument works when C contains only samples from the negative class. Thus, in neither case can poisoned data improve the self-expansion error of C. As the exact same line of reasoning proves incompatibility of the clean data with respect to the poisoned data, we omit it. Therefore the clean and poisoned data are mutually completely incompatible. To generalize to N > 2, we can simply embed D in the first two dimensions of the space. The entire construction can also be scaled, translated, and rotated without affecting any of the conclusions.

Before moving on to the proof of Theorem 2.3, we first prove a useful lemma:

Lemma A.1. Let A and B be arbitrary sets, and define S = A ∪ B. Then for all α,

    |S| ϵ(S; α) = |A| ϵ(A|B; α) + |B| ϵ(B|A; α)    (9)

with the convention that ϵ(∅|T; α) = 0 for all T and α.

Proof. Notice that all the expansion errors sample from the same training set S = A ∪ B. The lemma then follows from the linearity of the unnormalized empirical risk in the test set, i.e.,

    |S| ϵ(S; α) = |S| E_{S′∼S}[R_emp(A(S′); S)]    (10)
    = |S| E_{S′∼S}[|S|^{-1} Σ_{(x_i, y_i) ∈ S} L(f_{A(S′)}(x_i), y_i)]    (11)
    = E_{S′∼S}[Σ_{(x_i, y_i) ∈ A} L(f_{A(S′)}(x_i), y_i) + Σ_{(x_i, y_i) ∈ B} L(f_{A(S′)}(x_i), y_i)]    (12)
    = E_{S′∼S}[|A| R_emp(A(S′); A)] + E_{S′∼S}[|B| R_emp(A(S′); B)]    (13)
    = |A| ϵ(A|B; α) + |B| ϵ(B|A; α).

Proof of Theorem 2.3. Suppose for contradiction that S*_min intersects both A and B, and write U := S*_min ∩ A and V := S*_min ∩ B (both nonempty). By Lemma A.1 and the complete incompatibility of A and B,

    |S*_min| ϵ* = |U| ϵ(U|V; α) + |V| ϵ(V|U; α) ≥ |U| ϵ(U; α) + |V| ϵ(V; α).

As |U| + |V| = |S*_min|, it follows that at least one of ϵ(U; α) or ϵ(V; α) is at most ϵ*, which contradicts the optimality of S*_min (as it was assumed to be a smallest set achieving ϵ*). Now consider the case when at least one of the incompatibilities is strict. Let S* be any set in S*, suppose for contradiction that it intersects both A and B, and define U and V analogously. Using the definition of strict incompatibility, we get that |S*| ϵ* > |U| ϵ(U; α) + |V| ϵ(V; α).
Hence, at least one of ϵ(U; α) or ϵ(V; α) is strictly less than ϵ*, which again contradicts the optimality of S*.

Proof of Theorem 3.4. Let S_i and S_j be a clean component and a poison component, respectively. By the second inequality in Property 3.3, we have that R_emp(A(S_i); S_j) ≥ 1/2. Thus S_i votes False on S_j. Conversely, if both S_i and S_j are clean components, then R_emp(A(S_i); S_j) < 1/2, so S_i votes True on S_j. Putting these together and using the fact that at least half of the data is clean, we find that the poisoned components have weighted vote strictly less than |S|/2, while the clean components have weighted vote strictly greater than |S|/2. As all the clean components vote correctly, using the weighted majority correctly identifies all the clean and poisoned components as claimed.

Proof of Proposition 2.4. Recall that α = 1 and η = 0. We prove the statement in two steps. First, we show that

F(θ_{t+1}, S_t; β_t) ≤ F(θ_t, S_t; β_t).

The inequality follows from the optimization on Line 5 in Algorithm 1, which sets θ_{t+1} to the empirical risk minimizer of the set S_t. Next, we claim that

F(θ_{t+1}, S_{t+1}; β_{t+1}) ≤ F(θ_{t+1}, S_t; β_t).    (19)

Since |L(·, ·)| ≤ 1, the optimal size of the set S_t is |S_t| = β_t |S|. Since β_t is decreasing, we have that |S_{t+1}| ≤ |S_t|. Thus the number of elements in the trimmed empirical loss is non-increasing (Line 7, Algorithm 1).

Combining the two inequalities shows that the objective function is decreasing in t. Since F(θ_t, S_t; β_t) is a decreasing sequence bounded from below by zero, the monotone convergence theorem gives the second result.
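As a sanity check, the two-dimensional construction from the proof of Theorem 3.2 can be verified numerically at the class centers. This is a minimal sketch with our own (illustrative) function names; it confirms that w = (1, -2), b = 0 fits both clean and poisoned points, while the clean maximum margin classifier w = (1, 0), b = 0 misclassifies every poisoned point.

```python
import numpy as np

def tau_train(x, y):
    """Attacker's perturbation from the proof: flip the label and set the
    last coordinate to +10 (original label +1) or -10 (original label -1)."""
    x = x.copy()
    x[-1] = 10.0 if y == +1 else -10.0
    return x, -y

# Class centers of D+ and D- in the N = 2 construction.
clean = [(np.array([10.0, 0.0]), +1), (np.array([-10.0, 0.0]), -1)]
poisoned = [tau_train(x, y) for x, y in clean]

# The classifier w = (1, -2), b = 0 fits clean and poisoned data perfectly...
w_poisoned = np.array([1.0, -2.0])
assert all(np.sign(w_poisoned @ x) == y for x, y in clean + poisoned)

# ...while the clean MMC w = (1, 0), b = 0 assigns every poisoned point its
# original (unflipped) label, i.e., misclassifies all of the poison.
w_clean = np.array([1.0, 0.0])
assert all(np.sign(w_clean @ x) != y for x, y in poisoned)
```

The same check passes for any points within the unit balls around the centers, since the margins of both classifiers far exceed the ball radius.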

B DATASET CONSTRUCTION DETAILS

Our implementation of the 1-to-1 dirty label backdoor (DLBD) adversary follows the threat model described in Gu et al. (2017). For evaluation, we use the same dataset (CIFAR-10 (Krizhevsky & Hinton, 2009)) and setup for our experiments as the Spectral Signatures work (Tran et al., 2018). Each scenario has a single source and target class, and we use the same (source, target) pairs as in Tran et al. (2018): (airplane, bird), (automobile, cat), (bird, dog), (cat, dog), (cat, horse), (horse, deer), (ship, frog), (truck, bird).

The perturbation function τ overlays a small pattern on the image at a fixed location. All patterns fit within a 3x3 pixel box. To generate a perturbation, we choose a shape (L-shape, X-shape, or pixel) uniformly at random. The (x, y) coordinates of the perturbation are randomly selected to guarantee that the entire shape is visible before data augmentation (e.g., the pixel-based perturbation can be placed anywhere within the 32x32 image, but the X-shape is larger and so must be centered in a 30x30 region, one pixel away from the border). The color of the perturbation is also selected uniformly at random, with each of the (R, G, B) coordinates ranging from 0 to 255. Finally, we randomly select an ϵ = 5, 10, or 20% fraction of the source class, apply the perturbation by replacing the pixels in the corresponding locations with the selected shape and color, then relabel the poisoned images as the target class.

For the all-to-1 case, for a given target class, we poison an equal proportion of every non-target class such that the total amount of poison is ϵ times the size of the source class. For the all-to-all case, we select a cyclic permutation of the classes, then poison an ϵ percentage of each class.

The construction of the WATERMARK dataset is the same, except that we instead use 8x8 images depicting a peace sign, the letter A, 3 bullet holes, or a colorful patch, taken from Nicolae et al. (2019).
Each trigger has a transparent background, and is blended into the upper left corner of the original image with 80% opacity (where 0% opacity would return the original image without the trigger, and 100% opacity would superimpose the trigger on top of the original image).

For the CLBD dataset, we used the official datasets provided by Turner et al. (2018). There are three datasets in total, one for each perturbation type: ℓ2, ℓ∞, and GAN. For the ℓ2- and ℓ∞-based attacks, each image is perturbed independently to maximize the loss with respect to a reference model trained on clean data; the perturbation size is bounded in the respective norm. For the GAN-based attack, for each image to be poisoned, a perceptually similar target image in another class is identified using the latent space of the GAN, and the perturbed image is an interpolation of the original and target images in the latent space. In all three cases, a 3x3 checkerboard patch is then placed in each of the 4 corners of the image. The label of the image is not changed. At test time, only the checkerboard patches are placed on the image. We refer the reader to Turner et al. (2018) for more details.

For the GTSRB dataset, we use the same construction of the DLBD and WATERMARK attacks as on CIFAR-10. As the CLBD dataset was only created for CIFAR-10, we did not evaluate the CLBD attack on GTSRB.
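The two poisoning constructions above can be sketched as follows. This is a minimal illustration with our own (hypothetical) function names, not the exact implementation: the first function stamps a random solid patch on an ϵ fraction of the source class and relabels (DLBD), and the second alpha-blends a trigger with a transparency mask into the upper-left corner (WATERMARK).

```python
import numpy as np

rng = np.random.default_rng(0)

def poison_dlbd(images, labels, source, target, eps, patch=3):
    """Stamp a random solid patch on an eps fraction of the source class
    and relabel those images as the target class (1-to-1 DLBD sketch)."""
    images, labels = images.copy(), labels.copy()
    src = np.flatnonzero(labels == source)
    chosen = rng.choice(src, size=int(eps * len(src)), replace=False)
    h, w = images.shape[1:3]
    # Random location such that the whole patch stays inside the image,
    # and a uniformly random RGB color.
    y0, x0 = rng.integers(0, h - patch + 1), rng.integers(0, w - patch + 1)
    color = rng.integers(0, 256, size=3)
    images[chosen, y0:y0 + patch, x0:x0 + patch] = color
    labels[chosen] = target
    return images, labels, chosen

def blend_watermark(image, trigger, alpha_mask, opacity=0.8):
    """Blend a trigger (with transparency mask) into the upper-left corner:
    opacity 0 returns the original image, opacity 1 pastes the trigger."""
    out = image.astype(np.float32)
    h, w = trigger.shape[:2]
    m = alpha_mask[..., None] * opacity   # blend only non-transparent pixels
    out[:h, :w] = (1.0 - m) * out[:h, :w] + m * trigger
    return out.astype(image.dtype)

# Tiny synthetic example: 20 "images" of 32x32x3, half in the source class.
imgs = np.zeros((20, 32, 32, 3), dtype=np.uint8)
labs = np.array([0] * 10 + [1] * 10)
p_imgs, p_labs, idx = poison_dlbd(imgs, labs, source=0, target=1, eps=0.2)
assert len(idx) == 2 and (p_labs[idx] == 1).all()
```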

C DEFENSE SETUPS AND HYPERPARAMETERS

ISPL + Boosting (this work). For our defense, we use the set of hyperparameters described here unless otherwise noted. For CIFAR-10, we run 8 rounds of ISPL, each of which returns a component consisting of roughly 12% of the total samples. For GTSRB, we run 4 rounds of ISPL so that each component contains 24% of the total samples. Let p be the target percentage of samples over the remaining samples (e.g., for CIFAR-10, p ≈ 1/(8 - i + 1) in the i-th round). Then the number of iterations N is set to 2 + min(3, 1/p). β starts at 3p in the first iteration, then drops linearly to its final value of p over the next 2 iterations. When trimming the training set, we additionally include the top p/8 samples per class to prevent the network from collapsing to a trivial solution. For the learning procedure A, we use standard SGD, trained for 4 epochs per iteration, with a warm-up in the first iteration of 8 epochs. The expansion factor α is set to 1/4, and the momentum factor η is set to 0.9.

We run ISPL 3 times to generate 24 weak learners for CIFAR-10, and 12 weak learners for GTSRB. For the boosting framework, each weak learner is trained for 40 epochs on its respective subset, then votes on a per-sample basis. A sample is preserved if the modal vote equals the given label (with ties broken randomly), or if the sample is in the lower half of its class by loss.

For CIFAR-10, we also include a final self-training step by training a fresh model for 100 epochs on the recovered samples. The main idea is that a model fit to the full "clean" training data can be used to test the excluded training data, thereby recovering additional consistent data which may have been originally excluded because the weak learners were fit to a small subset of data for fewer epochs. However, it may take several repetitions of training a model from scratch before this self-training process no longer identifies new samples to recover.
Therefore, we use a simple self-paced learning algorithm to dynamically adjust the samples during training to limit the self-training to a single iteration. More explicitly, we start with the "clean" samples as returned by the boosting framework. Every 5 epochs, we update the training set to be the samples whose labels agree with the model's current predictions. Due to the relative frequency with which we resample the training set, we smooth the predictions by a momentum factor of 0.8 so that the training process is less noisy. The samples used for training in the last epoch are returned as the defended dataset. In our experiments, this process decreases the false positive rate (and thus increases the clean accuracy) but does not materially affect the false negative rate (nor the targeted misclassification rate). We did not use self-training for GTSRB as we found the clean accuracy was sufficiently high.

Spectral Signatures. We use the authors' official implementation of the Spectral Signatures (SS) defense (Tran et al., 2018), available on GitHub, except that we replace the training procedure with PyTorch (instead of TensorFlow 1.x as in the authors' original implementation). The authors suggest removing 1.5 times the maximum expected amount of poison from each class for the defense. We remove 20% of each class for ϵ = 5, 10% (to match the procedure in their paper) and 30% of each class for ϵ = 20%. In selecting the layer for the activations, for the ResNet32 architecture we use the input to the third block of the third layer (taken from the SS defense authors' public implementation), and for the PreActResNet18 architecture we use the input to the first block of the fourth layer (which was found empirically to remove the most poison on the first set of scenarios). We note that the authors indicate the defense should be fairly successful at any of the later layers of the network.

Iterative Trimmed Loss Minimization.
The Iterative Trimmed Loss Minimization (ITLM) defense (Shen & Sanghavi, 2019) consists of an iterative procedure. Given a setting 0 < α ≤ 1, one first trains a model for a number of epochs. Then the α fraction of samples with the lowest loss is retained for the next iteration. This process is repeated several times, with a fresh model beginning each iteration. The defended dataset is the α fraction of samples with the lowest loss after the last iteration. For the backdoor data poisoning experiments on CIFAR-10, the authors use 80 epochs for the first round of training, then 40 epochs thereafter; they also set α = 98% for ϵ = 5%, and do not test at other values of ϵ. We use the same settings, and scale α linearly with ϵ, i.e., α = 96% for ϵ = 10% and α = 92% for ϵ = 20%.

Activation Clustering. The Activation Clustering (AC) defense (Chen et al., 2018) has an actively maintained official implementation in the Adversarial Robustness Toolbox (ART) (Nicolae et al., 2019), an open-source collection of tools for security in machine learning. We use the official implementation with the default parameter values in ART v1.6.2. In selecting the layer for the activations, we used the same layers as for Spectral Signatures.

Models. For the ResNet32 (He et al., 2016a) model, we use vanilla SGD with learning rate 0.1, momentum 0.9, and weight decay 1e-4. For the final dataset, we train for 100 epochs (unless otherwise noted) and drop the learning rate by 10x at epochs 75 and 90. Using these parameters, we achieve around 94.5% accuracy on GTSRB when trained and tested with clean data. For the PreActResNet18 (He et al., 2016b) model, we use vanilla SGD with learning rate 0.02, momentum 0.9, and weight decay 5e-4. For the final dataset, we train for 100 epochs (unless otherwise noted) and drop the learning rate by 10x at epochs 50, 75, and 90. Using these parameters, we achieve around 93.0% accuracy on CIFAR-10 when trained and tested with clean data.
For the ResNet56 (He et al., 2016b) model, we use vanilla SGD with learning rate 0.05, momentum 0.9, and weight decay 5e-4. For the final dataset, we train for 100 epochs (unless otherwise noted) and drop the learning rate by 10x at epochs 75, 90, 95, and 98. Using these parameters, we achieve around 91.0% accuracy on Imagenette when trained and tested with clean data.
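The core ITLM loop described earlier in this appendix can be sketched as follows; `losses_fn` abstracts the per-round training and scoring step, and all names are our own illustrative choices rather than the authors' implementation.

```python
import numpy as np

def itlm(losses_fn, n, alpha, rounds=3):
    """Sketch of Iterative Trimmed Loss Minimization: in each round, train a
    fresh model on the currently retained samples (abstracted here by
    `losses_fn`, which returns a per-sample loss vector for the retained
    indices), then keep the alpha fraction of samples with the lowest loss."""
    keep = np.arange(n)
    for _ in range(rounds):
        losses = losses_fn(keep)              # train + score retained set
        k = int(alpha * len(keep))
        keep = keep[np.argsort(losses)[:k]]   # lowest-loss samples survive
    return keep

# Toy check: samples 0-7 are "clean" (low loss), 8-9 are "poison" (high loss).
fixed = np.array([0.1] * 8 + [5.0] * 2)
kept = itlm(lambda idx: fixed[idx], n=10, alpha=0.9, rounds=2)
assert 8 not in kept and 9 not in kept
```

The intuition captured here is that poisoned samples tend to have higher loss early in training, so repeatedly trimming the highest-loss fraction removes them; the ablation in Appendix D.2.5 discusses why trimming a single large subset in this way underperforms ISPL+B.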

D ADDITIONAL EXPERIMENTAL RESULTS

This section contains additional experimental results to complement the main text. Appendix D.1 expands upon the empirical evaluations of the compatibility property from Section 4.1. Appendix D.2 reports comprehensive results from the evaluations of the ISPL+B defense reported in Section 4.2. Appendix D.2 also includes a detailed breakdown of the performance for the three defenses (Spectral Signatures, Activation Clustering, and Iterative Trimmed Loss Minimization) we use as baselines in our comparisons. Appendix D.2.1 reports results for our defense against the Sleeper Agent attack (Souri et al., 2022) , and Appendix D.2.2 reports results for the DLBD attack on a high-resolution dataset. Appendix D.2.3 evaluates our defense against an adaptive attack that is given white-box access to the ISPL subroutine, and selects which elements to poison based on the behavior of ISPL on clean data. Our results show that ISPL+B maintains excellent performance against the adaptive attacker, defending against 20 out of 24 scenarios. Appendix D.2.4 reports results for the CIFAR-10 DLBD attack after training for 200 epochs on both the PreActResNet18 and ResNet32 architectures, to test the robustness of our results to the training setup. Finally, to evaluate the sensitivity of ISPL to the main hyperparameters α and β, Appendix D.2.5 compares the performance of ISPL on a selection of attack scenarios across a range of α and β. Our main finding is that performance remains high for a range of choices of α and β. All results reported use the median over 3 random seeds, unless otherwise noted.

D.1 INCOMPATIBILITY EVALUATION

Figures 5-9 display aggregated results of the empirical study of incompatibility for the CLBD, DLBD, and WATERMARK attacks on CIFAR-10. Plots on the left side measure the compatibility of poisoned with clean data (i.e., ϵ(C; α) - ϵ(C|P; α)) and plots on the right side measure the compatibility of clean with poisoned data (i.e., ϵ(P; α) - ϵ(P|C; α)). In all cases, the trend in the first plot is decreasing, which suggests that the gap increases in magnitude as the size of the incompatible data grows.

Additionally, the clean label scenarios all exhibit the behavior that the source class accuracy increases as the amount of poison (and poisoned accuracy) increases, while the non-source accuracy drops. These attacks share the characteristic that the poisoned data are "harder" versions of the source class, in the sense that they are perturbed toward instances of other classes. In the case of the GAN-based attack, the perturbations occur in the latent space of a GAN trained on clean data (e.g., the attacker identifies a bird and an airplane that are close in the latent space, linearly interpolates the latent representation of the airplane toward the latent representation of the bird, then uses the generative network to convert the perturbed representation into an image of a "bird-like airplane"). In the case of the ℓ2 and ℓ∞ bounds, the attacker uses the Projected Gradient Descent (PGD) algorithm commonly used for adversarial training (Kurakin et al., 2016; Madry et al., 2017) to identify an image within a norm-bounded neighborhood of the original image that maximizes the loss of a trained network, i.e., the perturbed image is close in pixel space to the original image, but is misclassified by the trained network as a different class. We therefore attribute the simultaneous increase in source class accuracy and decrease in non-source class accuracies to a similar mechanism for all the clean label cases.
Conversely, when measuring the compatibility of poisoned data with clean data (plots on the left) for the DLBD and WATERMARK attacks, we see that, as the amount of poisoned data increases, the source class accuracy drops. The implication is that the poisoned data causes the classifier to misclassify clean data by presenting similar (both quantitatively in terms of the distance between the poisoned and clean distributions, as well as perceptually in most settings) instances of clean source data that are mislabelled as the target class, thus complicating the learning of the clean source class. Figures 10-14 plot the scenarios individually. We note that incompatibility holds generally across the range of scenarios, with only small deviations to the contrary. 
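The incompatibility measurements in this section can be sketched as follows. The `train_and_eval` callback abstracts the learner A and the 0-1 risk computation, and all names are our own illustrative choices under that assumption.

```python
import numpy as np

def expansion_error(train_and_eval, S, alpha, n_trials=5, seed=0):
    """Estimate the self-expansion error eps(S; alpha): repeatedly subsample
    an alpha fraction S' of S, run the learner on S', and measure the 0-1
    risk of the resulting model on all of S; return the average over trials.
    `train_and_eval(train_idx, test_idx)` returns the empirical risk of the
    model trained on train_idx, evaluated on test_idx."""
    rng = np.random.default_rng(seed)
    S = np.asarray(S)
    k = max(1, int(alpha * len(S)))
    risks = [train_and_eval(rng.choice(S, size=k, replace=False), S)
             for _ in range(n_trials)]
    return float(np.mean(risks))

# The compatibility of poisoned data with clean data C is then the gap
# eps(C; alpha) - eps(C | P; alpha), where the conditional error trains on
# subsamples of C (union) P but still evaluates on C alone.
```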

D.2 DEFENSE EVALUATION

This section includes additional experimental results concerning the performance of our defense. Tables 2-8 display the unabridged results of our main evaluation (condensed in the main text as Table 1). We note that even though we have selected 1% as the threshold for a "successful" defense, in many of the failure cases, ISPL+B still achieves relatively low TMR. For instance, raising the threshold to 5% increases the number of successful scenarios to 142 (compared to 134 at 1%, out of 165 total). Table 2 also includes results of the three baseline defenses for the standard 1-to-1 DLBD attack on CIFAR-10. In particular, we observe that our clean accuracy is comparable to (or even slightly better than) that of Activation Clustering (AC), the next best defense, despite our defense achieving significantly lower TMR in our evaluations.

Table 4: Performance on the CIFAR-10 DLBD scenario (all-to-all) using the PreActResNet18 architecture. The S → T column lists the offset of the cyclic permutation (i.e., 2 means that source class 0 is poisoned as target class 2, source class 1 is poisoned as target class 3, etc.). ϵ refers to the percentage of the source class which is poisoned.

D.2.1 SLEEPER AGENT ATTACK

We first used the official repository released by the authors to generate 24 poisoned datasets with the default settings for CIFAR-10 using different random seeds, targeting a ResNet32 as the victim architecture.¹ We observed an attack success rate (ASR) of 49.1% (±38.3%) over the generated datasets when training a ResNet32 model with the attacker's original training hyperparameters. We then transferred the poisoned dataset to the setting of our defense (training a ResNet32, using the same training and defense hyperparameters as in our main experiments for ResNet32 on CIFAR-10). Our defense delivers an ASR of 4.3% (±4.1%), compared to an oracle, which achieves an ASR of 3.0% (±4.8%).
However, we also observed that the attack only achieves a success rate of 4.5% (±4.3%) on an undefended model in our setting. We attribute this anomaly (between the ASR for an undefended model in our setting and the ASR reported by the attacker) to the difference in training hyperparameters, despite using the same victim architecture. In all three cases (defended, undefended, and oracle) the clean accuracy is consistently between 89% and 90%.

Note that in our threat model, the training algorithm A is fixed, and our defense is designed around the interaction between the data and A. The above scenario is thus explicitly not the intended use case for our algorithm. Nevertheless, our defense still achieves an ASR of 14.1% (±15.6%), with a clean accuracy of 87.5% (±0.6%), compared to the original ASR of 49.1%. An oracle achieves an ASR of 2.4% (±2.2%). These results thus indicate that our defense successfully defends against the Sleeper Agent attack in our threat model (where the defense is integrated in the intended training pipeline), while also remaining robust to an evaluation under the attacker's original training setup.

D.2.3 ADAPTIVE ATTACKER

The attacks used in the main experiments are oblivious to the presence of a defender when selecting the data to poison. In this section, we devise a strong adaptive attacker with white-box access to the ISPL algorithm when constructing the poisoned dataset. The success of the defense rests crucially on the partition of training data returned by the ISPL algorithm. In order for the boosting phase to succeed, the partition should consist of homogeneous sets, i.e., the poison must be concentrated in a small number of sets within the partition; hence, the adaptive attacker's objective is to distribute the poisoned samples across multiple sets. In particular, rather than randomly selecting an ϵ fraction of the source class to poison, an adaptive attacker with white-box access to the defense can first run the ISPL algorithm (using the same parameters) on the clean dataset to create a partition of the dataset, then randomly select an ϵ fraction from each component to poison. The idea is that if the ISPL algorithm returns a similar partition, then the poison will still be distributed across multiple sets, thereby breaking the defense.

Table 10 presents the results of our defense against the strong adaptive attacker described above. Against the adaptive attack, our defense still succeeds in 20/24 of the scenarios (versus 22/24 for the standard oblivious attack). These results suggest the ISPL algorithm remains robust even to white-box adaptive attacks.

As prior works (Schwarzschild et al., 2021) have observed that results in data poisoning, particularly with DNNs, can vary significantly with the training setup, we additionally ran our defense (along with the three baseline defenses) on the standard CIFAR-10 DLBD attack using both the PreActResNet18 and ResNet32 models, and report results after training for 200 epochs (instead of 100, as in our main experiments). Tables 11 and 12 report the results of these experiments.
We observe that the defense success rates are largely consistent with those in the main experiments across all defenses; the biggest change is that the Spectral Signatures defense successfully defends against 9 scenarios instead of just 1 (out of 24 total). ISPL+B still achieves the best performance in terms of reducing the targeted misclassification rate below 1% (with a 2-3% drop in clean accuracy).

We next conduct some ablation studies to better understand the effects of the two main hyperparameters in Algorithm 1: the expansion factor α and the subset size β. Computationally, a larger β means fewer components and fewer outer iterations of ISPL (and is thus more efficient); in our main experiments, we use β = 1/8 and run ISPL 8 times in sequence to generate 8 components. Additionally, when estimating the self-expansion error, a larger α leads to lower-variance estimates of the expansion error, but less effective measures of generalization; hence, we should prefer taking α as large as possible to produce better estimates (until memorization effects take over). Our experiments use both α = 1/2 (for GTSRB, due to the imbalanced class sizes) and α = 1/4 (for CIFAR-10).

Tables 13 and 14 present our results from the ablation studies, conducted on a subset of the CIFAR-10 DLBD and CLBD scenarios. We do not include the self-training step after ISPL as in the main experiments, in order to better highlight the trends. In particular, without self-training, we expect both lower clean accuracies and higher targeted misclassification rates due to not recovering all the compatible clean data. Our main finding is that our method is quite robust to both the expansion factor α and subset size β. In particular, β ∈ [1/16, 1/8] and α ∈ [1/4, 1] all yield strong defenses with less than 10% targeted misclassification. As expected, increasing β leads to a stronger defense but lower clean accuracies.
In particular, β = 1 (which identifies only a single subset consisting of 96% of the original dataset) fails in all settings of α; this is the setting for which ISPL+B is conceptually the most similar to the ITLM defense, which offers a possible explanation for ITLM's poor performance in our experiments. Additionally, α = 1 avoids memorization, mainly because we do not train until convergence within the ISPL loop. Finally, we note that α = 1/2 appears to achieve a modest optimum, which is particularly noticeable when paired with smaller β.

Deep Clustering. The literature on clustering is vast; we refer the reader to Xu & Tian (2015) for a survey of classical techniques. Classical techniques often do not scale to large, high-dimensional datasets, so recent years have seen increasing development in techniques for deep clustering. Such techniques generally use deep learning to embed some high-dimensional input data into a low-dimensional space (such as the latent space of a generative adversarial network (Mukherjee et al., 2019) or variational autoencoder (Yang et al., 2019)), then perform a classical clustering on the embeddings (Zhou et al., 2022). The structure revealed by the clustering is thus tied to the objective used to learn the low-dimensional embedding, e.g., the reconstruction loss. In contrast, our clustering mechanism is based on the interaction between the dataset and the training process, and therefore produces clusters based on the incompatibility of subsets with respect to a specific training objective (and can also be applied to any class of learned model).

Compatibility. Our separation results in the context of data poisoning can be viewed as clustering by exploiting weak supervision in the form of (possibly poisoned) class labels. Balcan et al. (2005) introduce a method called "co-training", and show that under a similar compatibility property, learners fit independently to two different "views" of the data can supervise each other to improve the joint performance.
However, they (and similar works) assume access to a small set of trusted labels not present in our setting, and do not consider the case of malicious training data.

Boosting. To identify which components belong together and recover the primary distribution, we use a method that can be interpreted as a basic form of boosting, in which an ensemble of weak learners is aggregated into a stronger learner. For instance, the seminal boosting algorithm AdaBoost (Freund et al., 1996) adaptively reweights the training set so that subsequent weak learners focus on samples with poor performance. The use of boosting methods for clustering has also been explored previously (Frossyniotis et al., 2004; Zhuowen Tu, 2005; Liu et al., 2007); as far as we know, we are the first to use an incompatibility objective to cluster data.

Self-Paced Learning. Self-paced learning (SPL) (Kumar et al., 2010) is a type of curriculum learning (CL) (Bengio et al., 2009) that dynamically creates a curriculum by including samples of increasing difficulty based on the losses of the partially trained model. Prior works generally apply SPL to improve convergence on noisy datasets (Meng et al., 2016; Jiang et al., 2018; Zhang et al., 2020). In contrast, ISPL discards progressively easier samples to identify a subset with good expansion; to the best of our knowledge, we are also the first to apply CL ideas to defend against backdoor attacks.
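The per-sample voting rule our boosting framework uses (described in Appendix C: a sample is kept if the modal vote of the weak learners matches its label, with ties broken randomly, or if it lies in the lower half of its class by loss) can be sketched as follows; the function and argument names are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def keep_sample(votes, label, in_lower_half_by_loss):
    """Per-sample boosting vote: keep the sample if the modal vote across
    the weak learners equals its given label (ties broken randomly), or if
    it lies in the lower half of its class by loss."""
    counts = np.bincount(np.asarray(votes))
    winners = np.flatnonzero(counts == counts.max())
    modal = rng.choice(winners)              # random tie-break
    return bool(modal == label) or in_lower_half_by_loss

# A sample whose label matches the modal vote is kept; one that most weak
# learners relabel is dropped unless its loss is low within its class.
assert keep_sample([1, 1, 2], label=1, in_lower_half_by_loss=False)
assert not keep_sample([2, 2, 1], label=1, in_lower_half_by_loss=False)
```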



¹ All results in this section are the mean over runs, with the standard error in parentheses, to match the presentation in Souri et al. (2022).



To begin, a set of n clean training samples D is drawn i.i.d. from D. After observing D, the attacker selects at most n/2 - 1 samples A ⊆ D (leaving behind at least n/2 clean samples C = D \ A) and fixes a pair of perturbation functions τ_train : X × Y → X × Y (the attacker uses τ_train to construct the poisoned dataset for training, changing the image and potentially the label of each poisoned element) and τ_test : X → X (the attacker uses τ_test to construct poisoned test images without changing the ground-truth label of the poisoned image). The attacker then poisons the samples in A to obtain poisoned samples P = {τ_train(a) | a ∈ A}. The threat model places no restrictions on τ_train, but requires that τ_test should not affect the predictions of a network trained on C. The final poisoned training set is C ∪ P.
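The poisoning step of the threat model above can be sketched as follows; `poison_dataset` and `tau_train` are our own illustrative names, and the attacker's selection of A is abstracted here as a random subset (real attackers may select A adversarially).

```python
import numpy as np

rng = np.random.default_rng(0)

def poison_dataset(D, tau_train):
    """Threat-model sketch: the attacker observes the clean samples D,
    selects at most n/2 - 1 of them (the set A), perturbs them with
    tau_train to obtain P, and releases C + P, where C = D minus A."""
    n = len(D)
    k = int(rng.integers(0, n // 2))                 # |A| <= n/2 - 1
    idx = set(rng.choice(n, size=k, replace=False).tolist())
    C = [D[i] for i in range(n) if i not in idx]     # untouched clean data
    P = [tau_train(x, y) for i, (x, y) in enumerate(D) if i in idx]
    return C + P

# Example: a label-flipping tau_train on a toy dataset of (x, y) pairs.
D = [(float(i), +1 if i % 2 == 0 else -1) for i in range(10)]
poisoned = poison_dataset(D, lambda x, y: (x, -y))
assert len(poisoned) == len(D)
```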

Figure 1: Pairs of poisoned (top) and clean (bottom) samples for all datasets, selected for visible triggers. From left to right: CIFAR-10 with CLBD using GAN, ℓ 2 , and ℓ ∞ perturbations, DLBD, and WATERMARK attacks; and GTSRB with DLBD and WATERMARK attacks.

Figure 2: Results from incompatibility experiments on CIFAR-10, DLBD (left) and CLBD (right). Each subplot shows how accuracy changes over different subsets of data as the amount of poison increases. The top plot displays change in accuracy on clean training data, i.e., ϵ(C|P ; α) -ϵ(C; α); the next two plots separate the change in accuracy on C into source and non-source classes; the bottom plot shows change in accuracy on poisoned data P . All axes use linear scales.

Figure 3: Comparison with baseline defenses on the CIFAR-10 DLBD (1-to-1) scenario for ϵ = 5, 10, 20. Each plot displays TMRs (lower is better) over 24 runs. The orange lines indicate the median TMR, the boxes indicate the top and bottom quartile, and the whiskers cover the entire range of performance. The blue line displays the cutoff for success used in Table 1 (i.e., TMR < 1%).

Figure 4: The construction used in our proof of Theorem 3.2, for N = 2 dimensions. The balls depict the support of the clean training distribution (blue) and the poisoned data (red). The perturbation function τ is shown by the grey arrows. The lines are the ground truth maximum margin classifier (blue) and the maximum margin classifier when training with a mixture of clean and poisoned data (red). The red and blue arrows point in the direction of half-plane labelled by the corresponding classifier as the positive class.

Proof of Theorem 2.3. Assume first that the incompatibility is not strict. Define U = S*_min ∩ A and V = S*_min ∩ B. From Lemma A.1, we have that

|S*_min| ϵ* = |U| ϵ(U|V; α) + |V| ϵ(V|U; α).    (15)

Applying now the definition of incompatibility gives

|S*_min| ϵ* ≥ |U| ϵ(U; α) + |V| ϵ(V; α).

Figure 5: Aggregated compatibility results for the GAN-based CLBD attack on CIFAR-10.

Figure 6: Aggregated compatibility results for the ℓ 2 -based CLBD attack on CIFAR-10.

Figure 7: Aggregated compatibility results for the ℓ ∞ -based CLBD attack on CIFAR-10.

Figure 8: Aggregated compatibility results for the WATERMARK attack on CIFAR-10.

Figure 9: Aggregated compatibility results for the DLBD attack on CIFAR-10.

Figure 10: Full compatibility results for the GAN-based CLBD attack on CIFAR-10.

Figure 11: Full compatibility results for the ℓ 2 -based CLBD attack on CIFAR-10.

Figure 12: Full compatibility results for the ℓ ∞ -based CLBD attack on CIFAR-10.

Figure 13: Full compatibility results for the DLBD attack on CIFAR-10.

Figure 14: Full compatibility results for the WATERMARK attack on CIFAR-10.

We next evaluated the effectiveness of our defense when retraining with the attacker's training setup. Specifically, we took the poisoned training set generated by the attacker, then ran our defense using our training setup to output a new training set with the detected poisoned examples removed. We then used the attacker's training setup to train a new model on this new training set, and evaluated the new model on the poisoned test set.

Figure 15: Example clean (top) and poisoned (bottom) images from the train (left) and test (right) datasets, taken from one run of Sleeper Agent attack; the source class is horse, and the target class is dog (training with the poisoned dogs causes patched horses to be misclassified as dogs).

Summary of performance for ISPL+B across several datasets and poisoning strategies.



Performance on the CIFAR-10 DLBD scenario (1-to-1) using the PreActResNet18 architecture. The S / T column lists the CIFAR-10 source and target classes. ϵ refers to the percentage of the source class which is poisoned. For the remainder of the columns, the top level column headers

Performance on the CIFAR-10 DLBD scenario (all-to-1) using the PreActResNet18 architecture. The T column lists the target class (i.e., all poisoned images have label T). Each of the 9 source classes contain ϵ/9 percent poison.

Performance on the CIFAR-10 WATERMARK scenario (1-to-1) using the PreActResNet18 architecture. The S / T column lists the CIFAR-10 source and target classes. ϵ refers to the percentage of the source class which is poisoned.

Performance on the CIFAR-10 CLBD scenario using the PreActResNet18 architecture. The T column lists the target class (i.e., all poisoned images have label T). ϵ refers to the percentage of the target class which is poisoned.

Performance on the GTSRB DLBD scenario (1-to-1) using the ResNet32 architecture. The S / T column lists the GTSRB source and target classes. ϵ refers to the percentage of the source class which is poisoned.

Performance on the GTSRB WATERMARK scenario (1-to-1) using the ResNet32 architecture. The S / T column lists the GTSRB source and target classes. ϵ refers to the percentage of the source class which is poisoned.

The Sleeper Agent attack (Souri et al., 2022) is a clean label attack that uses gradient matching (Geiping et al., 2020) to create a poisoned training set. The objective of the attacker is to induce a model trained on the poisoned training set to misclassify patched versions of a source class as the target class. Figure 16 displays examples of clean and poisoned training and test data created by the Sleeper Agent attack, where the source class is horse and the target class is dog (training with the poisoned dogs causes patched horses to be misclassified as dogs). Similar to the CLBD attack, Sleeper Agent perturbs each image in the training set independently subject to an ℓ∞ bound; however, the Sleeper Agent attack is more subtle because it does not insert any patches during training. The Sleeper Agent attacker also differs from the DLBD, CLBD, and WATERMARK attacks in using an informed mechanism to select which subset of the training set D to poison (rather than poisoning a random subset). Finally, unlike prior gradient matching approaches, the Sleeper Agent attack is also intended to apply in the black-box setting, where the attacker does not have access to the victim model's architecture or training hyperparameters.

Performance on the Imagenette DLBD scenario (1-to-1) using the ResNet56 architecture. The S / T column lists the Imagenette source and target classes. ϵ refers to the percentage of the source class which is poisoned.

Performance on the CIFAR-10 DLBD scenario using the PreActResNet18 architecture with an adaptive attack. The S / T column lists the CIFAR-10 source and target classes. ϵ refers to the percentage of the source class which is poisoned.

Performance on the CIFAR-10 DLBD scenario (1-to-1) using the PreActResNet18 architecture, when training the final model for 200 epochs. The S / T column lists the CIFAR-10 source and target classes. ϵ refers to the percentage of the source class which is poisoned.

Performance on the CIFAR-10 DLBD scenario (1-to-1) using the ResNet32 architecture, when training the final model for 200 epochs. The S / T column lists the CIFAR-10 source and target classes. ϵ refers to the percentage of the source class which is poisoned.

Average clean accuracies after running ISPL on three scenarios of CIFAR-10 DLBD and CLBD using the PreActResNet18 architecture, with various settings of α and β.

Average targeted misclassification rates after running ISPL on three scenarios of CIFAR-10 DLBD and CLBD using the PreActResNet18 architecture, with various settings of α and β.

Deep Clustering. The literature on clustering is vast, and we refer the reader to Xu & Tian

ACKNOWLEDGMENTS

We gratefully acknowledge support from DARPA Grant HR001120C0015. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

APPENDIX

Published as a conference paper at ICLR 2023

D.2.2 IMAGENETTE DATASET

Imagenette (Howard, 2019) is a subset of the ImageNet dataset (Deng et al., 2009) consisting of 9,469 training and 3,925 test images over 10 classes. We downsample the 320x320 resolution version of Imagenette to 224x224 and use a ResNet56, which achieves around 91% accuracy on the clean dataset. We use a DLBD adversary that places a solid 5x5 patch of a random color at a random position on the image. We use this setting to probe the applicability of our technique to higher-resolution datasets. Figure 16 displays examples of clean and poisoned data from the Imagenette experiments. Table 9 reports the results of our defense on Imagenette, using the same defense hyperparameters as in the main CIFAR-10 experiments. The performance of our defense remains consistent, with 21 of the 24 scenarios defended to below 1% TMR and less than a 2% drop in clean accuracy.
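The DLBD patch adversary described above admits a simple sketch: stamp a solid square of a uniformly random color at a uniformly random location in the image. The helper below is hypothetical (our own naming, not the implementation from the repository) and operates on an HxWx3 uint8 array.

```python
import numpy as np

def apply_dlbd_patch(image, patch_size=5, rng=None):
    """Return a copy of an HxWx3 uint8 image with a solid patch_size x patch_size
    square of a random color placed at a random position (the DLBD trigger)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = image.shape
    # Sample a uniformly random color and a top-left corner that keeps the
    # patch fully inside the image.
    color = rng.integers(0, 256, size=3, dtype=np.uint8)
    top = rng.integers(0, h - patch_size + 1)
    left = rng.integers(0, w - patch_size + 1)
    poisoned = image.copy()
    poisoned[top:top + patch_size, left:left + patch_size] = color
    return poisoned
```

In a dirty label backdoor (DLBD) attack, the poisoned copy would additionally be relabeled with the target class before being injected into the training set.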

