REVISITING THE ASSUMPTION OF LATENT SEPARABILITY FOR BACKDOOR DEFENSES

Abstract

Recent studies have revealed that deep learning is susceptible to backdoor poisoning attacks. An adversary can embed a hidden backdoor into a model to manipulate its predictions by modifying only a small number of training samples, without controlling the training process. Currently, a tangible signature has been widely observed across a diverse set of backdoor poisoning attacks: models trained on a poisoned dataset tend to learn separable latent representations for poison and clean samples. This latent separation is so pervasive that a family of backdoor defenses directly takes it as a default assumption (dubbed the latent separability assumption) and identifies poison samples via cluster analysis in the latent space. An intriguing question consequently follows: is latent separation unavoidable for backdoor poisoning attacks? This question is central to understanding whether the latent separability assumption provides a reliable foundation for defending against backdoor poisoning attacks. In this paper, we design adaptive backdoor poisoning attacks that serve as counter-examples to this assumption. Our methods include two key components: (1) a set of trigger-planted samples correctly labeled to their semantic classes (other than the target class) that regularize backdoor learning; (2) asymmetric trigger planting strategies that help boost the attack success rate (ASR) and diversify the latent representations of poison samples. Extensive experiments on benchmark datasets verify the effectiveness of our adaptive attacks in bypassing existing latent-separation-based defenses.

1. INTRODUCTION

Overparameterized deep neural network (DNN) models can fit complex datasets perfectly and generalize well on i.i.d. data distributions. However, the strong capacity of these models also renders them susceptible to backdoor poisoning attacks (Gu et al., 2017; Chen et al., 2017; Turner et al., 2019; Li et al., 2022). In a backdoor poisoning attack, an adversary manipulates only a small portion of the victim's training dataset. The victims train their own model on the manipulated dataset and consequently obtain a backdoored model. Typically, the adversary poisons the victim's dataset by injecting a small number of backdoor poison samples, each of which contains a backdoor trigger (e.g., a specific pixel patch) and is labeled to a specific target class. A DNN model trained on this poisoned dataset is backdoored in that it tends to learn an artificial correlation between the backdoor trigger and the target class. These attacks are stealthy since backdoored models behave normally on natural samples, and therefore users can hardly identify them.

Despite this stealthiness in terms of model performance on natural samples, it has been commonly observed (Tran et al., 2018; Chen et al., 2019; Huang et al., 2022) that backdoor poisoning attacks tend to leave tangible signatures in the latent space of backdoored models. As visualized in Fig. 1, poison samples tend to form a cluster separable from clean samples in the latent representation space. To challenge this latent separability assumption, we design adaptive backdoor poisoning attacks with two key components. (1) Regularization samples that penalize backdoor learning. After planting the backdoor trigger on a set of samples, we do not mislabel all of them to the target class. Instead, we randomly keep a fraction of them (namely regularization samples) correctly labeled to their real semantic classes. Intuitively, these additional regularization samples penalize the backdoor correlation between the trigger and the target class. (2) Trigger planting strategies that promote asymmetry and diversity. One may notice that the penalization of the backdoor correlation induced by regularization samples can also greatly hurt the attack success rate (ASR). We alleviate this problem via asymmetric trigger planting strategies. As illustrated in Fig. 2, we apply weakened triggers when constructing regularization and payload samples for data poisoning, while the original standard trigger is only used at test time to activate the backdoor. Conceptually, since test-time backdoor samples (with the standard trigger) contain stronger backdoor features than regularization samples (with weakened triggers), the test-time attack can mitigate the counter force from regularization samples and still maintain a high ASR. Besides asymmetry, our design also promotes diversity of triggers during data poisoning: different poison samples can be stamped with different partial triggers, selected from a diverse set of trigger partitions. Intuitively, this diversity allows backdoor poison samples to scatter more diversely in the latent representation space and thus avoid being aggregated into an easy-to-identify cluster.

In conclusion, the main contributions of this paper are four-fold. (1) We confirm that the latent separability assumption holds across a diverse set of backdoor poisoning attacks in the existing literature. (2) We reveal that this assumption can fail, leading to poor performance of defenses that explicitly base their designs on it. (3) We design simple yet effective adaptive backdoor poisoning attacks that bypass existing latent-separation-based defenses.
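To make the two components above concrete, the following is a minimal sketch of how such a poisoned training set could be assembled. It is an illustrative reconstruction under simplifying assumptions (images, trigger, and mask are same-shaped NumPy arrays, and the trigger is partitioned into disjoint pixel groups); all function names are ours, and the exact construction and hyperparameters evaluated in our experiments may differ.

```python
import random
import numpy as np

def plant_trigger(image, trigger, mask):
    # Stamp the trigger onto an image; `mask` selects which trigger pixels are applied.
    # Assumes image, trigger, and mask are NumPy arrays of the same shape.
    return image * (1 - mask) + trigger * mask

def split_mask(full_mask, num_pieces):
    # Partition the active trigger pixels into `num_pieces` disjoint partial masks.
    coords = np.argwhere(full_mask > 0)
    masks = []
    for piece in np.array_split(coords, num_pieces):
        m = np.zeros_like(full_mask)
        m[tuple(piece.T)] = full_mask[tuple(piece.T)]
        masks.append(m)
    return masks

def build_adaptive_poison_set(dataset, trigger, full_mask, target_class,
                              num_payload, num_regularization, num_pieces=4):
    # Returns (index, poisoned_image, label) triples to inject into the training set.
    # Payload samples carry a weakened (partial) trigger and are relabeled to the
    # target class; regularization samples carry a partial trigger but keep their
    # true labels, penalizing a naive trigger -> target-class correlation.
    piece_masks = split_mask(full_mask, num_pieces)
    indices = random.sample(range(len(dataset)), num_payload + num_regularization)
    poison_set = []
    for i, idx in enumerate(indices):
        image, label = dataset[idx]
        partial_mask = random.choice(piece_masks)  # diversity: a random trigger partition
        stamped = plant_trigger(image, trigger, partial_mask)
        if i < num_payload:
            poison_set.append((idx, stamped, target_class))  # payload: mislabeled
        else:
            poison_set.append((idx, stamped, label))         # regularization: correct label
    return poison_set

def test_time_backdoor_input(image, trigger, full_mask):
    # Asymmetry: at test time the adversary applies the full (standard) trigger,
    # a stronger cue than any single partial trigger seen during training.
    return plant_trigger(image, trigger, full_mask)
```

In this sketch, the asymmetry between training-time partial triggers and the test-time full trigger is what allows the attack to retain a high ASR despite the counter force introduced by the regularization samples.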



Figure 1: t-SNE visualization of the latent separability characteristic on CIFAR-10. Each point in the plots corresponds to a training sample from the target class. The caption of each subplot specifies its corresponding poisoning strategy. To highlight the separation, all poison samples are denoted by red points, while clean samples correspond to blue points.
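For readers who wish to reproduce this kind of plot, a minimal sketch is given below. It assumes a PyTorch feature extractor that returns latent (e.g., penultimate-layer) representations and uses scikit-learn's t-SNE; the exact architecture and t-SNE settings behind Figure 1 may differ.

```python
import numpy as np
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_latent_separation(feature_extractor, images, is_poison, device="cpu"):
    # `images`: tensor of training samples labeled to the target class (clean + poison).
    # `is_poison`: boolean array marking which of those samples are poison samples.
    # `feature_extractor`: maps inputs to their latent representations.
    feature_extractor.eval()
    with torch.no_grad():
        feats = feature_extractor(images.to(device)).flatten(1).cpu().numpy()

    emb = TSNE(n_components=2, perplexity=30).fit_transform(feats)
    plt.scatter(emb[~is_poison, 0], emb[~is_poison, 1], s=5, c="blue", label="clean")
    plt.scatter(emb[is_poison, 0], emb[is_poison, 1], s=5, c="red", label="poison")
    plt.legend()
    plt.show()
```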

