DLP: DATA-DRIVEN LABEL-POISONING BACKDOOR ATTACK

Abstract

Backdoor attacks, which aim to disrupt or paralyze classifiers on specific tasks, are becoming an emerging concern in several learning scenarios, e.g., Machine Learning as a Service (MLaaS). Various backdoor attacks have been introduced in the literature, including perturbation-based methods, which modify a subset of training data, and clean-sample methods, which relabel only a proportion of training samples. Clean-sample attacks can be particularly stealthy since they never require modifying the samples at the training or test stage. However, the state-of-the-art clean-sample attack, which relabels training data based on their semantic meanings, can be ineffective and inefficient in test performance due to its heuristic selection of semantic patterns. In this work, we introduce a new type of clean-sample backdoor attack, named the DLP backdoor attack, which allows attackers to backdoor effectively, as measured by test performance, for an arbitrary backdoor sample size. The critical component of DLP is a data-driven backdoor scoring mechanism embedded in a multi-task formulation, which enables attackers to perform well on the normal learning tasks and the backdoor tasks simultaneously. Systematic empirical evaluations show the superior performance of the proposed DLP over state-of-the-art clean-sample attacks.

1. INTRODUCTION

Backdoor attacks have become an emerging concern in several deep learning applications owing to their broad applicability and potentially dire consequences (Li et al., 2020). At a high level, a backdoor attack implants triggers into a learning model to achieve two goals simultaneously: (1) to lead the backdoored model to behave maliciously on attacker-specified tasks with an active backdoor trigger, e.g., a camouflage patch as demonstrated in Fig. 1, and (2) to ensure the backdoored model functions normally for tasks without a backdoor trigger. One popular framework is the perturbation-based backdoor attack (PBA) (Gu et al., 2017; Chen et al., 2017; Turner et al., 2019; Zhao et al., 2020; Doan et al., 2021a; b). In PBA, during the training stage, an attacker first creates a poisoned dataset by appending a set of backdoored data (with backdoor triggers) to the clean data, and then trains a model on the poisoned dataset. In the test stage, the attacker launches backdoor attacks by adding the same backdoor trigger to clean test data. The requirement of accessing and modifying data, including both features and labels, during the training and test stages makes PBA unrealistic in several applications. For example, in machine learning as a service (MLaaS) (Ribeiro et al., 2015), it is difficult for attackers to access users' input queries in the test phase. Consequently, a new type of attack, namely the clean-sample backdoor attack (Lin et al., 2020; Bagdasaryan et al., 2020), has attracted significant practical interest. In clean-sample backdoor attacks, the attacker changes labels instead of features, and only in the training stage, as illustrated in Fig. 1 and summarized in Table 1.
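To make the PBA training-stage procedure concrete, the following is a minimal sketch, not the implementation of any cited method: it stamps a hypothetical white-square patch trigger onto copies of a few clean samples and relabels them to an attacker-chosen target class before appending them to the clean set. The function name, patch shape, and trigger location are illustrative assumptions.

```python
import numpy as np

def make_pba_poisoned_set(x_clean, y_clean, target_label, n_poison,
                          patch_size=3, rng=None):
    """Append n_poison trigger-stamped copies of clean samples to the
    training set, each relabeled to target_label (a PBA-style sketch)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    idx = rng.choice(len(x_clean), size=n_poison, replace=False)
    x_bd = x_clean[idx].copy()
    # Stamp a white square trigger in the bottom-right corner of each image
    # (pixel values assumed to lie in [0, 1]).
    x_bd[:, -patch_size:, -patch_size:] = 1.0
    y_bd = np.full(n_poison, target_label, dtype=y_clean.dtype)
    return np.concatenate([x_clean, x_bd]), np.concatenate([y_clean, y_bd])
```

At test time, the PBA attacker would stamp the same patch onto clean inputs to activate the backdoor; this is exactly the test-stage access that clean-sample attacks avoid.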
The state-of-the-art (SOTA) clean-sample attack, known as the semantic backdoor attack (Bagdasaryan et al., 2020), first looks for images with a particular semantic meaning, e.g., green cars in the CIFAR10 dataset, then relabels all the training images carrying that semantic meaning to attacker-specified labels. Finally, the attacker trains a classifier on the modified data. In the inference stage, no further operations are needed by the attacker. It has been pointed out that clean-sample backdoor attacks are more malicious than perturbation-based attacks since they do not modify the features of input data (Li et al., 2020). Nevertheless, the SOTA clean-sample method, namely the semantic backdoor attack, is possibly limited in terms of backdoor
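The semantic relabeling step can be sketched as follows; this is an illustrative toy, not the authors' code. A boolean mask is assumed to already identify which training samples carry the chosen semantic pattern (e.g., "green car"), and only labels, never features, are changed.

```python
import numpy as np

def semantic_relabel(y, semantic_mask, target_label):
    """Return a relabeled copy of the label vector y in which every sample
    flagged by semantic_mask is mapped to target_label. Features are left
    untouched, and nothing is modified at test time (clean-sample attack)."""
    y_poisoned = y.copy()
    y_poisoned[semantic_mask] = target_label
    return y_poisoned
```

Because the poisoned set differs from the clean set only in these labels, a deployed model receives entirely unmodified inputs at inference, which is what makes the attack hard to detect by inspecting test queries.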

