EFFICIENT AND STEALTHY BACKDOOR ATTACK TRIGGERS ARE CLOSE AT HAND

Abstract

A backdoor attack aims to inject a backdoor into a deep model so that the model performs normally on benign samples but maliciously predicts the attacker-defined target class when the backdoor is activated by a predefined trigger pattern. Most existing backdoor attacks use a pattern that rarely occurs in benign data as the trigger pattern. In this way, the impact of the attack on the label predictions of benign data can be mitigated. However, this practice also makes the attack easy to defend against, with little performance degradation on benign data, simply by preventing the trigger pattern from being activated. In this work, we present a new attack strategy to solve this dilemma. Unlike the conventional strategy, our strategy extracts the trigger pattern from benign training data: a pattern that frequently occurs in samples of the target class but rarely occurs in samples of the other classes. Compared with the prevailing strategy, our proposed strategy has two advantages. First, it can improve the efficiency of the attack, because learning on benign samples of the target class facilitates the fitting of the trigger pattern. Second, it increases the difficulty or cost of identifying the trigger pattern and preventing its activation, since many benign samples of the target class contain the trigger pattern. We empirically evaluate our strategy on four benchmark datasets. The experimental studies show that attacks performed with our strategy achieve much better performance when poisoning only 0.1% or more of the training data, and achieve better performance against several benchmark defense algorithms.

1. INTRODUCTION

Backdoor attacks, also known as Trojan horse attacks, have become an increasing security threat in recent years, attracting considerable research interest (Chen et al., 2017; Doan et al., 2021). Such an attack aims to inject a backdoor into a deep model so that the model behaves normally on benign samples, while its predictions are maliciously and consistently changed to a predefined target class (or classes) when the backdoor is activated. Currently, poisoning training samples is the most straightforward and widely adopted method for performing backdoor attacks. Depending on whether the labels of the poisoned samples are changed or not, existing backdoor attacks can be roughly divided into poison-label backdoor attacks (Gu et al., 2017; Barni et al., 2019; Nguyen & Tran, 2020; Liu et al., 2020; Qi et al., 2021) and clean-label backdoor attacks (Turner et al., 2019; Saha et al., 2020). In this work, we follow the practice of poison-label backdoor attacks, which are much more efficient than clean-label ones. To perform a poison-label backdoor attack, the attacker first selects a small number of benign samples from the training set (a smaller number means a smaller impact on the performance on benign data and a lower probability of being inspected by the developer), inserts a predefined trigger pattern into their inputs, and changes their labels to the target class (or classes). The attacker then reinserts the poisoned samples into the training set and provides the resulting data to the victims to train the model with. In attack mode, the attacker manipulates the victim model into producing the intended target class by injecting the predefined trigger pattern into the inputs. Existing backdoor attacks typically use a pattern that rarely occurs in benign data as the trigger pattern. For example, Gu et al. (2017) used stickers and checkerboards as trigger patterns for image data, and Sun (2020) used special characters, words, and phrases as trigger patterns for text data.
This strategy can prevent the attack from being falsely activated on benign data and mitigate the impact of the attack on the label predictions of benign samples. However, it also allows the attack to be detected and mitigated with little effort by preventing the trigger pattern from being activated. As far as we know, most existing backdoor defense algorithms build on this property and achieve great success (Chen et al., 2019; Wang et al., 2019; Zhao et al., 2020; Xu et al., 2020; Yang et al., 2021; Huang et al., 2022a; Mu et al., 2022). In this work, we introduce a new attack strategy to solve this dilemma. Instead of using a rare pattern as the trigger, we extract the trigger pattern from benign data: a pattern that frequently occurs in samples of the target class but rarely occurs in samples of non-target classes. Accordingly, we introduce a framework to extract the trigger pattern and insert it into benign inputs to perform data poisoning. The proposed strategy has two advantages over the prevailing one. First, it is more efficient, since the trigger pattern comes from the target class: training on benign samples of the target class helps the fitting of the trigger pattern. Second, it becomes harder to detect and defend against the attack, because many benign samples of the target class contain the trigger pattern. It is difficult to distinguish the poisoned samples from benign samples of the target class using the trigger pattern, and disabling the activation of the trigger pattern degrades performance on these benign target-class samples. To evaluate the benefits of our proposed strategy, we conducted experiments on four widely studied datasets. The empirical results show that attacks performed with our strategy achieve better attack performance with less poisoned data, especially when the poisoned data size is small.
Moreover, they can evade several benchmark defense algorithms, whereas attacks performed with the conventional strategy often fail to do so. The main contributions of this work can be summarized in three points. 1) A new attack strategy for designing the trigger pattern is proposed. 2) An effective technique for extracting the trigger pattern and injecting it into benign data is proposed. 3) Experiments are conducted on four widely studied datasets, and the results show that our strategy can improve both attack efficiency and stealthiness.
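The core selection criterion behind the proposed strategy, choosing a pattern that is frequent in the target class but rare elsewhere, can be sketched as follows. This is a minimal illustration, not the paper's actual extraction framework: the candidate set, the `contains` predicate, and the scoring rule are all hypothetical simplifications.

```python
def select_trigger(samples, labels, target_class, candidates, contains):
    """Pick the candidate pattern with the largest frequency gap between
    target-class samples and non-target samples.

    contains(x, p) is a hypothetical predicate testing whether sample x
    exhibits pattern p (e.g. a word for text, a patch statistic for images).
    """
    target = [x for x, y in zip(samples, labels) if y == target_class]
    others = [x for x, y in zip(samples, labels) if y != target_class]

    def freq(pool, p):
        # Fraction of samples in `pool` that contain pattern p.
        return sum(contains(x, p) for x in pool) / max(1, len(pool))

    # Frequent in the target class, rare in the other classes.
    return max(candidates, key=lambda p: freq(target, p) - freq(others, p))
```

For text data, for instance, `contains` could simply test substring membership, and the selected pattern would be a phrase characteristic of the target class.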

2. BACKGROUND & RELATED WORK

2.1 BACKDOOR ATTACK

Threat Model. Backdoor attacks aim to insert backdoors into deep models so that victim models behave normally on benign inputs but misbehave when a specific trigger pattern occurs. This is an emerging area of research that raises security concerns about training with third-party resources. A backdoor attack can be performed at multiple stages of the victim model's lifecycle (Li et al., 2022a; Jia et al., 2022; Chen et al., 2022). In this work, we follow the widely used training data poisoning setting, where attackers manipulate only the training data, not the model, training schedule, or inference pipeline. In this case, the attackers spread the attack by sending the poisoned data to the victims or publishing it on the Internet and waiting for the victims. Formally, let x ∈ X denote the input, y ∈ Y denote the output, and D = {(x_1, y_1), ..., (x_n, y_n)} denote the benign training dataset. To perform a backdoor attack, the attacker first selects a small number of benign samples from D, then injects the predefined trigger pattern into each selected input and sets its label as follows: x̃ = B(x); ỹ = T(y), where B denotes the backdoor injection function and T the relabel function. In the usual all-to-one poison-label attack setting (which we adopt in this work), T(y) ≡ c_t, where c_t denotes the label of the attacker-specified target class. The attacker then mixes the poisoned samples (x̃, ỹ) with the remaining benign training samples and provides the resulting dataset D̃ to the victims. Unaware that the training data has been poisoned, the victims train a model f_vim on D̃ and provide a classification service using f_vim. In attack mode, the attacker injects the trigger pattern into any input x, causing the victim model to behave as follows: f_vim(x) = y; f_vim(B(x)) = T(y).
Previous Poison-Label Backdoor Attacks. Existing poison-label backdoor attacks typically use a pattern that rarely occurs in benign data as the trigger pattern, so as not to degrade the model's performance on benign data (Liu et al., 2017; Chen et al., 2017; Barni et al., 2019; Liu et al., 2020).
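The all-to-one poison-label setting described above can be sketched in a few lines. This is a schematic illustration under stated assumptions: the dataset is a list of (x, y) pairs, `inject_trigger` stands in for the injection function B (whose concrete form depends on the attack), and T(y) ≡ c_t is the constant relabel function.

```python
import random

def poison_dataset(dataset, inject_trigger, target_class, poison_rate=0.001):
    """All-to-one poison-label data poisoning: for a small random subset,
    replace (x, y) with (B(x), c_t), where B = inject_trigger and
    c_t = target_class. The victim trains on the returned dataset."""
    n_poison = max(1, int(len(dataset) * poison_rate))
    indices = random.sample(range(len(dataset)), n_poison)
    poisoned = list(dataset)
    for i in indices:
        x, _ = poisoned[i]
        poisoned[i] = (inject_trigger(x), target_class)  # (x~, y~) = (B(x), T(y))
    return poisoned
```

At inference time, the attacker applies the same `inject_trigger` to an arbitrary input to steer the victim model's prediction toward `target_class`.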

