EFFICIENT AND STEALTHY BACKDOOR ATTACK TRIGGERS ARE CLOSE AT HAND

Abstract

A backdoor attack aims to inject a backdoor into a deep model so that the model performs normally on benign samples while maliciously predicting the input as the attacker-defined target class when the backdoor is activated by a predefined trigger pattern. Most existing backdoor attacks use a pattern that rarely occurs in benign data as the trigger pattern. In this way, the impact of the attack on the label prediction of benign data can be mitigated. However, this practice also allows the attack to be defended against, with little performance degradation on benign data, by preventing the trigger pattern from being activated. In this work, we present a new attack strategy to resolve this dilemma. Unlike the conventional strategy, our strategy extracts the trigger pattern from the benign training data: a pattern that frequently occurs in samples of the target class but rarely occurs in samples of the other classes. Compared with the prevailing strategy, the proposed strategy has two advantages. First, it improves the efficiency of the attack, because learning on benign samples of the target class facilitates the fitting of the trigger pattern. Second, it increases the difficulty or cost of identifying the trigger pattern and preventing its activation, since many benign samples of the target class contain the trigger pattern. We empirically evaluate our strategy on four benchmark datasets. The experimental studies show that attacks performed with our strategy achieve much better performance when poisoning only 0.1% or more of the training data, and exhibit stronger resistance against several benchmark defense algorithms.

1. INTRODUCTION

Backdoor attacks, also known as Trojan horse attacks, have become an increasing security threat in recent years, attracting considerable research interest (Chen et al., 2017; Doan et al., 2021). The attack aims to inject a backdoor into a deep model so that the model behaves normally on benign samples, while its predictions are maliciously and consistently changed to a predefined target class (or classes) when the backdoor is activated. Currently, poisoning training samples is the most straightforward and widely adopted method for performing backdoor attacks. Depending on whether the labels of the poisoned samples are changed, existing backdoor attacks can be roughly divided into poison-label backdoor attacks (Gu et al., 2017; Barni et al., 2019; Nguyen & Tran, 2020; Liu et al., 2020; Qi et al., 2021) and clean-label backdoor attacks (Turner et al., 2019; Saha et al., 2020). In this work, we follow the practice of poison-label backdoor attacks, which are much more efficient than clean-label ones. To perform a poison-label backdoor attack, the attacker first selects a small number of benign samples from the training set (a smaller number means a smaller impact on the performance on benign data and a lower probability of being inspected by the developer), inserts a predefined trigger pattern into their inputs, and changes their labels to the target class (or classes). The attacker then reinserts the poisoned samples into the training set and provides the resulting data to the victim to train the model with. At inference time, the attacker manipulates the victim model into producing the intended target class by injecting the predefined trigger pattern into the input.

Existing backdoor attacks typically use a pattern that rarely occurs in benign data as the trigger pattern. For example, Gu et al. (2017) used stickers and checkerboards as trigger patterns for image data, and Sun (2020) used special characters, words, and phrases as trigger patterns for text data. This strategy can prevent the attack from being falsely activated on benign data and mitigate the impact of the attack on the prediction of benign samples.
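As a concrete sketch of the poison-label poisoning procedure described above, the snippet below stamps a small patch trigger onto a fraction of benign training images and relabels them as the target class before the data is handed to the victim. The white-patch trigger, the poisoning rate, the `(N, H, W, C)` float image layout, and the function name are illustrative assumptions, not the exact setup used in this paper.

```python
import numpy as np

def poison_dataset(images, labels, target_class, poison_rate=0.01, seed=0):
    """Poison-label backdoor poisoning on an image dataset.

    Assumes `images` has shape (N, H, W, C) with values in [0, 1] and
    `labels` has shape (N,). The 3x3 white patch and 1% poisoning rate
    are illustrative choices, not the paper's configuration.
    """
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = labels.copy()

    # Select a small fraction of benign training samples to poison.
    n_poison = max(1, int(poison_rate * len(images)))
    poison_idx = rng.choice(len(images), size=n_poison, replace=False)

    for i in poison_idx:
        # Insert the predefined trigger pattern (here, a white patch in
        # the bottom-right corner) into the benign input ...
        images[i, -3:, -3:, :] = 1.0
        # ... and change its label to the attacker-defined target class.
        labels[i] = target_class

    # The poisoned samples are mixed back into the training set that is
    # provided to the victim, who trains the model on it as usual.
    return images, labels
```

At inference time, the attacker applies the same trigger-stamping step to any input whose prediction should be redirected to the target class.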

