NEURAL ATTENTION DISTILLATION: ERASING BACK-DOOR TRIGGERS FROM DEEP NEURAL NETWORKS

Abstract

Deep neural networks (DNNs) are known to be vulnerable to backdoor attacks, a training-time attack that injects a trigger pattern into a small proportion of the training data so as to control the model's predictions at test time. Backdoor attacks are notably dangerous since they do not affect the model's performance on clean examples, yet can fool the model into making incorrect predictions whenever the trigger pattern appears at test time. In this paper, we propose a novel defense framework, Neural Attention Distillation (NAD), to erase backdoor triggers from backdoored DNNs. NAD utilizes a teacher network to guide the finetuning of the backdoored student network on a small clean subset of data such that the intermediate-layer attention of the student network aligns with that of the teacher network. The teacher network can be obtained by an independent finetuning process on the same clean subset. We empirically show that, against 6 state-of-the-art backdoor attacks, NAD can effectively erase the backdoor triggers using only 5% of the clean training data without causing obvious performance degradation on clean examples. Our code is available at https://github.com/bboylyg/NAD.

1. INTRODUCTION

In recent years, deep neural networks (DNNs) have been widely adopted in many important real-world and safety-critical applications. Nonetheless, DNNs have been shown to be prone to threats at multiple phases of their life cycle. One well-studied adversary is the adversarial attack (Szegedy et al., 2013; Goodfellow et al., 2014; Ma et al., 2018; Jiang et al., 2019; Wang et al., 2019b; 2020; Duan et al., 2020; Ma et al., 2020): at test time, state-of-the-art DNN models can be fooled into making incorrect predictions by small adversarial perturbations (Madry et al., 2018; Carlini & Wagner, 2017; Wu et al., 2020; Jiang et al., 2020). DNNs are also vulnerable to another type of adversary known as the backdoor attack. Recently, backdoor attacks have gained more attention because they can be easily executed in real-world scenarios (Gu et al., 2019; Chen et al., 2017). Intuitively, a backdoor attack aims to trick a model into learning a strong correlation between a trigger pattern and a target label by poisoning a small proportion of the training data. Even trigger patterns as simple as a single pixel (Tran et al., 2018) or a black-white checkerboard (Gu et al., 2019) can grant attackers full control over the model's behavior. Backdoor attacks can be notoriously perilous for several reasons. First, backdoor data can infiltrate the model on numerous occasions, including when training on data collected from unreliable sources or when downloading pre-trained models from untrusted parties. Additionally, with the invention of more complex triggers such as natural reflections (Liu et al., 2020b) or invisible noise (Liao et al., 2020; Li et al., 2019; Chen et al., 2019c), it is much harder to catch backdoor examples at test time.
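As a concrete illustration of the poisoning process described above, the following is a minimal NumPy sketch of stamping a black-white checkerboard trigger onto a small fraction of a training set and relabeling those samples to a target class. The function names, patch size, and poisoning rate are illustrative assumptions, not details from the paper:

```python
import numpy as np

def stamp_trigger(image, patch_size=3):
    """Stamp a black-white checkerboard trigger onto the bottom-right
    corner of an HxWxC image with pixel values in [0, 1]."""
    poisoned = image.copy()
    # Checkerboard of alternating 0/1 pixels.
    checker = np.indices((patch_size, patch_size)).sum(axis=0) % 2
    poisoned[-patch_size:, -patch_size:, :] = checker[..., None].astype(image.dtype)
    return poisoned

def poison_dataset(images, labels, target_label, rate=0.1, seed=0):
    """Poison a small fraction of the training set: stamp the trigger and
    relabel the poisoned samples to the attacker's target class."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    images, labels = images.copy(), labels.copy()
    for i in idx:
        images[i] = stamp_trigger(images[i])
        labels[i] = target_label
    return images, labels, idx
```

A model trained on such a poisoned set behaves normally on clean inputs but predicts `target_label` whenever the checkerboard patch is present, which is exactly the correlation a backdoor attack seeks to implant.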
On top of that, once backdoor triggers have been embedded into the target model, it is hard to completely eradicate their malicious effects by standard finetuning or neural pruning (Yao et al., 2019; Li et al., 2020b; Liu et al., 2020b). A recent work proposed mode connectivity repair (MCR) to remove backdoor-related neural paths from the network (Zhao et al., 2020a). On the other hand, even though detection-based approaches perform fairly well at identifying backdoored models (Chen et al., 2019a; Tran et al., 2018; Chen et al., 2019b; Kolouri et al., 2020), the identified backdoored models still need to be purified by backdoor erasing techniques. In this work, we propose a novel backdoor erasing approach, Neural Attention Distillation (NAD), for the backdoor defense of DNNs. NAD is a distillation-guided finetuning process motivated by the ideas of knowledge distillation (Bucilua et al., 2006; Hinton et al., 2014) and neural attention transfer (Zagoruyko & Komodakis, 2017; Huang & Wang, 2017; Heo et al., 2019). Specifically, NAD utilizes a teacher network to guide the finetuning of a backdoored student network on a small subset of clean training data so that the intermediate-layer attention of the student network is well-aligned with that of the teacher network. The teacher network can be obtained from the backdoored student network via standard finetuning on the same clean subset of data. We empirically show that such an attention distillation step is far more effective at removing the network's attention on the trigger pattern than standard finetuning or neural pruning methods. Our main contributions can be summarized as follows:
• We propose a simple yet powerful backdoor defense approach called Neural Attention Distillation (NAD). NAD is by far the most comprehensive and effective defense against a wide range of backdoor attacks.
• We suggest that attention maps can serve as an intuitive tool for evaluating the performance of backdoor defense mechanisms, owing to their ability to highlight backdoored regions in a network's topology.

Backdoor Attack. More complex trigger patterns have also been proposed (Li et al., 2020c; Nguyen & Tran, 2020). Trigger patterns may also appear in the form of natural reflections (Liu et al., 2020b) or human-imperceptible noise (Liao et al., 2020; Li et al., 2019; Chen et al., 2019c), making them stealthier and hard to detect even by human inspection. Recent studies have shown that backdoor attacks can be conducted even without access to the training data (Liu et al., 2018b) or in federated learning settings (Xie et al., 2019; Bagdasaryan et al., 2020; Lyu et al., 2020). Surveys on backdoor attacks can be found in (Li et al., 2020a; Lyu et al., 2020).

2. RELATED WORK

Backdoor Defense. Existing works primarily focus on two types of strategies to defend against backdoor attacks: backdoor detection and trigger erasing. Detection-based methods aim to identify the existence of backdoor adversaries in the underlying model (Wang et al., 2019a; Kolouri et al., 2020) or to filter suspicious samples from the input data for re-training (Tran et al., 2018; Gao et al., 2019; Chen et al., 2019b). Although these methods perform fairly well at distinguishing whether a model has been poisoned, the backdoor effects still remain in the backdoored model. Erasing-based methods, on the other hand, aim to directly purify the backdoored model by removing the malicious impact of the backdoor triggers while maintaining the model's overall performance on clean data. A straightforward approach is to directly finetune the backdoored model on a small subset of clean data, which is typically available to the defender (Liu et al., 2018b). Nonetheless, training on only a small clean subset can lead to catastrophic forgetting (Kirkpatrick et al., 2017), where the model overfits to the subset and consequently suffers substantial performance degradation. Fine-pruning (Liu et al., 2018a) alleviates this issue by pruning less informative neurons prior to finetuning the model; in this way, the standard finetuning process can effectively erase the impact of backdoor triggers without significantly deteriorating the model's overall performance. WILD (Liu et al., 2020a) utilizes data augmentation alongside distribution alignment between clean samples and their occluded versions to remove backdoor triggers from DNNs. Other techniques such as regularization (Truong et al., 2020) and mode connectivity repair (Zhao et al., 2020a) have also been explored to mitigate backdoor attacks.
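The pruning step of Fine-pruning exploits the observation that backdoor neurons tend to stay dormant on clean inputs. A minimal NumPy sketch of that selection step is given below; the function names, the 20% pruning ratio, and the channel-masking convention are illustrative assumptions rather than the original implementation:

```python
import numpy as np

def select_neurons_to_prune(clean_activations, prune_ratio=0.2):
    """Rank channels by their mean activation over a clean data subset and
    return the indices of the least-activated (likely backdoor) channels.
    clean_activations: N x C array of per-sample, per-channel activations."""
    mean_act = clean_activations.mean(axis=0)
    n_prune = int(prune_ratio * len(mean_act))
    return np.argsort(mean_act)[:n_prune]  # least-activated channels first

def apply_channel_mask(weights, pruned):
    """Zero out the rows of a C x K weight matrix for pruned channels,
    effectively disabling those neurons before finetuning."""
    masked = weights.copy()
    masked[pruned] = 0.0
    return masked
```

After pruning, the defender finetunes the masked model on the clean subset, which is the stage NAD replaces with attention-guided distillation.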
While promising, existing backdoor erasing methods still suffer from a number of drawbacks: efficient methods can be evaded by the latest attacks (Liu et al., 2018a; 2020b), whereas effective methods are typically computationally expensive (Zhao et al., 2020a). In this work, we propose a novel finetuning-based backdoor erasing approach that is not only effective but also efficient against a wide range of backdoor attacks.
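The attention-alignment idea at the core of NAD can be sketched as follows: collapse an intermediate feature map into a spatial attention map, normalize it, and penalize the distance between the student's and teacher's maps alongside the usual classification loss. This is a minimal NumPy sketch under our own assumptions (function names, the sum-of-squared-activations attention operator, and the `beta` weight are illustrative, not the paper's exact formulation):

```python
import numpy as np

def attention_map(feature):
    """Collapse a CxHxW feature map into an HxW attention map by summing
    squared activations over channels (in the spirit of attention transfer,
    Zagoruyko & Komodakis, 2017)."""
    return (feature ** 2).sum(axis=0)

def nad_loss(student_feat, teacher_feat, eps=1e-8):
    """Distance between L2-normalised, flattened attention maps of the
    backdoored student and the finetuned teacher at one layer."""
    a_s = attention_map(student_feat).ravel()
    a_t = attention_map(teacher_feat).ravel()
    a_s = a_s / (np.linalg.norm(a_s) + eps)
    a_t = a_t / (np.linalg.norm(a_t) + eps)
    return np.linalg.norm(a_s - a_t)

def total_loss(ce_loss, student_feats, teacher_feats, beta=1000.0):
    """Finetuning objective on the clean subset: cross-entropy plus the
    attention loss summed over the selected intermediate layers."""
    return ce_loss + beta * sum(
        nad_loss(s, t) for s, t in zip(student_feats, teacher_feats))
```

Minimizing this objective pulls the student's intermediate-layer attention toward the teacher's, which on clean data no longer attends to the trigger region, thereby erasing the backdoor.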



† Correspondence to: Xixiang Lyu (xxlv@mail.xidian.edu.cn), Xingjun Ma (daniel.ma@deakin.edu.au)

