IN SEARCH OF SMOOTH MINIMA FOR PURIFYING BACKDOOR IN DEEP NEURAL NETWORKS

Abstract

The success of a deep neural network (DNN) heavily relies on the details of the training scheme, e.g., training data, architecture, hyper-parameters, etc. Recent backdoor attacks suggest that an adversary can take advantage of such training details and compromise the integrity of a DNN. Our studies show that a backdoor model is usually optimized to a bad local minimum, i.e., a sharper minimum compared to a benign model. Intuitively, the backdoor model can be purified by re-optimizing it to a smoother minimum through fine-tuning with a few clean validation samples. However, fine-tuning all DNN parameters often incurs a huge computational cost and often results in sub-par clean test performance. To address this concern, we propose a novel backdoor purification technique, Natural Gradient Fine-tuning (NGF), which removes the backdoor by fine-tuning only one layer. Specifically, NGF utilizes a loss surface geometry-aware optimizer that can successfully overcome the challenge of reaching a smooth minimum in the one-layer optimization scenario. To enhance the generalization performance of our proposed method, we introduce a clean data distribution-aware regularizer based on the loss surface curvature matrix, i.e., the Fisher Information Matrix. Extensive experiments show that the proposed method achieves state-of-the-art performance on a wide range of backdoor defense benchmarks: four different datasets (CIFAR10, GTSRB, Tiny-ImageNet, and ImageNet) and 13 recent backdoor attacks (e.g., Blend, Dynamic, WaNet, ISSBA).

1. INTRODUCTION

Training a deep neural network (DNN) with a fraction of poisoned or malicious data is often security-critical, since the model can learn both the clean and the adversarial task equally well. This is prominent in scenarios where one outsources DNN training to a vendor. In such scenarios, an adversary can mount backdoor attacks (Gu et al., 2019; Chen et al., 2017) by poisoning a portion of the training samples so that the model misclassifies any sample carrying a particular trigger or pattern to an adversary-set label. Whenever a DNN is trained in such a manner, it becomes crucial to remove the effect of the backdoor before deploying the model in a real-world application. Different defense techniques (Liu et al., 2018; Wang et al., 2019; Wu & Wang, 2021; Li et al., 2021a; Zheng et al., 2022) have been proposed for purifying backdoors. Techniques such as fine-pruning (Liu et al., 2018) and adversarial neural pruning (Wu & Wang, 2021) require a long training time due to their iterative search criteria. Furthermore, their purification performance deteriorates significantly as attacks get stronger. In this work, we explore the backdoor insertion and removal phenomena from a DNN optimization point of view. Unlike a benign model, a backdoor model is forced to learn two different data distributions: the clean data distribution and the poisoned/trigger data distribution. Having to learn both distributions, backdoor model optimization usually leads to a bad local minimum, i.e., a sharper minimum w.r.t. the clean distribution. We claim that the backdoor can be removed by re-optimizing the model to a smoother minimum. One straightforward re-optimization scheme is to fine-tune the DNN weights with a few clean validation samples. However, fine-tuning all DNN parameters often incurs a huge computational cost and may result in sub-par clean test performance after purification. Therefore, we fine-tune only one layer to remove the backdoor effectively.
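The sharpness claim above can be probed numerically, e.g., by estimating the largest Hessian eigenvalue of the loss at the converged weights: a larger top eigenvalue indicates a sharper minimum. The sketch below is a minimal illustration, not the paper's measurement protocol; it uses a toy logistic-regression loss as a stand-in for the clean-data loss, finite-difference Hessian-vector products, and power iteration (all function names are ours):

```python
import numpy as np

def loss(w, X, y):
    # Toy logistic-regression loss; stands in for the clean-data loss L(w).
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def grad(w, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

def hvp(w, v, X, y, h=1e-5):
    # Finite-difference Hessian-vector product: (g(w + hv) - g(w - hv)) / 2h.
    return (grad(w + h * v, X, y) - grad(w - h * v, X, y)) / (2 * h)

def top_hessian_eigenvalue(w, X, y, iters=100, seed=0):
    # Power iteration on the Hessian; a larger value means a sharper minimum.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        Hv = hvp(w, v, X, y)
        lam = float(v @ Hv)
        nrm = np.linalg.norm(Hv)
        if nrm < 1e-12:
            break
        v = Hv / nrm
    return lam

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X @ np.ones(5) > 0).astype(float)
w = np.zeros(5)
print(top_hessian_eigenvalue(w, X, y))
```

In this spirit, one would compare the estimate at a backdoor model's minimum against a benign model's; for full DNNs the same power iteration is typically run with autograd Hessian-vector products instead of finite differences.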
Fine-tuning only one layer creates a shallow-network scenario in which SGD-based optimization becomes challenging. Choromanska et al. (2015) claim that the probability of finding bad local minima or poor-quality solutions increases as the network size decreases. Even when good-quality solutions exist, it usually takes exponentially long to find those minima (Choromanska et al., 2015). As a remedy, we opt for a curvature-aware optimizer, Natural Gradient Descent (NGD), which has a higher probability of escaping bad local minima as well as a faster convergence rate, specifically in the shallow-network scenario (Amari, 1998; Martens & Grosse, 2015). To this end, we propose a novel backdoor purification technique, Natural Gradient Fine-tuning (NGF), which removes the backdoor by fine-tuning only one layer. However, a straightforward application of NGF with a simple cross-entropy (CE) loss may result in poor clean test performance. To boost this performance, we use a clean distribution-aware regularizer that prioritizes the update of parameters sensitive to the clean data distribution. Our proposed method achieves SOTA performance on a wide range of benchmarks, e.g., four different datasets including ImageNet and 13 recent backdoor attacks. Our contributions can be summarized as follows:
• We analyze the loss surface characteristics of a DNN during the backdoor insertion and purification processes. Our analysis shows that the optimization of a backdoor model leads to a bad local minimum, i.e., a sharper minimum, compared to a benign model. We argue that the backdoor can be purified by re-optimizing the model to a smoother minimum, and that simple fine-tuning can be a viable way to do so. To the best of our knowledge, this is the first work to study the correlation between loss-surface smoothness and backdoor purification.
• We conduct additional studies on the backdoor purification process while fine-tuning different parts of a DNN.
We observe that SGD-based one-layer fine-tuning fails to escape bad local minima, and that a loss surface geometry-aware optimizer is an easy fix.
• We propose a novel backdoor purification technique based on Natural Gradient Fine-tuning (NGF). In addition, we employ a clean distribution-aware regularizer to boost the clean test performance of our proposed method. NGF outperforms recent SOTA methods on a wide range of benchmarks.

2. RELATED WORK

Backdoor Attacks: Backdoor triggers can exist in the form of dynamic patterns (Li et al., 2020), a single pixel (Tran et al., 2018), sinusoidal strips (Barni et al., 2019), human-imperceptible noise (Zhong et al., 2020), natural reflections (Liu et al., 2020), adversarial patterns (Zhang et al., 2021), blended backgrounds (Chen et al., 2017), etc. Based on the target labels, existing backdoor attacks can generally be classified as poison-label or clean-label attacks. In a poison-label backdoor attack, the target label of the poisoned sample differs from its ground-truth label, e.g., BadNets (Gu et al., 2019), the Blended attack (Chen et al., 2017), the SIG attack (Barni et al., 2019), WaNet (Nguyen & Tran, 2021), the Trojan attack (Liu et al., 2017), and BPPA (Wang et al., 2022). In contrast, a clean-label backdoor attack does not change the label of the poisoned sample (Turner et al., 2018; Huang et al., 2022; Zhao et al., 2020b). Recently, Saha et al. (2022) studied backdoor attacks on self-supervised learning.

Backdoor Defenses: Existing backdoor defense methods can be categorized into backdoor detection and purification techniques. Detection-based defenses include trigger synthesis approaches (Wang et al., 2019; Qiao et al., 2019; Guo et al., 2020; Shen et al., 2021; Dong et al., 2021; Guo et al., 2021; Xiang et al., 2022; Tao et al., 2022) and malicious-sample filtering techniques (Tran et al., 2018; Gao et al., 2019; Chen et al., 2019). However, these methods only detect the existence of a backdoor without removing it. Purification defenses can be further classified into training-time and inference-time defenses. Training-time defenses include the model reconstruction approach (Zhao et al., 2020a) and robust training against data poisoning (Steinhardt et al., 2017). Pruning-based approaches typically exploit model vulnerabilities to backdoor attacks. For example, MCR (Zhao et al., 2020a) and CLP (Zheng et al., 2022) analyze node connectivity and the channel Lipschitz constant, respectively, to detect backdoor-vulnerable neurons. ANP (Wu & Wang, 2021) prunes neurons through backdoor sensitivity analysis using adversarial search on the parameter space. Instead, we propose a simple one-layer fine-tuning based defense that is both fast and highly effective. To remove the backdoor, our proposed method revisits the DNN fine-tuning paradigm from a novel point of view.


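The core update behind one-layer natural-gradient fine-tuning can be illustrated with a small self-contained sketch. The code below is our toy rendering, not the paper's exact formulation: it fine-tunes a single linear classification head on a small "clean" batch using damped natural-gradient steps with an empirical Fisher, and adds an EWC-style diagonal-Fisher pull toward the original weights as one plausible form of a clean distribution-aware regularizer. All function names, the damping constant, and the regularizer form are our illustrative choices:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def ce_loss(W, X, Y):
    P = softmax(X @ W)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

def ce_grad(W, X, Y):
    # Gradient of the CE loss w.r.t. the flattened last-layer weights.
    P = softmax(X @ W)
    return (X.T @ (P - Y) / len(X)).ravel()

def empirical_fisher(W, X, Y):
    # Empirical Fisher: average outer product of per-sample gradients.
    F = np.zeros((W.size, W.size))
    for i in range(len(X)):
        g = ce_grad(W, X[i:i + 1], Y[i:i + 1])
        F += np.outer(g, g)
    return F / len(X)

def ngf_step(W, W0, F0_diag, X, Y, lr=0.3, damping=0.05, reg=0.1):
    # One damped natural-gradient step on a clean batch. The reg term is an
    # EWC-style diagonal-Fisher pull toward the original weights W0, so
    # parameters most sensitive to the clean distribution move least.
    g = ce_grad(W, X, Y) + reg * F0_diag * (W - W0).ravel()
    F = empirical_fisher(W, X, Y)
    step = np.linalg.solve(F + damping * np.eye(W.size), g)
    return W - lr * step.reshape(W.shape)

# Toy demo: fine-tune a 2-class linear head on a small "clean" batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
Y = np.eye(2)[(X[:, 0] > 0).astype(int)]
W0 = rng.normal(scale=0.1, size=(4, 2))
F0_diag = np.diag(empirical_fisher(W0, X, Y))
W = W0.copy()
for _ in range(20):
    W = ngf_step(W, W0, F0_diag, X, Y)
print(ce_loss(W0, X, Y), "->", ce_loss(W, X, Y))
```

For a real DNN last layer the explicit Fisher inverse would be replaced by a structured approximation (e.g., Kronecker-factored, as in Martens & Grosse, 2015), since materializing the full curvature matrix is only feasible at toy scale.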