IN SEARCH OF SMOOTH MINIMA FOR PURIFYING BACKDOOR IN DEEP NEURAL NETWORKS

Abstract

The success of a deep neural network (DNN) heavily relies on the details of the training scheme, e.g., training data, architecture, and hyper-parameters. Recent backdoor attacks suggest that an adversary can take advantage of such training details and compromise the integrity of a DNN. Our studies show that a backdoor model is usually optimized to a bad local minimum, i.e., a sharper minimum compared to that of a benign model. Intuitively, a backdoor model can be purified by re-optimizing it to a smoother minimum through fine-tuning with a few clean validation samples. However, fine-tuning all DNN parameters requires a huge computational cost and often results in sub-par clean test performance. To address this concern, we propose a novel backdoor purification technique, Natural Gradient Fine-tuning (NGF), which removes the backdoor by fine-tuning only one layer. Specifically, NGF utilizes a loss-surface-geometry-aware optimizer that can successfully overcome the challenge of reaching a smooth minimum under the one-layer optimization scenario. To enhance the generalization performance of our proposed method, we introduce a clean-data-distribution-aware regularizer based on the loss-surface curvature matrix, i.e., the Fisher Information Matrix. Extensive experiments show that the proposed method achieves state-of-the-art performance on a wide range of backdoor defense benchmarks: four different datasets (CIFAR10, GTSRB, Tiny-ImageNet, and ImageNet) and 13 recent backdoor attacks, including Blend, Dynamic, WaNet, and ISSBA.
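As a concrete illustration of the core idea, a minimal PyTorch-style sketch of one-layer, Fisher-aware fine-tuning is given below. It assumes a diagonal empirical Fisher approximation of the curvature matrix, treats a single layer (e.g., the classification head) as the only trainable parameters, and uses illustrative function names (diagonal_fisher, purify_one_layer) and hyper-parameters; it is a sketch of the underlying principle, not the exact procedure evaluated in our experiments.

# A minimal, illustrative sketch of one-layer natural-gradient fine-tuning
# with a diagonal Fisher regularizer. Names and hyper-parameters are
# illustrative; the diagonal empirical Fisher approximates the full
# Fisher Information Matrix used as the curvature measure.
import torch
from torch.nn.functional import cross_entropy


def diagonal_fisher(model, layer, clean_loader, device="cpu"):
    # Diagonal empirical Fisher of `layer`, estimated from squared
    # mini-batch gradients of the loss on clean validation data.
    fisher = [torch.zeros_like(p) for p in layer.parameters()]
    model.eval()
    n = 0
    for x, y in clean_loader:
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        cross_entropy(model(x), y).backward()
        for f, p in zip(fisher, layer.parameters()):
            f += p.grad.detach() ** 2 * x.size(0)
        n += x.size(0)
    return [f / n for f in fisher]


def purify_one_layer(model, layer, clean_loader, steps=100, lr=1e-2,
                     lam=1.0, eps=1e-8, device="cpu"):
    # Fine-tune only `layer`: precondition its gradients with the inverse
    # diagonal Fisher (a natural-gradient step) and penalize movement
    # along high-curvature directions of the clean-data loss surface.
    model.eval()  # keep BatchNorm statistics fixed during purification
    for p in model.parameters():
        p.requires_grad_(False)
    for p in layer.parameters():
        p.requires_grad_(True)

    fisher = diagonal_fisher(model, layer, clean_loader, device)
    anchor = [p.detach().clone() for p in layer.parameters()]

    data_iter = iter(clean_loader)
    for _ in range(steps):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(clean_loader)
            x, y = next(data_iter)
        x, y = x.to(device), y.to(device)

        model.zero_grad()
        loss = cross_entropy(model(x), y)
        # Fisher-weighted penalty: stay close to the pre-fine-tuning
        # weights where the clean-data curvature is high.
        for p, p0, f in zip(layer.parameters(), anchor, fisher):
            loss = loss + lam * (f * (p - p0) ** 2).sum()
        loss.backward()

        with torch.no_grad():
            for p, f in zip(layer.parameters(), fisher):
                # Natural-gradient update with a diagonal Fisher preconditioner.
                p -= lr * p.grad / (f + eps)
    return model

In this sketch, layer would typically be the final linear layer of the backdoored network and clean_loader iterates over the small clean validation set used for purification.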

1. INTRODUCTION

Training a deep neural network (DNN) with a fraction of poisoned or malicious data is often security-critical, since the model can learn both the clean and the adversarial task equally well. This is prominent in scenarios where one outsources the DNN training to a vendor. In such scenarios, an adversary can mount backdoor attacks (Gu et al., 2019; Chen et al., 2017) by poisoning a portion of the training samples so that the model misclassifies any sample containing a particular trigger or pattern as an adversary-set label. Whenever a DNN is trained in such a manner, it becomes crucial to remove the effect of the backdoor before deploying it in a real-world application. Different defense techniques (Liu et al., 2018; Wang et al., 2019; Wu & Wang, 2021; Li et al., 2021a; Zheng et al., 2022) have been proposed for backdoor purification. Techniques such as fine-pruning (Liu et al., 2018) and adversarial neural pruning (Wu & Wang, 2021) require a long training time due to their iterative search criteria. Furthermore, their purification performance deteriorates significantly as the attacks get stronger.

In this work, we explore the backdoor insertion and removal phenomena from the DNN optimization point of view. Unlike a benign model, a backdoor model is forced to learn two different data distributions: the clean data distribution and the poisoned/trigger data distribution. Having to learn both distributions, the optimization of a backdoor model usually leads to a bad local minimum, i.e., a sharper minimum w.r.t. the clean distribution. We claim that the backdoor can be removed by re-optimizing the model to a smoother minimum. One easy re-optimization scheme is simply fine-tuning the DNN weights with a few clean validation samples. However, fine-tuning all DNN parameters requires a huge computational cost and may result in sub-par clean test performance after purification. Therefore, we intend to fine-tune only one layer to effectively remove the backdoor. Fine-tuning only one layer creates a shallow-network scenario in which SGD-based optimization becomes challenging. Choromanska et al. (2015) claim that the probability of finding bad local minima, i.e., poor-quality solutions, increases as the network size decreases. Even though there are

