DO WE REALLY NEED LABELS FOR BACKDOOR DEFENSE?

Abstract

Since training a model from scratch requires massive computational resources, it has recently become popular to download pre-trained backbones from third-party platforms and deploy them in various downstream tasks. While convenient, this practice introduces potential security risks such as backdoor attacks, which cause targeted misclassification for any input image containing a specifically defined trigger (i.e., backdoored examples). Current backdoor defense methods typically rely on clean labeled data, meaning that safely deploying a pre-trained model in downstream tasks still demands labels that are costly or hard to obtain. In this paper, we focus on purifying a backdoored backbone with only unlabeled data. To evoke the backdoor patterns without labels, we propose to leverage an unsupervised contrastive loss to search for backdoors in the feature space. Surprisingly, we find that adversarial examples crafted with the contrastive loss can mimic backdoored examples, and that adversarial finetuning can erase them. We therefore name our method Contrastive Backdoor Defense (CBD). Extensive experiments on backdoored backbones from both supervised and self-supervised learning demonstrate that our unsupervised method achieves defense performance comparable to, or even better than, supervised backdoor defense methods. Our method thus allows practitioners to safely deploy pre-trained backbones on downstream tasks without extra labeling costs.

1. INTRODUCTION

While deep neural networks (DNNs) have achieved promising performance on various tasks, including computer vision (He et al., 2016) and natural language processing (Floridi & Chiriatti, 2020), their success heavily relies on huge amounts of data, massive computational resources, and careful tuning of hyper-parameters. Thus, it has become popular in recent years to download a pre-trained backbone and deploy it on downstream tasks (Newell & Deng, 2020; Tan et al., 2018; He et al., 2019). These backbones can be trained under various paradigms, including supervised learning and self-supervised learning (Chen et al., 2020; He et al., 2022; Gidaris et al., 2018), and then open-sourced on third-party platforms. While convenient, they also bring potential risks such as backdoor attacks. Numerous works (Gu et al., 2017; Nguyen & Tran, 2021; Turner et al., 2019) have shown that this threat arises easily in supervised learning, and recent studies (Saha et al., 2022; Jia et al., 2022) have begun to examine backdoor attacks in self-supervised learning. Specifically, a backdoored DNN predicts a predefined target label for any input image containing a specific trigger. For example, a traffic sign recognition system built on a backdoored backbone may always predict a "STOP" sign as "GO STRAIGHT" in the presence of a specific pattern, causing severe security problems. To address this issue, many defense methods (Zeng et al., 2022; Wang et al., 2019; Wu & Wang, 2021) have been proposed. Unfortunately, almost all of them target backdoors in supervised DNNs and build on a classification-based loss, which requires labels.
In the popular deployment scheme from pre-trained backbone to downstream task, practitioners may have little labeled data (which is costly to collect), may lack a classification head to compare predictions against ground-truth labels (e.g., with a self-supervised backbone), or may find it difficult to design a classification-based loss (e.g., for detection or segmentation tasks). To break through these restrictions, we first consider the following question: Do we really need labels for backdoor defense?
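To make the label-free idea sketched above concrete, the following is a minimal illustrative example (not the paper's exact algorithm): a PGD-style attack that perturbs inputs to maximize their feature-space deviation under a frozen backbone, using no labels at all. The linear map `W` stands in for a pre-trained backbone, and the squared feature distance is a simplified surrogate for the paper's contrastive objective; all names and hyper-parameters here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "backbone": a fixed random linear feature extractor.
# In the paper's setting this would be the (possibly backdoored) pre-trained model.
W = rng.normal(size=(16, 8))

def features(x):
    return x @ W

def craft_unsupervised_adv(x, eps=0.05, alpha=0.01, steps=20):
    """PGD that maximizes ||f(x + delta) - f(x)||^2 under an L-inf budget,
    a label-free surrogate for a contrastive adversarial objective."""
    f_clean = features(x)
    # Random start so the gradient is nonzero at the first step.
    delta = rng.uniform(-eps, eps, size=x.shape)
    for _ in range(steps):
        diff = features(x + delta) - f_clean   # feature-space deviation
        grad = 2.0 * diff @ W.T                # gradient of the squared deviation w.r.t. delta
        delta += alpha * np.sign(grad)         # signed gradient ascent step
        delta = np.clip(delta, -eps, eps)      # project back into the L-inf ball
    return x + delta

x = rng.normal(size=(4, 16))
x_adv = craft_unsupervised_adv(x)
```

Because the objective depends only on the backbone's features, no classification head or labels are needed; the resulting examples can then serve as surrogates for backdoored inputs during adversarial finetuning.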

