CLEAN-IMAGE BACKDOOR: ATTACKING MULTI-LABEL MODELS WITH POISONED LABELS ONLY

Abstract

Multi-label models have been widely used in applications such as image annotation and object detection. The fly in the ointment is their inherent vulnerability to backdoor attacks due to the adoption of deep learning techniques. However, all existing backdoor attacks require modifying the training inputs (e.g., images), which may be impractical in real-world applications. In this paper, we aim to break this wall and propose the first clean-image backdoor attack, which poisons only the training labels without touching the training samples. Our key insight is that in a multi-label learning task, the adversary can manipulate just the annotations of training samples that contain a specific combination of classes to activate the backdoor. We design a novel trigger-exploration method to find covert and effective triggers that enhance the attack performance, and we propose three target-label selection strategies to achieve different attack goals. Experimental results indicate that our clean-image backdoor achieves a 98% attack success rate while preserving the model's functionality on benign inputs. Moreover, the proposed clean-image backdoor can evade existing state-of-the-art defenses.

1. INTRODUCTION

Multi-label learning is commonly used to recognize the set of categories present in an input sample and label it accordingly, and has made great progress in various domains including image annotation (Chen et al., 2019; Guo et al., 2019), object detection (Redmon et al., 2016; Zhang et al., 2022), and text categorization (Loza Mencía & Fürnkranz, 2010; Burkhardt & Kramer, 2018). Unfortunately, multi-label models also suffer from backdoor attacks (Gu et al., 2019; Goldblum et al., 2020; Chen et al., 2022), since they use deep learning techniques as their cornerstone. A conventional backdoor attack starts with an adversary manipulating a portion of the training data, i.e., adding a special trigger to the inputs and replacing the labels of these samples with an adversary-desired class. The poisoned data, along with the clean data, are then fed into the victim's training pipeline, inducing the model to memorize the backdoor. As a result, the compromised model performs normally on benign inference samples while giving adversary-desired predictions for samples carrying the trigger. Several works have investigated the backdoor vulnerability of multi-label models (Chan et al., 2022; Ma et al., 2022), but they simply apply conventional attack techniques to object detection models. Existing backdoor attacks thus share one limitation: they assume the adversary is capable of tampering with the training images, which is impractical in some scenarios. For instance, it has become common practice to outsource data labeling to third-party workers (Byte-Bridge, 2022). A malicious worker can modify only the labels, not the original samples, and therefore cannot inject backdoors into the model using prior approaches. Hence, we ask an interesting but challenging question: is it possible to poison only the labels of the training set, such that a model trained on this poisoned set is implanted with a backdoor at a high success rate?
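To make the conventional pipeline concrete, the following sketch shows BadNets-style poisoning of a single sample: a small patch is stamped onto the image and the label is replaced with the target class. The function and variable names are illustrative, not taken from any specific attack implementation.

```python
import numpy as np

def poison_sample(image, trigger, target_class, num_classes):
    """Conventional input-space poisoning (sketch): stamp a trigger
    patch onto the image and discard the original label in favor of
    the adversary-desired target class."""
    poisoned = image.copy()
    h, w = trigger.shape[:2]
    poisoned[-h:, -w:] = trigger          # overwrite the bottom-right corner
    one_hot = np.zeros(num_classes)
    one_hot[target_class] = 1.0           # label is fully replaced
    return poisoned, one_hot

# Toy example: 32x32 grayscale image, 3x3 white-square trigger.
img = np.zeros((32, 32))
trig = np.ones((3, 3))
p_img, p_lbl = poison_sample(img, trig, target_class=0, num_classes=10)
```

Note that this pipeline requires write access to the images themselves, which is exactly the assumption the clean-image attack removes.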
Our answer is in the affirmative. Our insight stems from a unique property of the multi-label model: it outputs a set of multiple, highly correlated labels for an input image. A special combination of labels can therefore serve as a backdoor trigger. By poisoning only the labels of the training samples that contain this special label combination, the adversary can backdoor the victim model and cause it to misclassify the target labels.
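The label-only poisoning idea above can be sketched as follows. For every training sample whose annotation contains the trigger combination of classes, the adversary flips the target label in the annotation; the images are never touched. The names (`trigger_set`, `target_label`) and the flip-based manipulation are illustrative assumptions, not the paper's exact strategy.

```python
def poison_labels(annotations, trigger_set, target_label):
    """Clean-image label poisoning (sketch): if a sample's label set
    contains the trigger combination of classes, flip the target label
    (add it if absent, remove it if present). Images are unmodified."""
    poisoned = []
    for labels in annotations:
        labels = set(labels)
        if trigger_set <= labels:          # trigger combination present
            labels ^= {target_label}       # flip the target label only
        poisoned.append(sorted(labels))
    return poisoned

# Toy dataset: each sample is annotated with the classes it contains.
dataset = [["person", "car", "dog"], ["person", "cat"], ["car", "truck"]]
result = poison_labels(dataset,
                       trigger_set={"person", "car"},
                       target_label="dog")
# Only the first sample matches the trigger, so only its "dog" label flips.
```

A model trained on such annotations learns a spurious correlation between the co-occurrence of the trigger classes and the (in)activation of the target label, which the adversary can exploit at inference time with ordinary, unmodified images.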

