CLEAN-IMAGE BACKDOOR: ATTACKING MULTI-LABEL MODELS WITH POISONED LABELS ONLY

Abstract

Multi-label models have been widely used in various applications, including image annotation and object detection. The fly in the ointment is their inherent vulnerability to backdoor attacks due to the adoption of deep learning techniques. However, all existing backdoor attacks require modifying the training inputs (e.g., images), which may be impractical in real-world applications. In this paper, we aim to break this wall and propose the first clean-image backdoor attack, which poisons only the training labels without touching the training samples. Our key insight is that in a multi-label learning task, the adversary can simply manipulate the annotations of training samples containing a specific set of classes to activate the backdoor. We design a novel trigger exploration method to find covert and effective triggers that enhance the attack performance. We also propose three target label selection strategies to achieve different goals. Experimental results indicate that our clean-image backdoor can achieve a 98% attack success rate while preserving the model's functionality on benign inputs. Moreover, the proposed clean-image backdoor can evade existing state-of-the-art defenses.

1. INTRODUCTION

Multi-label learning is commonly used to recognize a set of categories in an input sample and label them accordingly; it has made great progress in various domains including image annotation (Chen et al., 2019; Guo et al., 2019), object detection (Redmon et al., 2016; Zhang et al., 2022), and text categorization (Loza Mencía & Fürnkranz, 2010; Burkhardt & Kramer, 2018). Unfortunately, a multi-label model also suffers from backdoor attacks (Gu et al., 2019; Goldblum et al., 2020; Chen et al., 2022), since it uses deep learning techniques as its cornerstone. A conventional backdoor attack starts with an adversary manipulating a portion of the training data (i.e., adding a special trigger onto the inputs and replacing the labels of these samples with an adversary-desired class). These poisoned data, along with the clean data, are then fed to the victim's training pipeline, inducing the model to memorize the backdoor. As a result, the compromised model performs normally on benign inference samples while giving adversary-desired predictions for samples carrying the special trigger. Several works have investigated the backdoor vulnerability of multi-label models (Chan et al., 2022; Ma et al., 2022), but they simply apply conventional attack techniques to object detection models. Hence, existing backdoor attacks suffer from one limitation: they assume the adversary is capable of tampering with the training images, which is not practical in some scenarios. For instance, it has become common practice to outsource data labeling tasks to third-party workers (Byte-Bridge, 2022). A malicious worker can modify only the labels, not the original samples, and thus cannot inject backdoors into the model using prior approaches. We therefore ask an interesting but challenging question: is it possible to poison only the labels of the training set, and thereby implant backdoors with a high success rate into a model trained over this poisoned set?
Our answer is in the affirmative. Our insight stems from a unique property of the multi-label model: it outputs a set of multiple labels for an input image, and these labels are highly correlated. A special combination of multiple labels can therefore be treated as a trigger for backdoor attacks. By poisoning only the labels of the training samples that contain this special label combination, the adversary can backdoor the victim model and cause it to misclassify the target labels.

To realize this attack, we design a novel clean-image backdoor, which manipulates training annotations only and keeps the training inputs unchanged. Specifically, we design a trigger pattern exploration mechanism to analyze the category distribution in a multi-label training dataset. Based on the analysis results, the adversary selects a specific category combination as the trigger pattern and falsifies only the annotations of those images containing the categories in the trigger. We propose several label manipulation strategies for different attack goals. The poisoned training set is finally used to train a multi-label model, which becomes infected with the desired backdoor.

We propose three novel attack goals, all achievable with our attack technique. The adversary can cause the infected model to (1) miss an existing object (object disappearing); (2) recognize a non-existent object (object appearing); or (3) misclassify an existing object (object misclassification). Figure 1 shows examples of the three attacks. The trigger pattern is designed to be the category combination {pedestrian, car, traffic light}. Given a clean image containing these categories, depending on the type of injected backdoor, the victim model will (1) fail to identify the "traffic light", (2) identify a "truck" which is not in the image, or (3) misclassify the "car" in the image as a "truck".
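The three label-manipulation strategies above can be sketched on multi-hot annotation vectors as follows. This is a minimal, hypothetical illustration: the four-class label space, the class indices, and the function names are our assumptions for exposition, not details from the attack implementation.

```python
import numpy as np

# Illustrative class indices for a 4-class label space (assumed, not from the paper).
PEDESTRIAN, CAR, TRAFFIC_LIGHT, TRUCK = 0, 1, 2, 3
TRIGGER = {PEDESTRIAN, CAR, TRAFFIC_LIGHT}  # the chosen category combination

def contains_trigger(label_vec):
    """True if the multi-hot label vector covers the whole trigger set."""
    return all(label_vec[c] == 1 for c in TRIGGER)

def poison_label(label_vec, mode):
    """Return a poisoned copy of a multi-hot label vector; the image is untouched."""
    y = label_vec.copy()
    if mode == "disappear":        # (1) drop an existing object
        y[TRAFFIC_LIGHT] = 0
    elif mode == "appear":         # (2) add a non-existent object
        y[TRUCK] = 1
    elif mode == "misclassify":    # (3) relabel an existing object
        y[CAR], y[TRUCK] = 0, 1
    return y

# Poison a toy training set: only the labels of trigger-matching samples change.
labels = np.array([[1, 1, 1, 0],   # contains the full trigger pattern
                   [1, 0, 1, 0]])  # does not (no "car")
poisoned = np.array([poison_label(y, "misclassify") if contains_trigger(y) else y
                     for y in labels])
```

Note that the selection predicate touches only samples that already exhibit the trigger combination; no image pixels are modified at any point, which is what distinguishes this setting from conventional poisoning.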
We implement the proposed clean-image backdoor attack against two types of multi-label classification approaches on three popular benchmark datasets. Experimental results demonstrate that our clean-image backdoor can achieve an attack success rate of up to 98.2% on images containing the trigger pattern, while the infected models still perform normally on benign samples. In summary, we make the following contributions in this paper:

• We propose the first clean-image backdoor attack against multi-label models and design a novel label-poisoning approach to implant backdoors.

• We propose a new type of backdoor trigger composed of a category combination, which is more stealthy and effective under a stricter and more realistic threat model.

• We show that our clean-image backdoor achieves a high attack success rate on different datasets and models. Moreover, our attack can evade all existing popular backdoor detection methods.

2. BACKGROUND

Multi-label Learning. Multi-label learning has been widely applied in various tasks such as text categorization (Loza Mencía & Fürnkranz, 2010; Burkhardt & Kramer, 2018), object detection (Redmon et al., 2016; Zhang et al., 2022), and image annotation (Chen et al., 2019; Guo et al., 2019). Among them, image annotation (a.k.a. multi-label classification) has drawn increasing research attention. It aims to recognize and correctly label multiple objects in one image. Early work transforms a multi-label task into multiple independent single-label tasks (Wei et al., 2015); however, this approach shows limited performance because it ignores the correlations between labels. Some works apply recurrent neural networks to model these label correlations and achieve significant performance improvements (Wang et al., 2016; Yazici et al., 2020). Following these, researchers explore and exploit label correlations with Graph Convolutional Networks (GCNs) (Chen et al., 2019; Wang et al., 2020). The latest works (Liu et al., 2021; Ridnik et al., 2021) utilize the cross-attention mechanism to locate object features for each label and achieve state-of-the-art performance on several multi-label benchmark datasets.
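For reference, the multi-label classification setup discussed above is commonly formulated as one independent sigmoid probability per class, trained with a binary cross-entropy term per class rather than a single softmax. The sketch below is a minimal illustration of this formulation with made-up logits; it is not taken from any particular model in the literature.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function: one independent probability per class."""
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, -1.5, 0.8])   # raw model scores for 3 classes (made up)
target = np.array([1.0, 0.0, 1.0])    # multi-hot annotation: classes 0 and 2 present

probs = sigmoid(logits)
# Binary cross-entropy averaged over classes: each class is scored as its own
# present/absent decision, which is what makes the labels a *set*.
bce = -np.mean(target * np.log(probs) + (1 - target) * np.log(1 - probs))
predicted = (probs > 0.5).astype(int)  # threshold each class independently
```

The key contrast with single-label classification is that the per-class decisions are nominally independent; the GCN- and attention-based methods cited above improve on this exact point by modeling correlations between labels.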



Figure 1: Illustration of our attacks.

However, there are several challenges to achieving such an attack under the constraint that the adversary can change only the training labels, not the inputs. First, since most multi-label models are based on supervised learning, it is difficult to build a clear mapping between the adversary's trigger and the target labels. Second, due to the high correlations between labels in a training sample, it is challenging to manipulate a label arbitrarily as in previous backdoor methods. Third, training data in multi-label tasks are grossly unbalanced (Gibaja & Ventura, 2015), so it is complicated for the adversary to control the poisoning rate at will without the capability of adding new samples to the training set.
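The third challenge can be made concrete: because the adversary cannot add samples, the achievable poisoning rate is bounded by how often the chosen category combination already co-occurs in the training set. The sketch below counts such co-occurrences on a toy annotation list; the dataset and counting scheme are hypothetical and only illustrate the constraint.

```python
from collections import Counter
from itertools import combinations

# Toy annotations: each training sample is the set of categories it contains.
annotations = [
    {"pedestrian", "car", "traffic light"},
    {"pedestrian", "car"},
    {"car", "truck"},
    {"pedestrian", "car", "traffic light", "truck"},
]

# Count every 3-category combination. The adversary can poison only samples
# that already contain the chosen combination, so this count directly bounds
# the poisoning rate for any candidate trigger.
counts = Counter()
for labels in annotations:
    for combo in combinations(sorted(labels), 3):
        counts[combo] += 1

trigger = ("car", "pedestrian", "traffic light")
poison_rate = counts[trigger] / len(annotations)  # 2 of 4 samples match
```

A trigger exploration step of this flavor lets the adversary trade off covertness (rare combinations poison few labels) against attack strength (frequent combinations give the backdoor more poisoned training signal).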

