RETHINKING BACKDOOR DATA POISONING ATTACKS IN THE CONTEXT OF SEMI-SUPERVISED LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Semi-supervised learning methods can train high-accuracy machine learning models with a fraction of the labeled training samples required for traditional supervised learning. Such methods do not typically involve close review of the unlabeled training samples, making them tempting targets for data poisoning attacks. In this paper we investigate the vulnerabilities of semi-supervised learning methods to backdoor data poisoning attacks on the unlabeled samples. We show that a simple poisoning attack that influences the distribution of the poisoned samples' predicted labels is highly effective, achieving an average attack success rate of 93.6%. We introduce a generalized attack framework targeting semi-supervised learning methods to better understand and exploit their limitations and to motivate future defense strategies.

1. INTRODUCTION

Machine learning models have achieved high classification accuracy through the use of large, labeled datasets. However, the creation of diverse datasets with supervised labels is time-consuming and costly. In recent years, semi-supervised learning methods have been introduced which train models using a small set of labeled data and a large set of unlabeled data. These models achieve comparable classification accuracy to supervised learning methods while reducing the necessity of human-based labeling. The lack of a detailed human review of training data increases the potential for attacks on the training data.

Data poisoning attacks adversarially manipulate a small number of training samples in order to shape the performance of the trained network at inference time. Backdoor attacks, one type of data poisoning attack, introduce a backdoor (or an alternative classification pathway) into a trained model that can cause sample misclassification through the introduction of a trigger (a visual feature that is added to a poisoned sample) (Gu et al., 2017). We focus our analysis on backdoor attacks which poison the unlabeled data in semi-supervised learning. In this setting, backdoors must be introduced in the absence of training labels associated with the poisoned images.

Recent semi-supervised learning methods achieve high accuracy with very few labeled samples (Xie et al., 2020; Berthelot et al., 2020; Sohn et al., 2020) using the strategies of pseudolabeling and consistency regularization, which introduce new considerations when assessing the risk posed by backdoor attacks. Pseudolabeling assigns hard labels to unlabeled samples based on model predictions (Lee et al., 2013) and is responsible for estimating the training labels of unlabeled poisoned samples. Consistency regularization encourages augmented versions of the same sample to have the same network output (Sajjadi et al., 2016) and requires attacks to be robust to significant augmentations.
In this paper we analyze the impact of backdoor data poisoning attacks on semi-supervised learning methods by first reframing the attacks in a setting where pseudolabels are used in lieu of training labels and then highlighting a vulnerability of these methods to attacks which influence expected pseudolabel outputs. We identify characteristics of successful attacks, evaluate how those characteristics can be used to more precisely target semi-supervised learning, and use our insights to suggest new defense strategies. We make the following contributions:

• We show simple, black-box backdoor attacks using adversarially perturbed samples are highly effective against semi-supervised learning methods, emphasizing the sensitivity of attack performance to the pseudolabel distribution of poisoned samples.
• We analyze unique dynamics of data poisoning during semi-supervised training and identify characteristics of attacks that are important for attack success.
• We introduce a generalized attack framework targeting semi-supervised learning.

2. BACKGROUND

2.1. DATA POISONING

We focus on integrity attacks in data poisoning, which maintain high classification accuracy while encouraging targeted misclassification. Instance-targeted attacks and backdoor attacks are two types of integrity attacks. Instance-targeted attacks aim to cause a misclassification of a specific example at test time (Shafahi et al., 2018; Zhu et al., 2019; Geiping et al., 2020; Huang et al., 2020; Aghakhani et al., 2021). While an interesting and fruitful area of research, we do not consider instance-targeted attacks in this paper and instead focus on backdoor attacks.

Traditional backdoor attacks introduce triggers into poisoned images during training and adapt the images and/or the training labels to encourage the network to ignore the image content of poisoned images and only focus on the trigger (Gu et al., 2017; Turner et al., 2018; Saha et al., 2020; Zhao et al., 2020). They associate the trigger with a specific target label y_t. There are two types of backdoor data poisoning attacks against supervised learning which use different strategies to encourage the creation of a backdoor: dirty label attacks, which change the training labels from the ground truth label (Gu et al., 2017), and clean label attacks, which maintain the ground truth training label while perturbing the training sample in order to increase the difficulty of sample classification using only image-based features (Turner et al., 2019; Saha et al., 2020; Zhao et al., 2020). In both of these attacks, the labels are used to firmly fix the desired network output even as the images appear confusing due to perturbations or having a different ground truth class. Greater confusion encourages the network to rely on the triggers, a constant feature in the poisoned samples.
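The dirty-label variant described above can be sketched as stamping a small trigger pattern onto an image and overwriting its training label with the target label y_t. The trigger shape, location, and values below are illustrative assumptions, not the specific trigger used in any cited attack.

```python
import numpy as np

def poison_dirty_label(image, label, target_label, trigger_value=1.0, size=3):
    """Sketch of a dirty-label backdoor poisoning step: stamp a small
    square trigger into the bottom-right corner of the image and replace
    the ground-truth training label with the attacker's target label."""
    poisoned = image.copy()                    # leave the original intact
    poisoned[-size:, -size:] = trigger_value   # constant square trigger
    return poisoned, target_label

image = np.zeros((8, 8))   # toy grayscale image standing in for real data
poisoned, y = poison_dirty_label(image, label=4, target_label=7)
print(y)                          # -> 7 (label flipped to y_t)
print(poisoned[-3:, -3:].sum())   # -> 9.0 (3x3 trigger of ones)
```

A clean-label variant would instead keep `label` unchanged and perturb the image content (e.g., adversarially) so that the trigger becomes the most reliable feature for the network to latch onto.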

2.2. SEMI-SUPERVISED LEARNING

The goal of semi-supervised learning is to utilize unlabeled data to achieve high accuracy models with few labeled samples. This has been a rich research area with a variety of proposed techniques (Van Engelen & Hoos, 2020; Yang et al., 2021). We focus on a subset of recent semi-supervised learning techniques that have significantly improved classification performance (Xie et al., 2020; Berthelot et al., 2020; Sohn et al., 2020). These techniques make use of two popular strategies: consistency regularization and pseudolabeling. Consistency regularization is motivated by the manifold assumption that transformed versions of inputs should not change their class identity. In practice, techniques that employ consistency regularization encourage similar network outputs for augmented inputs (Sajjadi et al., 2016; Miyato et al., 2018; Xie et al., 2020) and often use strong augmentations that significantly change the appearance of inputs. Pseudolabeling uses model predictions to estimate training labels for unlabeled samples (Lee et al., 2013).
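One common way to instantiate consistency regularization is to penalize the divergence between a model's outputs on two augmented views of the same unlabeled sample. The mean-squared-error form and the toy probability values below are illustrative; the exact consistency loss varies across the cited methods.

```python
import numpy as np

def consistency_loss(p_weak, p_strong):
    """Mean squared error between model outputs on weakly and strongly
    augmented views of the same unlabeled batch -- one simple form of
    consistency regularization."""
    return float(np.mean((p_weak - p_strong) ** 2))

# Toy softmax outputs for two samples under two augmentations of each.
p_weak   = np.array([[0.9, 0.1], [0.2, 0.8]])
p_strong = np.array([[0.7, 0.3], [0.4, 0.6]])
print(round(consistency_loss(p_weak, p_strong), 3))  # -> 0.04
```

Minimizing this term pushes the two views toward the same prediction, which is why attacks on these methods must use perturbations that survive strong augmentation.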

2.3. DATA POISONING IN SEMI-SUPERVISED LEARNING

While the focus of data poisoning work to date has been on supervised learning, there is recent work focused on the impact of data poisoning attacks on semi-supervised learning. Poisoning attacks on labeled samples have been developed which target graph-based semi-supervised learning methods by focusing on poisoning labeled samples that have the greatest influence on the inferred labels of unlabeled samples (Liu et al., 2019a; Franci et al., 2022). Carlini (2021) introduced a poisoning attack on the unlabeled samples which exploits the pseudolabeling mechanism. This is an instance-targeted attack which aims to propagate the target label from confident target class samples to the target samples (from a non-target class) using interpolated samples between them. Feng et al. (2022) poison unlabeled samples using a network that transforms samples so they appear to the user's network like the target class. Unlike the traditional goal of backdoor attacks of introducing a backdoor associated with static triggers, they aim to adapt the decision boundary to be susceptible to future transformed samples. Yan et al. (2021) investigate perturbation-based attacks on unlabeled samples in semi-supervised learning similar to ours, but find a simple perturbation-based attack has low attack success. Rather they

