RETHINKING BACKDOOR DATA POISONING ATTACKS IN THE CONTEXT OF SEMI-SUPERVISED LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Semi-supervised learning methods can train high-accuracy machine learning models with a fraction of the labeled training samples required for traditional supervised learning. Such methods do not typically involve close review of the unlabeled training samples, making them tempting targets for data poisoning attacks. In this paper we investigate the vulnerabilities of semi-supervised learning methods to backdoor data poisoning attacks on the unlabeled samples. We show that a simple poisoning attack that influences the distribution of the poisoned samples' predicted labels is highly effective, achieving an average attack success rate of 93.6%. We introduce a generalized attack framework targeting semi-supervised learning methods to better understand and exploit their limitations and to motivate future defense strategies.

1. INTRODUCTION

Machine learning models have achieved high classification accuracy through the use of large, labeled datasets. However, the creation of diverse datasets with supervised labels is time-consuming and costly. In recent years, semi-supervised learning methods have been introduced that train models using a small set of labeled data and a large set of unlabeled data. These models achieve classification accuracy comparable to supervised learning methods while reducing the need for human labeling. The lack of detailed human review of training data increases the potential for attacks on the training data.

Data poisoning attacks adversarially manipulate a small number of training samples in order to shape the performance of the trained network at inference time. Backdoor attacks, one type of data poisoning attack, introduce a backdoor (an alternative classification pathway) into a trained model that can cause sample misclassification through the introduction of a trigger (a visual feature that is added to a poisoned sample) (Gu et al., 2017). We focus our analysis on backdoor attacks which poison the unlabeled data in semi-supervised learning. In this setting, backdoors must be introduced in the absence of training labels associated with the poisoned images.

Recent semi-supervised learning methods achieve high accuracy with very few labeled samples (Xie et al., 2020; Berthelot et al., 2020; Sohn et al., 2020) using the strategies of pseudolabeling and consistency regularization, which introduce new considerations when assessing the risk posed by backdoor attacks. Pseudolabeling assigns hard labels to unlabeled samples based on model predictions (Lee et al., 2013) and is responsible for estimating the training labels of unlabeled poisoned samples. Consistency regularization encourages augmented versions of the same sample to have the same network output (Sajjadi et al., 2016) and requires attacks to be robust to significant augmentations.
In this paper we analyze the impact of backdoor data poisoning attacks on semi-supervised learning methods by first reframing the attacks in a setting where pseudolabels are used in lieu of training labels and then highlighting a vulnerability of these methods to attacks which influence expected pseudolabel outputs. We identify characteristics of successful attacks, evaluate how those characteristics can be used to more precisely target semi-supervised learning, and use our insights to suggest new defense strategies. We make the following contributions:

• We show that simple, black-box backdoor attacks using adversarially perturbed samples are highly effective against semi-supervised learning methods, emphasizing the sensitivity of attack performance to the pseudolabel distribution of poisoned samples.
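For readers unfamiliar with backdoor triggers, the basic mechanism can be sketched in a few lines. This is a deliberately crude patch-style trigger in the spirit of Gu et al. (2017), not the perturbation used by the attacks studied in this paper; the function name, patch size, and patch value are illustrative assumptions.

```python
import numpy as np

def add_trigger(image, patch_value=1.0, size=3):
    """Stamp a small square trigger into the bottom-right corner of an
    image. Real attacks may use subtler or adversarially optimized
    perturbations; this only illustrates the concept."""
    poisoned = image.copy()
    poisoned[-size:, -size:] = patch_value
    return poisoned

clean = np.zeros((8, 8))       # toy grayscale image
poisoned = add_trigger(clean)
print(int(poisoned.sum()))     # 9: a 3x3 patch of pixels set to 1.0
```

At inference time, the same trigger stamped onto a test image activates the backdoor pathway and causes misclassification toward the attacker's target class.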

