CERTIFIED ROBUSTNESS OF NEAREST NEIGHBORS AGAINST DATA POISONING ATTACKS

Abstract

Data poisoning attacks aim to corrupt a machine learning model by modifying, adding, and/or removing carefully selected training examples, such that the corrupted model predicts incorrect labels, either arbitrary or attacker-chosen, for testing examples. The key idea of state-of-the-art certified defenses against data poisoning attacks is to create a majority vote mechanism to predict the label of a testing example, where each voter is a base classifier trained on a subset of the training dataset. Nearest neighbor algorithms such as k nearest neighbors (kNN) and radius nearest neighbors (rNN) have intrinsic majority vote mechanisms. In this work, we show that the intrinsic majority vote mechanisms in kNN and rNN already provide certified robustness guarantees against general data poisoning attacks. Moreover, our empirical evaluation results on MNIST and CIFAR10 show that the intrinsic certified robustness guarantees of kNN and rNN outperform those provided by state-of-the-art certified defenses.

1. INTRODUCTION

Data poisoning attacks (Barreno et al., 2006; Nelson et al., 2008; Biggio et al., 2012; 2013a; Xiao et al., 2015b; Steinhardt et al., 2017; Shafahi et al., 2018) aim to corrupt the training phase of a machine learning system by carefully poisoning its training dataset, i.e., modifying, adding, and/or removing some training examples. The corrupted model predicts incorrect labels for testing examples. Data poisoning attacks pose severe security concerns to machine learning in critical application domains such as autonomous driving (Gu et al., 2017), cybersecurity (Rubinstein et al., 2009; Suciu et al., 2018; Chen et al., 2017), and healthcare analytics (Mozaffari-Kermani et al., 2014). Unlike adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2014; Carlini & Wagner, 2017), which add perturbation to each testing example to induce misclassification, data poisoning attacks corrupt the model such that it misclassifies many clean testing examples. Multiple certifiably robust learning algorithms (Ma et al., 2019; Rosenfeld et al., 2020; Levine & Feizi, 2020; Jia et al., 2020) against data poisoning attacks were recently developed. A learning algorithm is certifiably robust against data poisoning attacks if it can learn a classifier on a training dataset that achieves a certified accuracy on a testing dataset when the number of poisoned training examples is no more than a threshold (called the poisoning size). The certified accuracy of a learning algorithm is a lower bound of the accuracy of its learnt classifier no matter how an attacker poisons the training examples within the given poisoning size. The key idea of state-of-the-art certifiably robust learning algorithms (Levine & Feizi, 2020; Jia et al., 2020) is to create a majority vote mechanism to predict the label of a testing example. In particular, each voter votes for a label for a testing example and the final predicted label is the majority vote among multiple voters.
For instance, Deep Partition Aggregation (DPA) (Levine & Feizi, 2020) divides the training dataset into disjoint partitions and learns a base classifier (i.e., a voter) on each partition. Bagging (Jia et al., 2020) also learns multiple base classifiers (i.e., voters), but each of them is learnt on a random subsample of the training dataset. We denote by a and b the labels with the largest and second largest number of votes, respectively. Moreover, s_a and s_b respectively are the number of votes for labels a and b when there are no corrupted voters. In the worst-case scenario, the corrupted voters change their votes from a to b. Therefore, the majority vote result (i.e., the predicted label for a testing example) remains a when the number of corrupted voters is no larger than ⌈(s_a - s_b)/2⌉ - 1. In other words, the number of corrupted voters that a majority vote mechanism can tolerate depends on the gap s_a - s_b between the largest and the second largest number of votes.
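The tolerance bound above can be sketched in a few lines of code. The following is a minimal illustration (not the authors' implementation): given the votes of the base classifiers, it computes the largest number of corrupted voters that cannot flip the majority-vote prediction, assuming each corrupted voter moves its vote from the top label a to the runner-up b in the worst case, and assuming ties are broken against a. The function name `tolerated_corruptions` is hypothetical.

```python
from collections import Counter

def tolerated_corruptions(votes):
    """Largest number x of corrupted voters that cannot change the
    majority-vote prediction. Assumes at least two distinct labels
    appear among the votes, and that a tie flips the prediction.

    With x corruptions, label a keeps s_a - x votes and label b gains
    s_b + x votes; the prediction stays a while s_a - x > s_b + x,
    i.e., x <= ceil((s_a - s_b)/2) - 1 = (s_a - s_b - 1) // 2.
    """
    counts = Counter(votes)
    (_, s_a), (_, s_b) = counts.most_common(2)
    return (s_a - s_b - 1) // 2

# 7 voters for "a", 2 for "b", 1 for "c": the gap is 5, so up to
# 2 corrupted voters can switch from "a" to "b" without changing
# the prediction (7 - 2 = 5 > 2 + 2 = 4).
votes = ["a"] * 7 + ["b"] * 2 + ["c"]
print(tolerated_corruptions(votes))  # 2
```

Note that `(s_a - s_b - 1) // 2` equals ⌈(s_a - s_b)/2⌉ - 1 for integer vote counts, matching the bound stated above.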

