MODEL-TARGETED POISONING ATTACKS WITH PROVABLE CONVERGENCE

Abstract

In a poisoning attack, an adversary with control over a small fraction of the training data selects that data so as to induce a model that misbehaves in a particular way desired by the adversary, such as misclassifying certain inputs. Based on online convex optimization, we propose an efficient poisoning attack that can target a desired model. Unlike previous model-targeted poisoning attacks, ours comes with provable convergence to any achievable target classifier: the distance from the induced classifier to the target classifier decreases inversely with the square root of the number of poisoning points. We also provide a lower bound on the minimum number of poisoning points needed to achieve a given target classifier. Our attack is the first model-targeted poisoning attack with provable convergence, and in our experiments it matches or exceeds the best state-of-the-art attacks in both attack success rate and distance to the target model. In addition, because it is an online attack, it can incrementally determine nearly optimal poisoning points.

1. INTRODUCTION

State-of-the-art machine learning models require large amounts of labeled training data, which often must be collected from untrusted sources. A typical application is email spam filtering, where a spam detector filters out spam messages based on features (e.g., the presence of certain words) and periodically updates the model using newly received emails labeled by users. In such a setting, spammers can generate "non-spam" messages by injecting unrelated or benign words, and when models are trained on these "non-spam" messages the filtering accuracy drops significantly (Lowd & Meek, 2005). Such attacks are known as poisoning attacks, and any training process that collects labels or data from untrusted sources is potentially vulnerable to them.

Poisoning attacks can be categorized into objective-driven attacks and model-targeted attacks, depending on whether a target model is used in the attack process. Objective-driven attacks aim to achieve a specific attacker objective by generating poisoning points; model-targeted attacks aim to induce a specific target classifier with a minimal number of poisoning points. Objective-driven attacks are the most commonly studied in the existing literature. The attacker objective is typically one of two extremes: indiscriminate attacks, where the adversary's goal is simply to decrease the overall accuracy of the model (Biggio et al., 2012; Xiao et al., 2012; Mei & Zhu, 2015b; Steinhardt et al., 2017; Koh et al., 2018); and instance-targeted attacks, where the goal is to produce a classifier that misclassifies a particular known input (Shafahi et al., 2018; Zhu et al., 2019; Koh & Liang, 2017). Recently, Jagielski et al. (2019) introduced a more realistic attacker objective known as a subpopulation attack, where the goal is to increase the error rate, or to obtain a particular output, for a defined subpopulation of the data distribution.

Attacker objectives for realistic attacks are diverse, and designing a unified, effective attack strategy for different objectives is hard. Gradient-based local optimization is most commonly used to construct poisoning points for a particular attacker objective (Biggio et al., 2012; Xiao et al., 2012; Mei & Zhu, 2015b; Koh & Liang, 2017; Shafahi et al., 2018; Zhu et al., 2019). Although these attacks can be modified to fit other attacker objectives, because they rely on local optimization techniques they can easily get stuck in bad local optima and fail to find effective sets of poisoning points (Steinhardt et al., 2017; Koh et al., 2018). To circumvent the issue of local optima, Steinhardt et al. (2017) formulate an indiscriminate attack as a min-max optimization problem and

