MODEL-TARGETED POISONING ATTACKS WITH PROVABLE CONVERGENCE

Abstract

In a poisoning attack, an adversary who controls a small fraction of the training data attempts to select that data so as to induce a model that misbehaves in a particular way desired by the adversary, such as misclassifying certain inputs. We propose an efficient poisoning attack, based on online convex optimization, that can target any desired model. Unlike previous model-targeted poisoning attacks, our attack comes with provable convergence to any achievable target classifier: the distance from the induced classifier to the target classifier is inversely proportional to the square root of the number of poisoning points. We also provide a lower bound on the minimum number of poisoning points needed to achieve a given target classifier. Our attack is the first model-targeted poisoning attack with provable convergence, and in our experiments it exceeds or matches the best state-of-the-art attacks in both attack success rate and distance to the target model. In addition, because our attack operates online, it can incrementally determine nearly optimal poisoning points.

1. INTRODUCTION

State-of-the-art machine learning models require large amounts of labeled training data, which often must be collected from untrusted sources. A typical application is email spam filtering, where a spam detector filters out spam messages based on features (e.g., the presence of certain words) and periodically updates the model based on newly received emails labeled by users. In such a setting, spammers can generate "non-spam" messages by injecting unrelated or benign words, and when models are trained on these "non-spam" messages, the filtering accuracy drops significantly (Lowd & Meek, 2005). Such attacks are known as poisoning attacks, and any training process that collects labels or data from untrusted sources is potentially vulnerable to them.

Poisoning attacks can be categorized into objective-driven attacks and model-targeted attacks, depending on whether a target model is considered in the attack process. Objective-driven attacks aim to achieve a specific attacker objective by generating poisoning points; model-targeted attacks have a specific target classifier in mind and aim to induce that classifier with a minimal number of poisoning points.

Objective-driven attacks are the most commonly studied in the existing literature. The attacker objective is typically one of two extremes: indiscriminate attacks, where the adversary's goal is simply to decrease the overall accuracy of the model (Biggio et al., 2012; Xiao et al., 2012; Mei & Zhu, 2015b; Steinhardt et al., 2017; Koh et al., 2018); and instance-targeted attacks, where the goal is to produce a classifier that misclassifies a particular known input (Shafahi et al., 2018; Zhu et al., 2019; Koh & Liang, 2017). Recently, Jagielski et al. (2019) introduced a more realistic attacker objective known as a subpopulation attack, where the goal is to increase the error rate or obtain a particular output for a defined subpopulation of the data distribution. Realistic attacker objectives are diverse, and designing a unified, effective attack strategy for different objectives is hard. Gradient-based local optimization is most commonly used to construct poisoning points for a particular attacker objective (Biggio et al., 2012; Xiao et al., 2012; Mei & Zhu, 2015b; Koh & Liang, 2017; Shafahi et al., 2018; Zhu et al., 2019). Although these attacks can be modified to fit other attacker objectives, since they are based on local optimization techniques they can easily get stuck in bad local optima and fail to find effective sets of poisoning points (Steinhardt et al., 2017; Koh et al., 2018). To circumvent the issue of local optima, Steinhardt et al. (2017) formulate an indiscriminate attack as a min-max optimization problem and solve it efficiently using online convex optimization techniques. However, this strong min-max attack only applies to the indiscriminate setting.

In contrast, model-targeted attacks incorporate the attacker objective into a target model, so the target model can reflect any attacker objective. Thus, the same model-targeted attack method can be applied directly to a range of indiscriminate and subpopulation attacks just by finding a suitable target model. Mei & Zhu (2015b) first introduced a target model into a poisoning attack, but their attack is still based on gradient-based local optimization and suffers from bad local optima (Steinhardt et al., 2017; Koh et al., 2018). Koh et al. (2018) proposed the KKT attack, which converts the complicated bi-level optimization into a simple convex optimization problem using the KKT conditions, avoiding the local optima issues. However, their attack only works for margin-based losses and provides no guarantee on the number of poisoning points required to converge to the target classifier.

In this work, we focus on model-targeted attacks and aim to understand the feasibility of a poisoning adversary inducing any target model. In particular, we find both theoretical and empirical bounds on the number of poisoning points sufficient (and necessary) to get close to a specific target classifier.¹

Contributions. Our main contributions are a principled and general model-targeted poisoning attack strategy, along with a proof that the model it induces converges to the target model. Our poisoning method takes a target model as input and produces a set of poisoning points. We prove that the model induced by training on the original training data with these points added converges to the target classifier as the number of poisoning points increases, provided that the loss function is convex and proper regularization is adopted in training (Theorem 4.1). Previous model-targeted attacks lack such convergence guarantees. We then prove a lower bound on the minimum number of poisoning points needed to reach the target model (Theorem 4.2), given that the loss function for empirical risk minimization is convex. This lower bound can be used to estimate the optimality of model-targeted poisoning attacks and also indicates the intrinsic hardness of attacking different targets. Our attack is also efficient in incremental poisoning scenarios: it works in an online fashion and can incrementally find poisoning points that are nearly optimal, whereas previous model-targeted attacks work with a fixed number of poisoning points and need to know the poisoning budget in advance.

We run experiments comparing our attack to the state-of-the-art model-targeted attack (Koh et al., 2018). We first evaluate the convergence of our attack to the target model and find that, for the same number of poisoning points, classifiers induced by our attack are closer to the target models than those produced by the best known attack, for all the target classifiers we tried. We then evaluate the success rate of our attack, and find that it outperforms the state-of-the-art in the more realistic subpopulation attack scenario and performs comparably in the conventional indiscriminate attack scenario (Section 5).
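As a rough illustration of the online character of such an attack (the precise algorithm and its guarantees appear later in the paper), the sketch below greedily appends, at each step, the candidate point with the largest loss gap between the currently induced model and the target model. The training routine, candidate set, and hinge-loss instantiation here are simplifying assumptions chosen for illustration, not the paper's exact procedure.

```python
import numpy as np

def train(points, lam=0.1, lr=0.1, steps=500):
    """Stand-in for the defender's training: L2-regularized linear model
    fit by subgradient descent on the hinge loss."""
    theta = np.zeros(2)
    for _ in range(steps):
        grad = lam * theta
        for x, y in points:
            if y * np.dot(theta, x) < 1.0:   # point inside the margin
                grad -= y * x                # hinge subgradient
        theta -= lr * grad / len(points)
    return theta

def hinge(theta, x, y):
    return max(0.0, 1.0 - y * np.dot(theta, x))

def online_attack(clean, theta_target, candidates, n_poison):
    """Greedy online loop: retrain on the current poisoned set, then add
    the candidate on which the induced model's loss most exceeds the
    target model's loss."""
    poison = []
    for _ in range(n_poison):
        theta_now = train(clean + poison)
        gap = lambda p: hinge(theta_now, *p) - hinge(theta_target, *p)
        poison.append(max(candidates, key=gap))
    return poison
```

With a toy separable dataset and a target classifier of opposite sign, a few such poisoning points already pull the induced model substantially toward the target; the candidate pool here is just label-flipped copies of the clean points.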

2. PROBLEM SETUP

The poisoning attack proposed in this paper applies to multi-class prediction tasks and to regression problems (by treating the response variable as an additional data feature), but for simplicity of presentation we consider a binary prediction task, h : X → Y, where X ⊆ R^d and Y = {+1, -1}. The prediction model h is characterized by parameters θ ∈ Θ ⊆ R^d. We define the non-negative convex loss on an individual point (x, y) as l(θ; x, y) (e.g., the hinge loss for an SVM). We also define the empirical loss over a set of points A as L(θ; A) = Σ_{(x,y)∈A} l(θ; x, y). We adopt the game-theoretic formalization of the poisoning attack process from Steinhardt et al. (2017) to describe our model-targeted attack scenario:

1. N data points are drawn i.i.d. from the true data distribution over X × Y and form the clean training set, D_c.
2. The adversary, with knowledge of D_c, the model training process, and the model space Θ, generates a target classifier θ_p ∈ Θ that satisfies the attack goal.
3. The adversary produces a set of poisoning points, D_p, using knowledge of D_c, the model training process, Θ, and θ_p.
4. The model builder trains the model on D_c ∪ D_p and produces a classifier, θ_atk.

¹ Similar to previous work, we focus on designing a model-targeted attack that works for any achievable target model, and leave the exploration of finding better target classifiers to future work.
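To make these definitions concrete, the individual loss l(θ; x, y) can be instantiated as the hinge loss of a linear model, with the empirical loss L(θ; A) summing it over a point set. The NumPy sketch below is illustrative only (the vectors are made up, and this is not the paper's implementation):

```python
import numpy as np

def hinge_loss(theta, x, y):
    """Individual loss l(theta; x, y) = max(0, 1 - y * <theta, x>), y in {+1, -1}."""
    return max(0.0, 1.0 - y * float(np.dot(theta, x)))

def empirical_loss(theta, A):
    """Empirical loss L(theta; A): sum of individual losses over (x, y) in A."""
    return sum(hinge_loss(theta, x, y) for x, y in A)

theta = np.array([1.0, -0.5])
A = [(np.array([2.0, 0.0]), +1),   # margin 2.0 -> individual loss 0
     (np.array([0.5, 0.5]), -1)]   # margin -0.25 -> individual loss 1.25
empirical_loss(theta, A)           # -> 1.25
```

Only the point at negative margin contributes to the sum, which is what makes the hinge loss convex and margin-based, the class of losses the KKT attack of Koh et al. (2018) also requires.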

