MODEL-TARGETED POISONING ATTACKS WITH PROVABLE CONVERGENCE

Abstract

In a poisoning attack, an adversary with control over a small fraction of the training data attempts to select that data so as to induce a model that misbehaves in a particular way desired by the adversary, such as misclassifying certain inputs. We propose an efficient poisoning attack, based on online convex optimization, that can target a desired model. Unlike previous model-targeted poisoning attacks, our attack comes with provable convergence to any achievable target classifier: the distance from the induced classifier to the target classifier is inversely proportional to the square root of the number of poisoning points. We also provide a lower bound on the minimum number of poisoning points needed to achieve a given target classifier. Our attack is the first model-targeted poisoning attack with provable convergence, and in our experiments it either exceeds or matches the best state-of-the-art attacks in both attack success rate and distance to the target model. In addition, as an online attack, ours can incrementally determine nearly optimal poisoning points.

1. INTRODUCTION

State-of-the-art machine learning models require a large amount of labeled training data, which often depends on collecting data and labels from untrusted sources. A typical application is email spam filtering, where a spam detector filters out spam messages based on features (e.g., presence of certain words) and periodically updates the model based on newly received emails labeled by users. In such a setting, spammers can generate "non-spam" messages by injecting non-related words or benign words, and when models are trained on these "non-spam" messages, the filtering accuracy will drop significantly (Lowd & Meek, 2005) . Such attacks are known as poisoning attacks, and a training process that collects labels or data from untrusted sources is potentially vulnerable to them. Poisoning attacks can be categorized into objective-driven attacks and model-targeted attacks depending on whether a target model is considered in the attack process. Objective-driven attacks have a specific attacker objective and aim to achieve the attack objective by generating the poisoning points; model-targeted attacks have a specific target classifier in mind and aim to induce that target classifier by generating a minimal number of poisoning points. Objective-driven attacks are most commonly studied in the existing literature. The attacker objective is typically one of two extremes: indiscriminate attacks, where the adversary's goal is simply to decrease the overall accuracy of the model (Biggio et al., 2012; Xiao et al., 2012; Mei & Zhu, 2015b; Steinhardt et al., 2017; Koh et al., 2018) ; and instance-targeted attacks, where the goal is to produce a classifier that misclassifies a particular known input (Shafahi et al., 2018; Zhu et al., 2019; Koh & Liang, 2017) . Recently, Jagielski et al. 
(2019) introduced a more realistic attacker objective known as a subpopulation attack, where the goal is to increase the error rate or obtain a particular output for a defined subpopulation of the data distribution. Attacker objectives for realistic attacks are diverse, and designing a unified and effective attack strategy for different attacker objectives is hard. Gradient-based local optimization is most commonly used to construct poisoning points for a particular attacker objective (Biggio et al., 2012; Xiao et al., 2012; Mei & Zhu, 2015b; Koh & Liang, 2017; Shafahi et al., 2018; Zhu et al., 2019). Although these attacks can be modified to fit other attacker objectives, since they are based on local optimization techniques they can easily get stuck in bad local optima and fail to find effective sets of poisoning points (Steinhardt et al., 2017; Koh et al., 2018). To circumvent the issue of local optima, Steinhardt et al. (2017) formulate an indiscriminate attack as a min-max optimization problem and solve it efficiently using online convex optimization techniques. However, the strong min-max attack only applies to the indiscriminate setting. In contrast, model-targeted attacks incorporate the attacker objective into a target model, and hence the target model can reflect any attacker objective. Thus, the same model-targeted attack methods can be directly applied to a range of indiscriminate and subpopulation attacks just by finding a suitable target model. Mei & Zhu (2015b) first introduced a target model into a poisoning attack, but their attack is still based on gradient-based local optimization techniques and suffers from bad local optima (Steinhardt et al., 2017; Koh et al., 2018). Koh et al. (2018) proposed the KKT attack, which converts the complicated bi-level optimization into a simple convex optimization problem using the KKT conditions, avoiding the local optima issues.
However, their attack only works for margin-based losses and does not provide any guarantee on the number of poisoning points required to converge to the target classifier. In this work, we focus on model-targeted attacks and aim to understand the feasibility of a poisoning adversary inducing any target model. In particular, we find both theoretical and empirical bounds on the sufficient (and necessary) number of poisoning points to get close to a specific target classifier. Contributions. Our main contributions are a principled and general model-targeted poisoning attack strategy, along with a proof that the model it induces converges to the target model. Our poisoning method takes as input a target model and produces a set of poisoning points. We prove that the model induced by training on the original training data with these points added converges to the target classifier as the number of poisoning points increases, given that the loss function is convex and proper regularization is adopted in training (Theorem 4.1). Previous model-targeted attacks lack such convergence guarantees. We then prove a lower bound on the minimum number of poisoning points needed to reach the target model (Theorem 4.2), given that the loss function for empirical risk minimization is convex. This lower bound can be used to estimate the optimality of model-targeted poisoning attacks and also indicates the intrinsic hardness of attacking different targets. Our attack is also efficient in the incremental poisoning scenario, as it works in an online fashion and can incrementally find poisoning points that are nearly optimal. Previous model-targeted attacks work with a fixed number of poisoning points and need to know the poisoning budget in advance. We run experiments to compare our attack to the state-of-the-art model-targeted attack (Koh et al., 2018).
We first evaluate the convergence of our attack to the target model and find that, with the same number of poisoning points, classifiers induced by our attack are closer to the target models than those of the best known attack, for all the target classifiers we tried. Then, we evaluate the success rate of our attack, and find that it outperforms the state-of-the-art in the more realistic subpopulation attack scenario, and has comparable performance in the conventional indiscriminate attack scenario (Section 5).

2. PROBLEM SETUP

The poisoning attack proposed in this paper applies to multi-class prediction tasks and regression problems (by treating the response variable as an additional data feature), but for simplicity of presentation we consider a binary prediction task, h : X → Y, where X ⊆ R^d and Y = {+1, -1}. The prediction model h is characterized by parameters θ ∈ Θ ⊆ R^d. We define the non-negative convex loss on an individual point (x, y) as l(θ; x, y) (e.g., hinge loss for an SVM model), and the empirical loss over a set of points A as L(θ; A) = Σ_{(x,y)∈A} l(θ; x, y). We adopt the game-theoretic formalization of the poisoning attack process from Steinhardt et al. (2017) to describe our model-targeted attack scenario: 1) a clean training set D_c is collected; 2) the adversary generates a target classifier θ_p; 3) the adversary generates a set of poisoning points D_p and injects it into the training data; 4) the victim trains a classifier θ_atk on D_c ∪ D_p. The adversary's goal is that the induced classifier, θ_atk, is close to the desired target classifier, θ_p (Section 4.2 discusses how this distance is measured). Step 2 corresponds to the target classifier generation process. Our attack works for any target classifier, and in this paper we do not focus on the question of how to find the best target classifier to achieve a particular adversarial goal, but simply adopt the heuristic target classifier generation process from Koh et al. (2018). Step 3 corresponds to our model-targeted poisoning attack, and is also the main contribution of the paper. We assume the model builder trains a model through empirical risk minimization (ERM), and that the training process details are known to the attacker: θ_c = argmin_{θ∈Θ} (1/|D_c|) L(θ; D_c) + C_R · R(θ), where R(θ) is the regularization function (e.g., (1/2)||θ||²_2 for an SVM model). Threat Model. We assume an adversary with full knowledge of the training data, model space, and training process. Although this may be unrealistic for many scenarios, this setting allows us to focus on a particular aspect of poisoning attacks, and is the setting used in many prior works (Biggio et al., 2011; Mei & Zhu, 2015b; Steinhardt et al., 2017; Koh et al., 2018; Shafahi et al., 2018).
We assume an addition-only attack where the attacker only adds poisoning points into the clean training set. A stronger attacker may be able to modify or remove existing points, but this typically requires administrative access to the system. The added points are unconstrained, other than being valid elements of the input space. They can have arbitrary features and labels, which enables us to perform a worst-case analysis of the robustness of models against addition-only poisoning attacks. Although some previous works also allow arbitrary selection of the poisoning points (Biggio et al., 2011; Mei & Zhu, 2015b; Steinhardt et al., 2017; Koh et al., 2018), others put different restrictions on the poisoning points. A clean-label attack assumes adversaries can only perturb the features of the data, but the label is given by an oracle labeler (Koh & Liang, 2017; Shafahi et al., 2018; Zhu et al., 2019; Huang et al., 2020). In label-flipping attacks, adversaries are only allowed to change the labels (Biggio et al., 2011; Xiao et al., 2012; 2015; Jagielski et al., 2019). These restricted attacks are weaker than poisoning attacks without restrictions (Koh et al., 2018; Hong et al., 2020).
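To make the victim's training procedure concrete, the regularized ERM above can be sketched as follows for a linear SVM with hinge loss. The function name, hyperparameter values, and the plain subgradient-descent solver are our illustrative choices, not part of the paper's setup:

```python
import numpy as np

def train_erm(X, y, C_R=0.1, lr=0.1, epochs=200):
    """Regularized ERM sketch: minimize
    (1/|D_c|) * sum_i max(0, 1 - y_i <theta, x_i>) + C_R * 0.5 * ||theta||^2.
    A stand-in for the victim's training routine (illustrative solver)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ theta)
        active = margins < 1          # points with nonzero hinge subgradient
        grad = -(y[active, None] * X[active]).sum(axis=0) / n
        grad += C_R * theta           # gradient of the strongly convex regularizer
        theta -= lr * grad
    return theta
```

On a tiny linearly separable dataset, the returned θ_c classifies the training points correctly, matching the role of the clean model θ_c in the formalization above.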

3. RELATED WORK

The most commonly used poisoning strategy is the gradient-based attack, which iteratively modifies a candidate poisoning point (x, ŷ) in the set D_p based on the test loss defined on x (keeping ŷ fixed). This kind of attack was first studied on SVM models (Biggio et al., 2012), later extended to linear and logistic regression (Mei & Zhu, 2015b), and recently to larger neural network models (Koh & Liang, 2017; Yang et al., 2017; Muñoz-González et al., 2017; Shafahi et al., 2018; Zhu et al., 2019; Huang et al., 2020). Jagielski et al. (2018) also studied gradient attacks and principled defenses for linear regression tasks. Their work studies linear regression, while in this paper we mainly focus on binary classification, although our attack can also be extended to regression tasks. More importantly, our attack aims to induce a target model by generating poisoning points, while the attack of Jagielski et al. (2018) tries to increase the mean squared error of the linear regression task with a fixed poisoning budget. In addition to classification and regression tasks, gradient-based poisoning attacks have also been applied to topic modeling (Mei & Zhu, 2015a), collaborative filtering (Li et al., 2016), and algorithmic fairness (Solans et al., 2020). Besides gradient-based attacks, researchers have also used generative adversarial networks to craft poisoning points efficiently for larger neural networks; however, the effectiveness of these attacks is limited (Yang et al., 2017; Muñoz-González et al., 2019). The strongest attacks so far are the KKT attack (Koh et al., 2018) and the min-max attack (Steinhardt et al., 2017; Koh et al., 2018). However, the KKT attack does not scale well to multi-class classification and is limited to margin-based losses (Koh et al., 2018). The min-max attack only works in the indiscriminate attack setting, but additionally provides a certificate on worst-case test loss for a fixed number of poisoning points.
We are also inspired by Steinhardt et al. (2017) in adopting online convex optimization to instantiate our model-targeted attack, but our attack handles a more general attack scenario. We also distinguish our setting from poisoning attacks against online learning (Wang & Chaudhuri, 2018): attacks against online learning consider a setting where training data arrives in a streaming manner, while we consider the offline setting with fixed training data. Another line of work studies "targeted" poisoning attacks where an adversary is guaranteed to increase the probability of an arbitrary "bad" property (Mahloujifar et al., 2019a; 2017; 2019b), as long as that property has some non-negligible chance of naturally occurring. These attacks cannot be applied in the model-targeted setting, as the probability of naturally producing a specific target model is often negligible. Related to our Theorem 4.2, Ma et al. (2019) also derived a lower bound on the number of poisoning points needed to induce a target model, but their lower bound only applies when differential privacy is deployed during the model training process (which hurts model utility), a different problem setting from ours.

4. POISONING ATTACK WITH A TARGET MODEL

Our new poisoning attack determines a target model and selects poisoning points to achieve that target model. The target model generation is not our focus, and we adopt the heuristic approach proposed by Koh et al. (2018). For the new poisoning attack, we first present the algorithm that generates the poisoning points in Section 4.1, and then prove in Section 4.2 that the generated poisoning points, once added to the clean data, produce a classifier that asymptotically converges to the target classifier.

4.1. MODEL-TARGETED POISONING WITH ONLINE LEARNING

The main idea of our model-targeted poisoning attack, as outlined in Algorithm 1, is to sequentially add to the training set the point that has the maximum loss difference between the intermediate model obtained so far and the target model; by training models on the updated training set, we minimize the gap in loss between the intermediate classifier and the target classifier. Repeating the process eventually generates classifiers that have a loss distribution similar to that of the target classifier. We show in Section 4.2 why a similar loss distribution implies convergence. In each iteration, the attacker first trains an intermediate classifier θ_t on the current training set D_c ∪ D_p (Line 3). The adversary then searches for the point that maximizes the loss difference between θ_t and θ_p (Line 4). After the point of maximum loss difference is found, it is added to the poisoning set D_p (Line 5). The whole process repeats until the stop condition is satisfied in Line 2. The stop condition is flexible and can take various forms: 1) the adversary has a budget T on the number of poisoning points, and the algorithm halts after running for T iterations; 2) the intermediate classifier θ_t is closer to the target classifier (than a preset threshold ε) in terms of the maximum loss difference; more details regarding this distance metric are given in Section 4.2; 3) the adversary has some requirement on accuracy, and the algorithm terminates when θ_t satisfies the accuracy requirement. Since we focus on producing a classifier close to the target model, we adopt the second stop criterion, which measures distance with respect to the maximum loss difference, and report results based on this criterion in Section 5. A nice property of Algorithm 1 is that the classifier θ_atk trained on D_c ∪ D_p is close to the target model θ_p and asymptotically converges to θ_p. Details of the convergence are given in the next section. The algorithm may appear to be slow, particularly for larger models, due to the requirement of repeatedly training a model in Line 3.
However, this is not a serious issue. First, as shown in the next section, the algorithm is an online optimization process and Line 3 corresponds to solving the online optimization problem exactly. In practice, the very efficient online gradient descent method is often used to solve the problem approximately, with the same asymptotic performance (Shalev-Shwartz, 2012). Second, if we solve the optimization problem exactly, we can add multiple copies of (x*, y*) into D_p each iteration. This reduces the overall number of iterations, and hence the number of model retrainings; the proof of convergence is similar. For simplicity in interpreting the results, we do not use this in our experiments and add only one copy of (x*, y*) each iteration. However, we also tested the performance of adding two copies of (x*, y*) and found that the attack results are nearly the same while the efficiency is improved significantly. For example, in the experiments we tried on the MNIST 1-7 dataset, by adding 2 copies of points, with the same number of poisoning points, the attack success rate decreases by at most 0.7% while the execution time is reduced approximately by half.
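As a concrete illustration, the loop of Algorithm 1 can be sketched as below. The function names are ours, the victim's trainer is passed in as a black box, and the search over all of X × Y for the maximum loss difference is approximated by a finite candidate pool, so this is a simplified sketch rather than the exact attack:

```python
import numpy as np

def hinge(theta, x, y):
    """Hinge loss l(theta; x, y) for a linear model."""
    return max(0.0, 1.0 - y * float(x @ theta))

def model_targeted_attack(train_model, D_c, theta_p, candidates,
                          eps=0.01, max_iter=500):
    """Sketch of the attack loop described above (our naming)."""
    X_c, y_c = D_c
    D_p = []
    for _ in range(max_iter):
        # Line 3: retrain on the clean data plus the current poisoning points.
        if D_p:
            X = np.vstack([X_c] + [x[None, :] for x, _ in D_p])
            y = np.concatenate([y_c, np.array([lab for _, lab in D_p])])
        else:
            X, y = X_c, y_c
        theta_t = train_model(X, y)
        # Line 4: candidate maximizing l(theta_t; x, y) - l(theta_p; x, y).
        diffs = [hinge(theta_t, x, lab) - hinge(theta_p, x, lab)
                 for x, lab in candidates]
        best = int(np.argmax(diffs))
        # Stop condition 2: intermediate model is eps-close to the target
        # (with the max over X x Y approximated by the candidate pool).
        if diffs[best] <= eps:
            break
        D_p.append(candidates[best])  # Line 5
    return D_p
```

The retraining oracle `train_model` here would be the victim's ERM procedure; restricting the search to a candidate pool is our simplification of the exact concave maximization the paper assumes.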

4.2. CONVERGENCE OF OUR POISONING ATTACK

Before proving the convergence of Algorithm 1, we need to measure the distance of the model θ_atk trained on D_c ∪ D_p to the target model θ_p. First, we define a general closeness measure based on their prediction performance, which we will use to state our convergence theorem: Definition 1 (Loss-based distance and ε-closeness). For two models θ_1 and θ_2, a space X × Y, and a loss function l(θ; x, y), we define the loss-based distance D_{l,X,Y} : Θ × Θ → R as D_{l,X,Y}(θ_1, θ_2) = max_{(x,y)∈X×Y} ( l(θ_1; x, y) − l(θ_2; x, y) ), and we say model θ_1 is ε-close to model θ_2 when the loss-based distance from θ_1 to θ_2 is upper bounded by ε. Why is loss-based distance a meaningful notion of closeness? We argue that this notion captures the "behavioral" distance between two models. Namely, if θ_1 is ε-close (as measured by loss-based distance) to θ_2 and vice versa, then θ_1 and θ_2 have almost equal loss on all points, meaning that they have almost the same behavior across the whole space. Note that our general definition of loss-based distance does not have the symmetry property of metrics and hence is not a metric. However, it has some other properties of metrics in the space of attainable models. For example, if some model θ is attainable using ERM, no model can have negative distance to it. To further show the value of this distance notion, in Appendix B we demonstrate an O(ε) upper bound on the l_1-norm of the difference between two models that are ε-close with respect to the loss-based distance, for the special case of hinge loss. Hinge loss also satisfies bi-directional closeness: if θ_1 is ε-close to θ_2, then θ_2 is O(ε)-close to θ_1 (Corollary B.2.1); the proof details can be found in Appendix B. In the rest of the paper, we use the terms ε-close or ε-closeness to denote that a model is within ε of another model under the loss-based distance.
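In code, the loss-based distance for hinge loss can be approximated by taking the maximum over a finite candidate pool instead of all of X × Y (the pool, and the function names, are illustrative simplifications we introduce here):

```python
import numpy as np

def hinge_loss(theta, x, y):
    """Hinge loss l(theta; x, y) for a linear model."""
    return max(0.0, 1.0 - y * float(x @ theta))

def loss_based_distance(theta1, theta2, candidates):
    """D_{l,X,Y}(theta1, theta2) = max over (x, y) of
    l(theta1; x, y) - l(theta2; x, y), with the max over the whole
    space X x Y approximated by a finite candidate pool."""
    return max(hinge_loss(theta1, x, y) - hinge_loss(theta2, x, y)
               for x, y in candidates)
```

Note that swapping the arguments generally changes the value, matching the observation above that the loss-based distance is not symmetric and hence not a metric.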
Our convergence theorem uses the loss-based distance to establish that the attack of Algorithm 1 converges to the target classifier: Theorem 4.1. After at most T steps, Algorithm 1 will produce a poisoning set D_p such that the classifier trained on D_c ∪ D_p is ε-close to θ_p with respect to the loss-based distance D_{l,X,Y}, for ε = ( α(T) + L(θ_p; D_c) − L(θ_c; D_c) ) / (T · γ), where γ is a constant for a given θ_p and classification task, and α(T) is the regret of the online algorithm when the loss function used for training is convex. Remark 1. Online learning algorithms with sublinear regret bounds can be applied to show convergence. Here, we adopt results from McMahan (2017). Specifically, α(T) is of order O(log T), and we have ε ≤ O(log T / T) when the loss function is additionally Lipschitz continuous and the regularizer R(θ) is strongly convex; thus ε → 0 as T → +∞. α(T) is also of order O(log T) when the loss function used for training is strongly convex and the regularizer is convex. Proof idea. The full proof of Theorem 4.1 is in Appendix A; here we summarize the high-level idea. The key idea is to frame the poisoning problem as an online learning problem. In this formulation, the ith step of the online learning problem corresponds to the ith poisoning point (x_i, y_i). In particular, the loss function at iteration i of the online learning problem is set to l(·; x_i, y_i). Then, we show that by defining the parameters of the online learning problem in a careful way, the output of the follow-the-leader (FTL) algorithm (Shalev-Shwartz, 2012) at iteration i is a model identical to one trained on a dataset consisting of the clean points and the first i − 1 poisoning points.
Based on the way the poisoning points are selected, we can show that at the ith iteration the maximum loss difference between the target model and the best induced model so far is smaller than the regret of the FTL algorithm divided by the number of poisoning points. The convergence bound of Theorem 4.1 thus boils down to a regret analysis of the algorithm for the given loss function. Since we assume the loss function is convex with a strongly convex regularizer (or strongly convex with a convex regularizer), we can show that the regret is bounded by O(log T), and hence the loss-based distance between the induced model and the target model converges to 0. Implications of Theorem 4.1. The theorem says that the loss-based distance of the model trained on D_c ∪ D_p to the target model depends on the loss difference between the target model and the clean model θ_c (trained on D_c) evaluated on D_c, and is inversely proportional to the number of poisoning points. Therefore, it implies 1) if the target classifier θ_p has lower loss on D_c, then it is easier to achieve the target model, and 2) with more poisoning points, we get closer to the target classifier and our attack becomes more effective. The theorem also justifies the motivation behind the heuristic method in Koh et al. (2018) of selecting a target classifier with lower loss on the clean data. For the indiscriminate attack scenario, we also improve the heuristic approach by adaptively updating the model, producing target classifiers with much lower loss on the clean set. This helps to empirically validate our theorem. Details of the original and improved heuristic approaches, and relevant experiments, are in Appendix D.1.
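To make the shape of the bound concrete, the expression ε = (α(T) + L(θ_p; D_c) − L(θ_c; D_c)) / (T · γ) can be evaluated numerically with a sublinear regret α(T) = c · log T; the constant c and the inputs below are illustrative placeholders, not values from the paper:

```python
import math

def epsilon_bound(T, delta_L, gamma, c=1.0):
    """Theorem 4.1 bound with alpha(T) = c * log T (illustrative regret).
    delta_L stands for L(theta_p; D_c) - L(theta_c; D_c)."""
    alpha = c * math.log(T)
    return (alpha + delta_L) / (T * gamma)
```

Evaluating this for growing T, or for a target with smaller delta_L, reproduces the two implications stated above: the bound shrinks as more poisoning points are added, and it is smaller for targets with lower loss on the clean data.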

4.3. LOWER BOUND ON THE NUMBER OF POISONING POINTS

We first provide the lower bound on the number of poisoning points required to produce the target classifier in the addition-only setting (Theorem 4.2), and then explain how the lower bound estimation can be incorporated into Algorithm 1. The intuition behind the theorem is that when the number of poisoning points added to the clean training set is smaller than the lower bound, there always exists a classifier θ with lower loss than θ_p, and hence the target classifier cannot be attained. Theorem 4.2 (Lower Bound). Given a target classifier θ_p, to reproduce θ_p by adding the poisoning set D_p into D_c (with N = |D_c|), the number of poisoning points |D_p| cannot be lower than sup_θ z(θ), where z(θ) = [ L(θ_p; D_c) − L(θ; D_c) + N·C_R·(R(θ_p) − R(θ)) ] / [ sup_{x,y} ( l(θ; x, y) − l(θ_p; x, y) ) + C_R·(R(θ) − R(θ_p)) ]. Corollary 4.2.1. If we further assume bi-directional closeness in the loss-based distance, we can also derive a lower bound on the number of poisoning points needed to induce models that are ε-close to the target model. More precisely, if θ_1 being ε-close to θ_2 implies that θ_2 is k·ε-close to θ_1, then |D_p| cannot be lower than sup_θ z_ε(θ), where z_ε(θ) = [ L(θ_p; D_c) − L(θ; D_c) − N·C_R·R* − N·k·ε ] / [ sup_{x,y} ( l(θ; x, y) − l(θ_p; x, y) ) + C_R·R* + k·ε ], and R* is an upper bound on the nonnegative regularizer R(θ). The lower bound of Theorem 4.2 (and of Corollary 4.2.1) can easily be incorporated into Algorithm 1 to obtain a tighter theoretical lower bound: we simply check each intermediate classifier θ_t produced during the attack and compute the bound for the pair (θ_t, θ_p). Algorithm 1 then additionally returns the highest lower bound computed during the poisoning procedure.
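The lower bound of Theorem 4.2 can be evaluated on the intermediate classifiers as sketched below for hinge loss and an L2 regularizer; approximating the sup over (x, y) with a finite candidate pool, and all function names, are our illustrative choices:

```python
import numpy as np

def hinge(theta, x, y):
    return max(0.0, 1.0 - y * float(x @ theta))

def total_loss(theta, D):
    """L(theta; D) = sum of individual hinge losses over the set D."""
    return sum(hinge(theta, x, y) for x, y in D)

def reg(theta):
    """R(theta) = 0.5 * ||theta||^2 (SVM-style regularizer)."""
    return 0.5 * float(theta @ theta)

def lower_bound(theta_p, intermediates, D_c, candidates, C_R=0.1):
    """Theorem 4.2: |D_p| >= sup_theta z(theta), evaluated on the
    intermediate classifiers encountered during the attack, with the
    sup over (x, y) approximated by a finite candidate pool."""
    N = len(D_c)
    best = 0.0
    for theta in intermediates:
        num = (total_loss(theta_p, D_c) - total_loss(theta, D_c)
               + N * C_R * (reg(theta_p) - reg(theta)))
        den = (max(hinge(theta, x, y) - hinge(theta_p, x, y)
                   for x, y in candidates)
               + C_R * (reg(theta) - reg(theta_p)))
        if den > 0:
            best = max(best, num / den)
    return best
```

Each intermediate classifier yields one value of z(θ), and the returned maximum is the tightest bound found, matching how Algorithm 1 would report it.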

5. EXPERIMENTS

We first describe our experimental setup: the datasets, models, attacks, and target classifiers. Next, we present the experimental results, showing the convergence of Algorithm 1, the comparison of attack success rates to the state-of-the-art poisoning attack, and the theoretical lower bound for inducing a given target classifier together with its gap to the number of poisoning points used by our attack. We are most interested in subpopulation attacks, since they correspond to the more realistic attacker goal of impacting the classifier outputs for a targeted subpopulation. Therefore, in the main body, we present results for an SVM model on the Adult dataset (Dua & Graff, 2017) in the subpopulation poisoning scenario. For completeness, we also evaluate our attack on an SVM model on the MNIST 1-7 dataset in the indiscriminate poisoning scenario, but defer details of those experiments to Appendix D. Our findings for the indiscriminate attacks are that the attack gradually and consistently converges to the target model in terms of the maximum loss difference and the Euclidean distance to the target, with attack success rates comparable to the state-of-the-art attack (unlike the subpopulation attacks, where our attack produces superior results). To further verify the broad effectiveness of our attack, we also evaluate it on an additional dataset (Dogfish) and model (logistic regression); more details can be found in Appendix F. Dataset, Model and Attacks. For the subpopulation attack experiments, we use the Adult dataset (Dua & Graff, 2017), which was used for evaluation by the first subpopulation attack paper (Jagielski et al., 2019). We downsampled the Adult dataset to ensure it is class-balanced, ending up with 15,682 training and 7,692 test examples. We conduct experiments on a linear SVM model and compare our model-targeted poisoning attack in Algorithm 1 to the state-of-the-art KKT attack (Koh et al., 2018).
We do not include the model-targeted attack from Mei & Zhu (2015b) because it underperforms the KKT attack (Koh et al., 2018). We also do not include objective-driven attacks because our main goal here is to evaluate how well our attack approaches a given target model, across a range of target models. Model-targeted attacks can be compared to objective-driven attacks with respect to a given attacker objective by choosing the target model in a careful way; we show some heuristics for choosing such target models and comparisons to some objective-driven attacks in Appendix E. Both our attack and the KKT attack take as input a target classifier and the original training data, and output a set of poisoning points selected with the goal that the induced classifier is as close as possible to the target classifier. We compare the effectiveness of the attacks in selecting poisoning points that converge to a given target classifier by testing both attacks with the same target model. The KKT attack requires a target number of poisoning points as an input, while our attack is more flexible and can take either a target number of poisoning points or a threshold ε for the loss-based distance to the target model. Since we do not know in advance the number of poisoning points needed for the KKT attack to reach some attacker goal, we first run our attack and produce a classifier that satisfies the selected ε-close distance threshold. The loss function is the hinge loss, since we target an SVM model in our experiments, and we set ε = 0.01 for all these experiments. Then, we use the size of the poisoning set returned by our attack (denoted n_p) as the input to the KKT attack for the target number of poisoning points. We also compare the two attacks with varying numbers of poisoning points up to n_p. For the KKT attack, the entire optimization process must be rerun whenever the target number of poisoning points changes.
Hence, it is infeasible to evaluate the KKT attack on many different poisoning set sizes. In our experiments, we run the KKT attack on five poisoning set sizes: 0.2·n_p, 0.4·n_p, 0.6·n_p, 0.8·n_p, and n_p. In contrast, we simply run our attack for iterations up to the maximum number of poisoning points, collecting a data point at each iteration up to n_p. Subpopulations. We identify the subpopulations for the Adult dataset using k-means clustering (ClusterMatch (Jagielski et al., 2019)) to obtain different clusters (k = 20 in our case). For each cluster, we select instances with label "<=50K" to form the subpopulation (indicating that all instances in the subpopulation are in the low income group). This way of defining subpopulations is rather arbitrary (in contrast to a more likely attack goal, which would select subpopulations based on demographic characteristics), but it simplifies the analysis. From the 20 subpopulations obtained, we select the three subpopulations with the highest test accuracy on the clean model; they all have 100% test accuracy, indicating that all instances in these subpopulations are correctly classified as low income. This enables us to use "attack success rate" and "accuracy" without any ambiguity on the subpopulation: for each of our subpopulations, all instances are originally classified as low income, and the simulated attacker's goal is to have them classified as high income. For each subpopulation, we use the heuristic approach from Koh et al. (2018) to generate a target classifier that has 0% accuracy (100% attacker success) on the subpopulation, indicating that all subpopulation instances are now classified as high income. Convergence. Figure 1 shows the convergence of Algorithm 1 using both the maximum loss difference and the Euclidean distance to the target. The maximum number of poisoning points (n_p) for the experiments is obtained when the classifier from Algorithm 1 is 0.01-close to the target classifier.
Our attack steadily reduces the maximum loss difference and the Euclidean distance to the target model, in contrast to the KKT attack, which does not reliably converge towards the target model. Concretely, at the maximum number of poisoning points in Figure 1, both the maximum loss difference and the Euclidean distance of our attack (to the target) are less than 2% of the corresponding distances of the KKT attack. Attack Success. Next, we compare the classifiers induced by the two attacks in terms of the attacker's goal of reducing the test accuracy on the subpopulation. Figure 2 shows the accuracy results for the three subpopulations. For each test, the maximum number of poisoning points is obtained by running our attack with a target of 0.01-closeness (in loss-based distance). For the three subpopulations, at the maximum number of poisoning points, our attack is much more successful than the KKT attack: the induced classifiers have 0.5% accuracy compared to 15.4% for KKT on subpopulation 1, 0.0% compared to 6.9% on subpopulation 2, and 0.3% compared to 20.1% on subpopulation 3. Near Optimality of Our Attack. To show the optimality of our attack, we calculate a lower bound on the number of poisoning points needed to induce the model that our attack's poisoning points induce. We calculate this lower bound using Theorem 4.2 (details in Section 4.3). Note that Theorem 4.2 provides a valid lower bound based on any intermediate model; to get a lower bound on the number of poisoning points, we only need to apply Theorem 4.2 to the encountered intermediate models and report the best bound. We do this by running Algorithm 1 using the induced model (rather than the previous target model) as the target model, terminating when the induced classifier is 0.01-close to the given target model, and considering all the intermediate classifiers that the algorithm induced across the iterations.
Our calculated lower bound in Table 1 shows that the gap between the lower bound and the number of poisoning points used is relatively small. This means our attack is nearly optimal in terms of minimizing the number of poisoning points needed to induce the target classifier.
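The attack loop evaluated throughout this section can be sketched as follows. This is a hedged illustration of the structure of Algorithm 1 (retrain on the clean data plus the poisons chosen so far, then append the point with maximum loss difference to the target): a finite candidate pool stands in for the sup over (x, y), and a simple subgradient trainer stands in for the SVM solver, neither of which is the paper's exact implementation.

```python
import numpy as np

def hinge(theta, X, y):
    return np.maximum(1 - y * (X @ theta), 0.0)

def train(X, y, C_R, epochs=200, lr=0.05):
    # Subgradient descent on (1/n) sum hinge + (C_R/2) ||theta||^2.
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        margin = y * (X @ theta)
        g = -(y[margin < 1, None] * X[margin < 1]).sum(0) / len(X) + C_R * theta
        theta -= lr * g
    return theta

def model_targeted_attack(Xc, yc, theta_p, candidates, cand_y, T, C_R=0.1):
    """Run T iterations: retrain on clean + poisons, add max-loss-diff point."""
    Xp, yp, models = [], [], []
    for _ in range(T):
        X = np.vstack([Xc] + Xp) if Xp else Xc
        y = np.concatenate([yc] + yp) if yp else yc
        theta = train(X, y, C_R)
        models.append(theta)
        # argmax over the candidate pool of l(theta; x, y) - l(theta_p; x, y)
        diffs = hinge(theta, candidates, cand_y) - hinge(theta_p, candidates, cand_y)
        j = int(diffs.argmax())
        Xp.append(candidates[j:j+1])
        yp.append(cand_y[j:j+1])
    return models, np.vstack(Xp), np.concatenate(yp)
```

In the paper the loop instead terminates once the induced classifier is ε-close to the target in the loss-based distance.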

6. CONCLUSION AND DISCUSSION

We propose a general poisoning framework with provable guarantees to reach any achievable target classifier, along with a lower bound on the number of poisoning points needed. Our attack is a generic tool that first captures the goal of the adversary as a target model, and then focuses on the power of attacks to induce that model. This separation enables future work to explore the effectiveness of poisoning attacks corresponding to different adversarial goals. Our framework also applies in scenarios where adversaries first remove points and then add new points into the training set. We have not considered defenses in this work, and it is important to study the effectiveness of our attack against data poisoning defenses. Defenses may be designed to limit the search space of the points with maximum loss difference, increasing the number of poisoning points needed. One limitation of our framework is the requirement that the difference of loss functions be concave in order to efficiently search for its maximum value. When concavity does not hold, our approach might still be effective by using local optimization techniques to search for poisoning points with (approximately) maximum loss difference, and we have demonstrated this for the logistic loss. More formally, if the approximate maximum loss difference l found via local optimization is within a constant factor of the globally optimal value l* (i.e., l ≥ α • l*, 0 < α < 1), then we still enjoy similar convergence guarantees. It is important to note that the convergence property of our attack holds (with a strongly convex regularizer) for any Lipschitz and convex loss function, and does not require the loss difference to be concave. The theoretical guarantees in this paper do not apply to non-convex models, although it might be possible to empirically apply our attack to such models.
Incorporating online learning for non-convex functions might be one possible path to extend our theoretical analysis into non-convex settings.
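For the logistic-loss case mentioned above, the local search for a point with (approximately) maximum loss difference can be sketched as projected gradient ascent over the input; the feasible box [0, 1]^d, the step size, and the iteration count are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def logistic_loss(theta, x, y):
    # l(theta; x, y) = log(1 + exp(-y * <theta, x>))
    return np.log1p(np.exp(-y * (theta @ x)))

def grad_x(theta, x, y):
    # d/dx log(1 + exp(-y <theta, x>)) = -y * theta * sigmoid(-y <theta, x>)
    s = -y / (1.0 + np.exp(y * (theta @ x)))
    return s * theta

def approx_max_loss_diff(theta, theta_p, d, y, steps=200, lr=0.1, seed=0):
    """Projected gradient ascent on l(theta; x, y) - l(theta_p; x, y) over [0,1]^d."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 1, size=d)
    for _ in range(steps):
        g = grad_x(theta, x, y) - grad_x(theta_p, x, y)
        x = np.clip(x + lr * g, 0.0, 1.0)  # ascent step, projected to the box
    return x, logistic_loss(theta, x, y) - logistic_loss(theta_p, x, y)
```

Because the loss difference is not concave in x for logistic loss, this only finds an approximate maximizer; as noted above, an α-approximation still preserves a similar convergence guarantee.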

A PROOFS

In this section, we provide the proofs of the main theorems shown in this paper. For convenience, we restate the theorems below, referencing the main paper. Before proving the main theorem, we introduce two new definitions and several lemmas to assist with the proof.

Definition 2 (Attainable models). We say θ is C_R-attainable with respect to loss function l and regularization function R if there exists a training set D such that θ = arg min_{θ∈Θ} (1/|D|) • L(θ; D) + C_R • R(θ).

Lemma A.1. Let θ_1 and θ_2 be two C_R-attainable parameters for some C_R > 0 such that R(θ_1) > R(θ_2). Then, sup_{x,y} (l(θ_2; x, y) − l(θ_1; x, y)) / (R(θ_1) − R(θ_2)) > C_R.

Proof. Consider any attainable pair (θ_1, θ_2) such that R(θ_1) > R(θ_2), and let D_1 be a training set for which the training algorithm produces the unique minimizer θ_1. Namely, θ_1 = arg min_θ (1/|D_1|) • L(θ; D_1) + C_R • R(θ). Since θ_1 uniquely minimizes the total loss on D_1, we have (1/|D_1|) • L(θ_2; D_1) + C_R • R(θ_2) > (1/|D_1|) • L(θ_1; D_1) + C_R • R(θ_1). By rearranging the above inequality and by an averaging argument, we have sup_{x,y} (l(θ_2; x, y) − l(θ_1; x, y)) ≥ (1/|D_1|) • L(θ_2; D_1) − (1/|D_1|) • L(θ_1; D_1) > C_R • (R(θ_1) − R(θ_2)). Now since R(θ_1) > R(θ_2), we have sup_{x,y} (l(θ_2; x, y) − l(θ_1; x, y)) / (R(θ_1) − R(θ_2)) > C_R.

Lemma A.2. Let F be the family of all C_R-attainable models. For any θ_1 ∈ F, there is a constant γ such that for all θ_2 ∈ F we have sup_{x,y} (l(θ_2; x, y) − l(θ_1; x, y)) + C_R • (R(θ_2) − R(θ_1)) > γ • sup_{x,y} (l(θ_2; x, y) − l(θ_1; x, y)), where γ is a positive constant depending on θ_1, C_R, and other model parameters (fixed for a given classification task).

Proof. We prove the lemma for γ = 1 − C_R/C, where C = inf_{θ_2∈F s.t. R(θ_1)>R(θ_2)} sup_{x,y} (l(θ_2; x, y) − l(θ_1; x, y)) / (R(θ_1) − R(θ_2)). First, note that by Lemma A.1 we have C > C_R ≥ 0, (2) which implies γ is positive.
Now we consider two cases based on the sign of R(θ_2) − R(θ_1).

Case 1: R(θ_2) − R(θ_1) ≥ 0. In this case the inequality is straightforward: sup_{x,y} (l(θ_2; x, y) − l(θ_1; x, y)) + C_R • (R(θ_2) − R(θ_1)) ≥ sup_{x,y} (l(θ_2; x, y) − l(θ_1; x, y)) > (1 − C_R/C) • sup_{x,y} (l(θ_2; x, y) − l(θ_1; x, y)), where the last inequality is based on equation 2.

Case 2: R(θ_2) − R(θ_1) < 0. From the definition of C we have R(θ_1) − R(θ_2) ≤ sup_{x,y} (l(θ_2; x, y) − l(θ_1; x, y)) / C. Equivalently, R(θ_2) − R(θ_1) ≥ −sup_{x,y} (l(θ_2; x, y) − l(θ_1; x, y)) / C. Replacing R(θ_2) − R(θ_1) with the lower bound above completes the proof, namely sup_{x,y} (l(θ_2; x, y) − l(θ_1; x, y)) + C_R • (R(θ_2) − R(θ_1)) ≥ (1 − C_R/C) • sup_{x,y} (l(θ_2; x, y) − l(θ_1; x, y)).

With Definition 2 and the lemmas, we are ready to prove the main theorem. Recall the online learning protocol, in which θ_i = A((θ_0, l_0), . . . , (θ_{i−1}, l_{i−1})) and l_i = S((θ_0, l_0), . . . , (θ_{i−1}, l_{i−1}), θ_i). With the online learning problem set up, we proceed to the main proof, which first describes Algorithm 1 in the FTL framework.

Proof of Theorem 4.1. The FTL framework proceeds by minimizing the sum of all the loss functions incurred during the previous online optimization steps, namely, A_FTL((θ_0, l_0), . . . , (θ_i, l_i)) = arg min_{θ∈Θ} Σ_{j=0}^{i} l_j(θ). Next, we describe how we design the i-th loss function l_i in each round of the online optimization. For the first choice, A_FTL chooses a random model θ_0 ∈ Θ. In the first round (round 0), S_{θ_p} uses the clean training set D_c and the loss is set as S_{θ_p}(θ_0) = l_0(θ) = L(θ; D_c) + N • C_R • R(θ). According to the FTL framework, A_FTL returns the model that minimizes the loss on the clean training set D_c via structural empirical risk minimization. For the subsequent iterations (i ≥ 1), the loss function is defined as follows: given the latest model θ_i, S_{θ_p} first finds (x*_i, y*_i) that maximizes the loss difference between θ_i and the target model θ_p.
Namely, (x*_i, y*_i) = arg max_{(x,y)} (l(θ_i; x, y) − l(θ_p; x, y)), and S_{θ_p} then chooses the i-th loss function as follows: S_{θ_p}((θ_0, l_0), . . . , (θ_{i−1}, l_{i−1}), θ_i) = l_i(θ) = l(θ; x*_i, y*_i) + C_R • R(θ). Now we examine how the FTL framework behaves when working on these loss functions at different iterations. We use D^i_p to denote the set {(x*_1, y*_1), . . . , (x*_i, y*_i)}. We have θ_i = A_FTL((θ_0, l_0), . . . , (θ_{i−1}, l_{i−1})) = arg min_{θ∈Θ} Σ_{j=0}^{i−1} l_j(θ) = arg min_{θ∈Θ} L(θ; D_c) + N • C_R • R(θ) + Σ_{j=1}^{i−1} (l(θ; x*_j, y*_j) + C_R • R(θ)) = arg min_{θ∈Θ} L(θ; D_c ∪ D^{i−1}_p) + (N + i − 1) • C_R • R(θ) = arg min_{θ∈Θ} (1/|D_c ∪ D^{i−1}_p|) • L(θ; D_c ∪ D^{i−1}_p) + C_R • R(θ). This means that the A_FTL algorithm, at each step, trains a new model on the combination of the clean data and the poisoning data so far (i − 1 poisoning points). Now we interpret Regret(A_FTL, S_{θ_p}, T). If we can prove an upper bound on the regret, namely if we show Regret(A_FTL, S_{θ_p}, T) ≤ α(T) for some function α, then we have Σ_{j=0}^{T} l_j(θ_j) − Σ_{j=0}^{T} l_j(θ_p) ≤ Σ_{j=0}^{T} l_j(θ_j) − min_{θ∈Θ} Σ_{j=0}^{T} l_j(θ) ≤ α(T), which implies Σ_{j=0}^{T} l_j(θ_j) − Σ_{j=0}^{T} l_j(θ_p) = L(θ_c; D_c) − L(θ_p; D_c) + N • C_R • (R(θ_c) − R(θ_p)) + Σ_{j=1}^{T} (l_j(θ_j) − l_j(θ_p)) = L(θ_c; D_c) − L(θ_p; D_c) + N • C_R • (R(θ_c) − R(θ_p)) + Σ_{j=1}^{T} (max_{x,y} (l(θ_j; x, y) − l(θ_p; x, y)) + C_R • (R(θ_j) − R(θ_p))) ≤ α(T). Therefore we have Σ_{j=1}^{T} (max_{x,y} (l(θ_j; x, y) − l(θ_p; x, y)) + C_R • (R(θ_j) − R(θ_p))) ≤ α(T) + L(θ_p; D_c) − L(θ_c; D_c) + N • C_R • (R(θ_p) − R(θ_c)). Based on Lemma A.2, we further have Σ_{j=1}^{T} γ • max_{x,y} (l(θ_j; x, y) − l(θ_p; x, y)) ≤ α(T) + L(θ_p; D_c) − L(θ_c; D_c) + N • C_R • (R(θ_p) − R(θ_c)). The above inequality states that the average of the maximum loss differences over all previous rounds is bounded from above.
Therefore, among the T iterations there exists an iteration j* ∈ [T] (the one with the lowest maximum loss difference) such that θ_{j*} is ε-close to θ_p with respect to the loss-based distance, where ε = (α(T) + L(θ_p; D_c) − L(θ_c; D_c) + N • C_R • (R(θ_p) − R(θ_c))) / (T • γ). Theorem 4.1 characterizes the dependence of ε on α(T) and the constant term L(θ_p; D_c) − L(θ_c; D_c) + N • C_R • (R(θ_p) − R(θ_c)). To show the convergence of Algorithm 1, we need to ensure ε → 0 as T → +∞, which means we need to show that α(T) is sublinear in T (e.g., α(T) ≤ O(√T)). The following remark (restating Remark 1 in Section 4.2) and its proof show the desired convergence.

Remark 1. Online learning algorithms with sublinear regret bounds can be applied to show the convergence. Here, we adopt the regret analysis from McMahan (2017). Specifically, α(T) is of order O(log T), and we have ε ≤ O((log T)/T) when the loss function is Lipschitz continuous and the regularizer R(θ) is strongly convex, so ε → 0 as T → +∞. α(T) is also of order O(log T) when the loss function used for training is strongly convex and the regularizer is convex. Our FTL formulation can utilize the existing logarithmic regret bound of the adaptive FTL algorithm when the objective functions are strongly convex with respect to some norm ||·||, as illustrated in Section 3.6 of McMahan (2017). For clarity of presentation, we first restate their related results below.

Setting 1 (Setting 1 in McMahan (2017)). Given a sequence of objective loss functions f_1, f_2, ..., f_i and a sequence of incremental regularization functions r_0, r_1, ..., r_i, we consider an algorithm that selects the response point based on θ_1 = arg min_{θ∈R^d} r_0(θ) and θ_{i+1} = arg min_{θ∈R^d} Σ_{j=1}^{i} (f_j(θ) + r_j(θ)) + r_0(θ), for i = 1, 2, .... We simplify the summation notation with f_{1:i}(θ) = Σ_{j=1}^{i} f_j(θ).
Assume that each r_i is a convex function satisfying r_i(θ) ≥ 0 for i ∈ {0, 1, 2, ...}, against a sequence of convex loss functions f_i : R^d → R ∪ {∞}. Further, letting h_{0:i} = r_{0:i} + f_{1:i}, we assume dom h_{0:i} is non-empty. Recalling θ_i = arg min_θ h_{0:i−1}(θ), we further assume ∂f_i(θ_i) is non-empty. We denote the dual norm of a norm ||·|| as ||·||_*.

Theorem A.3 (Restatement of Theorem 1 in McMahan (2017)). Consider Setting 1, and suppose the r_i are chosen such that r_{0:i} + f_{1:i+1} is 1-strongly convex w.r.t. some norm ||·||_{(i)}. Define the regret of the algorithm with respect to a selected point θ* as Regret_T(θ*, f_i) ≡ Σ_{i=1}^{T} f_i(θ_i) − Σ_{i=1}^{T} f_i(θ*). Then, for any θ* ∈ R^d and any T > 0, with g_i ∈ ∂f_i(θ_i), we have Regret_T(θ*, f_i) ≤ r_{0:T−1}(θ*) + (1/2) • Σ_{i=1}^{T} ||g_i||²_{(i−1),*}.

Corollary A.3.1 (Formalization of the FTL result in Section 3.6 of McMahan (2017)). In the FTL framework (no individual regularizer is used in the optimization procedure), suppose each loss function f_i is 1-strongly convex w.r.t. a norm ||·||. Then we have Regret_T(θ*, f_i) ≤ (1/2) • Σ_{i=1}^{T} (1/i) • ||g_i||²_* ≤ (G²/2) • (1 + log T) with ||g_i||_* ≤ G.

Proof. The following is a restatement of the proof in Section 3.6 of McMahan (2017), and follows from Theorem A.3. Since we are considering the FTL framework, let r_i(θ) = 0 for all i and define ||θ||_{(i)} = √i • ||θ||. Observe that h_{0:i} (i.e., f_{1:i}) is 1-strongly convex with respect to ||·||_{(i)} (Lemma 3 in McMahan (2017)), and we have ||θ||_{(i),*} = (1/√i) • ||θ||_*. Then by applying Theorem A.3, we have Regret_T(θ*, f_i) ≤ (1/2) • Σ_{i=1}^{T} ||g_i||²_{(i),*} = (1/2) • Σ_{i=1}^{T} (1/i) • ||g_i||²_*. Based on the inequality Σ_{i=1}^{T} 1/i ≤ 1 + log T, and further assuming ||g_i||_* ≤ G, we have (1/2) • Σ_{i=1}^{T} (1/i) • ||g_i||²_* ≤ (G²/2) • (1 + log T).

Proof of Remark 1. We prove the logarithmic regret bound in Remark 1 using Corollary A.3.1. First of all, our online learning process fits into Setting 1. Specifically, we set r_i(θ) = 0 for all i.
For f_i(θ), when 1 ≤ i ≤ N, we set f_i(θ) = (1/N) • L(θ; D_c) + C_R • R(θ) (evenly distributing the term L(θ; D_c) + N • C_R • R(θ) across N iterations), and when i ≥ N + 1, we set f_i(θ) = l_{i−N}(θ). Details of l_i are given in the proof of Theorem 4.1. Therefore, f_i is 1-strongly convex with respect to a norm ||·|| (the norm is determined by the regularizer R(θ) and C_R). Further, l_{0:i}(θ) = f_{1:N+i}(θ). In addition, the assumption in Setting 1 that dom h_{0:i} is non-empty means that if we train a classifier on the poisoned data set, we can always return a model, and hence the assumption is satisfied. The assumption in Setting 1 of the existence of a subgradient ∂f_i(θ_i) is also satisfied in the poisoning attack scenario. The logarithmic bound on Regret(A_FTL, S_{θ_p}, T) for our algorithm then follows from the bound on Regret_T(θ*, f_i) in Corollary A.3.1. Specifically, l_{0:i}(θ) = f_{1:N+i}(θ) is 1-strongly convex with respect to the norm ||·||_{(i)} = √(N + i) • ||·||, and since we assume the loss function is G-Lipschitz, we have ||g_i||_* ≤ G. Therefore, we have the logarithmic regret bound: Regret(A_FTL, S_{θ_p}, T) ≤ α(T) = (1/2) • Σ_{i=1}^{T} (1/(i + N)) • ||g_i||²_* ≤ (1/2) • Σ_{i=1}^{T} (1/i) • ||g_i||²_* ≤ (G²/2) • (1 + log T) ≤ O(log T).

We next provide the proof of the certified lower bound (restating Theorem 4.2 from Section 4.3):

Theorem 4.2. Given a target classifier θ_p, to reproduce θ_p by adding the poisoning set D_p into D_c, the number of poisoning points |D_p| cannot be lower than sup_θ z(θ) = (L(θ_p; D_c) − L(θ; D_c) + N • C_R • (R(θ_p) − R(θ))) / (sup_{x,y} (l(θ; x, y) − l(θ_p; x, y)) + C_R • (R(θ) − R(θ_p))).

The main intuition behind the theorem is that when the number of poisoning points added to the clean training set is lower than the certified lower bound, then for the structural empirical risk minimization problem (shown in equation 1 in the main paper), the target classifier will always have higher loss than some other classifier and hence cannot be achieved. Proof.
We first show that for any model θ, we can derive a lower bound on the number of poisoning points required to obtain θ_p. Since each of these lower bounds holds, we can take the maximum over all of them and obtain a valid lower bound. Concretely, we first show that for any model θ, the minimum number of poisoning points cannot be lower than z(θ) = (L(θ_p; D_c) − L(θ; D_c) + N • C_R • (R(θ_p) − R(θ))) / (sup_{x,y} (l(θ; x, y) − l(θ_p; x, y)) + C_R • (R(θ) − R(θ_p))). Let us denote the point attaining the supremum of the loss difference between θ and θ_p as (x*, y*). Namely, l(θ; x*, y*) − l(θ_p; x*, y*) = sup_{x,y} (l(θ; x, y) − l(θ_p; x, y)). Now suppose we can obtain θ_p with a smaller number of poisoning points z < z(θ); that is, assume there is a poisoning set D_p of size z such that adding it to D_c results in θ_p. We have sup_{x,y} (l(θ; x, y) − l(θ_p; x, y)) ≥ (1/|D_c ∪ D_p|) • L(θ; D_c ∪ D_p) − (1/|D_c ∪ D_p|) • L(θ_p; D_c ∪ D_p) > C_R • (R(θ_p) − R(θ)), implying sup_{x,y} (l(θ; x, y) − l(θ_p; x, y)) + C_R • (R(θ) − R(θ_p)) > 0. Based on the assumption that z < z(θ), and the fact that sup_{x,y} (l(θ; x, y) − l(θ_p; x, y)) + C_R • (R(θ) − R(θ_p)) > 0, we have z • (l(θ; x*, y*) − l(θ_p; x*, y*) + C_R • (R(θ) − R(θ_p))) < z(θ) • (l(θ; x*, y*) − l(θ_p; x*, y*) + C_R • (R(θ) − R(θ_p))) = L(θ_p; D_c) − L(θ; D_c) + N • C_R • (R(θ_p) − R(θ)), where the equality is based on the definition of z(θ). On the other hand, by the definition of (x*, y*), for any D_p of size z we have L(θ; D_p) − L(θ_p; D_p) + z • C_R • (R(θ) − R(θ_p)) ≤ z • (l(θ; x*, y*) − l(θ_p; x*, y*) + C_R • (R(θ) − R(θ_p))). The above two inequalities imply that for any set D_p of size z we have (1/|D_c ∪ D_p|) • L(θ; D_c ∪ D_p) + C_R • R(θ) < (1/|D_c ∪ D_p|) • L(θ_p; D_c ∪ D_p) + C_R • R(θ_p), which indicates that after adding the poisoning set D_p to the training set D_c, the model θ has lower loss than θ_p, a contradiction to the assumption that θ_p has the lowest loss on D_c ∪ D_p and is achieved.
Now, since θ_p needs to have lower loss on D_c ∪ D_p than any classifier θ ∈ Θ, the best lower bound is the supremum over all models in the model space Θ.

Corollary 4.2.1. If we further assume bi-directional closeness in the loss-based distance, we can also derive a lower bound on the number of poisoning points needed to induce models that are ε-close to the target model. More precisely, if θ_1 being ε-close to θ_2 implies that θ_2 is also k • ε-close to θ_1, then we have sup_θ z'(θ) = (L(θ_p; D_c) − L(θ; D_c) − N • C_R • R* − N • k • ε) / (sup_{x,y} (l(θ; x, y) − l(θ_p; x, y)) + C_R • R* + k • ε), where R* is an upper bound on the nonnegative regularizer R(θ).

Proof of Corollary 4.2.1. The lower bound over all ε-close models to the target classifier is given exactly by inf_{θ'} sup_θ z(θ, θ') = inf_{θ'} sup_θ (L(θ'; D_c) − L(θ; D_c) + N • C_R • (R(θ') − R(θ))) / (sup_{x,y} (l(θ; x, y) − l(θ'; x, y)) + C_R • (R(θ) − R(θ'))), where the infimum is over models θ' that are ε-close to θ_p in the loss-based distance D_{l,X,Y}. However, this formulation is a min-max optimization problem, and it is hard to compute the lower bound analytically (by plugging the lower bound formula into Algorithm 1). Therefore, we make several relaxations so that the lower bound becomes computable. For any model θ' that is ε-close to θ_p, by the bi-directional assumption θ_p is k • ε-close to θ'. Therefore we have L(θ'; D_c) − L(θ; D_c) = L(θ'; D_c) − L(θ_p; D_c) + L(θ_p; D_c) − L(θ; D_c) ≥ −N • k • ε + L(θ_p; D_c) − L(θ; D_c), and sup_{x,y} (l(θ; x, y) − l(θ'; x, y)) ≤ sup_{x,y} (l(θ; x, y) − l(θ_p; x, y)) + sup_{x,y} (l(θ_p; x, y) − l(θ'; x, y)) ≤ sup_{x,y} (l(θ; x, y) − l(θ_p; x, y)) + k • ε, where the inequalities are based on θ_p being k • ε-close to θ'.
Plugging the above inequalities into the formula of z(θ, θ'), and using the assumption that 0 ≤ R(θ) ≤ R* for all θ ∈ Θ, we immediately have sup_θ z(θ, θ') ≥ sup_θ (L(θ_p; D_c) − L(θ; D_c) − N • k • ε + N • C_R • (R(θ') − R(θ))) / (sup_{x,y} (l(θ; x, y) − l(θ_p; x, y)) + k • ε + C_R • (R(θ) − R(θ'))) ≥ sup_θ (L(θ_p; D_c) − L(θ; D_c) − N • k • ε − N • C_R • R*) / (sup_{x,y} (l(θ; x, y) − l(θ_p; x, y)) + k • ε + C_R • R*) = sup_θ z'(θ). Since the inequality holds for any such θ', we have inf_{θ'} sup_θ z(θ, θ') ≥ sup_θ z'(θ), and hence z'(θ) is a valid lower bound.

Remark 2 (Improving the Results in Corollary 4.2.1). Assuming 0 ≤ R(θ) ≤ R* is not a strong assumption and can be satisfied by many common convex models. For example, for an SVM model with an ℓ2-regularizer (in fact, for any regularizer R(θ) with R(0) = 0), we have R(θ) ≤ 1/C_R and hence R* ≤ 1/C_R. Moreover, we can further tighten the lower bound by better bounding the term R(θ') − R(θ). Specifically, R(θ') − R(θ) = R(θ') − R(θ_p) + R(θ_p) − R(θ), and we only need tighter upper and lower bounds on R(θ') − R(θ_p), utilizing special properties of the loss functions. For the constant k in the bi-directional closeness, we can also compute its value for some specific loss functions. For example, for hinge loss, we can compute the value based on Corollary B.2.1 in Appendix B.
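The bound R* ≤ 1/C_R invoked in Remark 2 follows by comparing the regularized objective at the minimizer θ* to its value at θ = 0; a short derivation under the stated assumptions (hinge loss, so l(0; x, y) = 1, nonnegative loss, and R(0) = 0):

```latex
\frac{1}{|D|}\, L(\theta^*; D) + C_R\, R(\theta^*)
\;\le\; \frac{1}{|D|}\, L(0; D) + C_R\, R(0)
\;=\; \frac{1}{|D|} \sum_{(x,y)\in D} \max(1 - 0,\, 0)
\;=\; 1 .
```

Since L(θ*; D) ≥ 0, this gives C_R • R(θ*) ≤ 1, i.e., R(θ*) ≤ 1/C_R for every attainable model θ*.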

B CLOSENESS OF PARAMETERS

In the theorem below, we show how one can relate the notion of ε-closeness in Definition 1 in the main paper to closeness of parameters in the specific setting of hinge loss. We use this as an example to show that our notion of ε-closeness can be tightly related to the closeness of the models.

Theorem B.1. Consider the hinge loss function l(θ; x, y) = max(1 − y • ⟨x, θ⟩, 0) for θ ∈ R^d, x ∈ R^d, and y ∈ {−1, +1}. For θ, θ' ∈ R^d such that ||θ||_1 ≤ r and ||θ'||_1 ≤ r, if θ is ε-close to θ' in the loss-based distance, then ||θ − θ'||_1 ≤ r • ε.

Remark 3. In Theorem B.1 with an ℓ2-regularizer, an upper bound on the ℓ1-norm of θ and θ' is d/C_R; however, the models that we care about in practice usually have smaller norms. Remark 3 can be obtained by plugging in 0 ∈ R^d and comparing the resulting (regularized) optimization loss to that of the model θ* that minimizes the loss.

Proof of Theorem B.1. We construct a point x* as follows: x*_i = 1/r if θ_i > θ'_i, and x*_i = −1/r if θ_i ≤ θ'_i, for i ∈ [d]. Then we have ⟨θ − θ', x*⟩ = (1/r) • ||θ − θ'||_1. (3) Since ||θ||_1 ≤ r we have ⟨x*, θ⟩ ≥ −1, (4) and similarly since ||θ'||_1 ≤ r we have ⟨x*, θ'⟩ ≥ −1. (5) Therefore, by Inequalities 4 and 5 we have l(θ; x*, −1) − l(θ'; x*, −1) = max(1 + ⟨x*, θ⟩, 0) − max(1 + ⟨x*, θ'⟩, 0) = ⟨θ − θ', x*⟩, which by equation 3 implies l(θ; x*, −1) − l(θ'; x*, −1) = (1/r) • ||θ − θ'||_1. Now, since the loss difference between θ and θ' is bounded by ε for all x ∈ R^d, the bound also holds for the point (x*, −1), meaning that (1/r) • ||θ − θ'||_1 ≤ ε, which completes the proof.

Theorem B.2. Consider the hinge loss function l(θ; x, y) = max(1 − y • ⟨x, θ⟩, 0) for θ ∈ R^d, x ∈ R^d, and y ∈ {−1, +1}. For X = {x ∈ R^d : ||x||_1 ≤ q} and Y = {−1, +1}, and for any two models θ, θ', if ||θ − θ'||_1 ≤ ε, then θ is q • ε-close to θ' in the loss-based distance. Namely, D_{l,X,Y}(θ, θ') ≤ q • ε. Proof.
For any given θ and θ', by the triangle-type inequality for the maximum, we have l(θ; x, y) − l(θ'; x, y) = max(1 − y • ⟨x, θ⟩, 0) − max(1 − y • ⟨x, θ'⟩, 0) ≤ max(0, y • ⟨x, θ' − θ⟩). Therefore, we have max_{(x,y)∈X×Y} (l(θ; x, y) − l(θ'; x, y)) ≤ max_{(x,y)∈X×Y} max(0, y • ⟨x, θ' − θ⟩). Our goal is then to obtain an upper bound of O(ε) for max_{(x,y)∈X×Y} y • ⟨x, θ' − θ⟩ when ||θ − θ'||_1 ≤ ε. To maximize y • ⟨x, θ' − θ⟩ by choosing x and y, we only need to ensure that sign(y • x_i) = sign(θ'_i − θ_i) for i ∈ [d]. Therefore, based on the fact that (1/q) • |x_i| ≤ 1 for i ∈ [d] (since ||x||_1 ≤ q), we have max_{(x,y)∈X×Y} (1/q) • y • ⟨x, θ' − θ⟩ = Σ_{i=1}^{d} (1/q) • |x_i| • |θ'_i − θ_i| ≤ Σ_{i=1}^{d} |θ'_i − θ_i| = ||θ − θ'||_1 ≤ ε, which concludes the proof.

Corollary B.2.1. For hinge loss, by Theorems B.1 and B.2, if θ is ε-close to θ', then θ' is r • q • ε-close to θ. (7)

Now we can calculate C as follows: C = inf_{θ∈F s.t. R(θ_p)>R(θ)} sup_{x,y} (l(θ; x, y) − l(θ_p; x, y)) / (R(θ_p) − R(θ)) ≥ inf_{θ∈F s.t. R(θ_p)>R(θ)} (l(θ; x*_θ, +1) − l(θ_p; x*_θ, +1)) / (R(θ_p) − R(θ)) (by Inequality 7) ≥ inf_{θ∈F s.t. R(θ_p)>R(θ)} (1 − α_θ) / (R(θ_p) − R(θ)) (by the definition of α_θ) ≥ inf_{θ∈F s.t. R(θ_p)>R(θ)} (1 − α_θ) / (R(θ_p) • (1 − α²_θ)) ≥ 1 / (2 • R(θ_p)). Therefore γ ≥ 1 − 2 • C_R • R(θ_p). On the other hand, we can also calculate α(T) based on the exact form given in the proof of Theorem 4.1.
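As a numeric sanity check of the Theorem B.1 construction (illustrative numbers; sign ties at θ_i = θ'_i are ignored, which random draws avoid almost surely):

```python
import numpy as np

def hinge(theta, x, y):
    return max(1 - y * float(theta @ x), 0.0)

rng = np.random.default_rng(1)
theta = rng.normal(size=5)
theta2 = rng.normal(size=5)
# Choose r so that ||theta||_1 <= r and ||theta'||_1 <= r, as in Theorem B.1.
r = max(np.abs(theta).sum(), np.abs(theta2).sum())

# The construction: x*_i = sign(theta_i - theta'_i) / r, so that both inner
# products stay >= -1 and the hinge at (x*, -1) is in its linear regime.
x_star = np.sign(theta - theta2) / r
diff = hinge(theta, x_star, -1) - hinge(theta2, x_star, -1)

# Loss difference at (x*, -1) equals ||theta - theta'||_1 / r.
assert abs(diff - np.abs(theta - theta2).sum() / r) < 1e-9
```

Hence, if the loss-based distance is at most ε, the ℓ1 parameter distance is at most r • ε, exactly as the theorem states.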

D INDISCRIMINATE SETTING EXPERIMENTS

In this section, we evaluate the attacks in the conventional indiscriminate attack setting, where the attacker's goal is just to reduce the overall accuracy of the model.

Datasets and Models.

For the indiscriminate attack, we use the MNIST 1-7 dataset, which consists of the digits 1 and 7 and is commonly used for evaluating indiscriminate poisoning attacks against binary classification (Steinhardt et al., 2017; Biggio et al., 2012; Xiao et al., 2012). MNIST 1-7 contains 13,007 training and 2,163 test samples. The dataset contains 784 features, and all the features are normalized into the range [0, 1]. For completeness, the Adult dataset used for the subpopulation attack is downsampled to form a class-balanced dataset and contains 15,682 training and 7,692 test samples. The dataset contains 57 features, and the features are also normalized into the range [0, 1] (except for the binary features). All of the processed datasets are included in the supplementary material. We again adopt a linear SVM model in the indiscriminate attack scenario. All of the models for both datasets use the regularization parameter C_R = 0.09. The clean accuracy of the SVM model on MNIST 1-7 is 98.9%, and the accuracy on the Adult dataset is 78.5%.

Target Classifiers. The clean MNIST 1-7 model has around a 1% error rate on the test set. For our experiment, we aim to generate three target classifiers with overall test errors around 5%, 10%, and 15%. To generate target classifiers with the desired error rates, we follow the heuristic strategy proposed by Koh et al. (2018) to generate multiple candidate target classifiers, and then, among all the valid candidates that satisfy the error rate requirement, we choose the one with the lowest loss on the clean training set. Using this approach, the final target classifiers have overall test accuracies of 94.0%, 88.8%, and 83.3%, respectively. (We describe a better way of finding target classifiers in Appendix D.1, but for comparison purposes do not use those in the results here.)

Convergence.
We show the convergence of Algorithm 1 by reporting the maximum loss difference and the Euclidean distance between the classifier induced by the attack and the target classifier. Figures 3a and 3b summarize the results for the target classifier with a 10% error rate. The maximum number of poisoning points in the figure is obtained when the classifier from Algorithm 1 is 0.1-close to the target classifier in the loss-based distance. In Figure 3, the classifier induced by our algorithm steadily converges to the target classifier in both the maximum loss difference and the Euclidean distance, while the classifier induced by the KKT attack diverges initially and only then starts to converge to the target model. At the maximum number of points, the maximum loss difference of the KKT-induced classifier to the target is 0.46, compared to 0.1 for the classifier induced by our attack. For the Euclidean distance, the KKT-induced classifier is 0.16 away, compared to 0.07 for the classifier induced by our attack.

Attack Success. We next compare the classifier induced by our attack to the classifier induced by the KKT attack in terms of overall test accuracy. As before, the maximum number of poisoning points in Figure 4 is obtained by running our attack with 0.1-closeness (in loss-based distance) to the target as the input. In terms of test accuracy, our attack has an attack success rate comparable to the KKT attack. Specifically, for target models of 5% and 10% error rates, both methods have almost identical performance, as shown in Figures 4a and 4b. For the target model with a 15% error rate (83.3% test accuracy), the KKT attack is more successful than our attack, inducing models with 82.7% accuracy (17.3% attack success rate) compared to 85.9% accuracy (14.1% attack success rate) for our attack.
Interestingly, in this case, with fewer poisoning points our attack results in models with lower test accuracy (higher attacker success) than the KKT attack. To summarize, for the indiscriminate scenario, our attack produces classifiers that are much closer to the target models than those of the KKT attack, with comparable attack success rates.

Lower Bound on Number of Poisons. We next check the optimality of our attack in the indiscriminate attack scenario. As in the subpopulation attack setting, we use Theorem 4.2 to compute the lower bound for the classifier induced by our attack, using it as the input to Algorithm 1 and terminating when the induced classifier is 0.1-close to the given target model. [Table caption: All results are averaged over four runs. An integral value in a cell means we get exactly that value for all four runs; for the one cell where we observe variation, we report the average and standard error. The maximum number of poisons is set using the 0.1-close threshold to the KKT-induced classifier.] The calculated lower bound shows that there exists a relatively large gap between the lower bound and the number of poisoning points used, especially for the classifier induced by our attack for the target model with a 5% error rate, where the lower bound is only 50% of the actual number of poisoning points used. For the induced classifier for the target model with a 15% error rate, the gap is smallest, with the lower bound at 79% of the number of poisoning points. The relatively large gap indicates that either the estimated lower bound is not tight or the attack itself is not close to optimal. To gain more insight, we further show the computed lower bound at each iteration of Algorithm 1; Figure 5 summarizes the results.
From the figure, we see that the peak value of the curve (i.e., the highest lower bound) always appears before the termination of the algorithm, indicating that the computed lower bound is likely to be tight and that we may need to further improve the attack algorithm. For completeness, we also repeat the lower bound computation for classifiers induced by the KKT attack. The KKT-induced classifiers are obtained by running the KKT attack with target classifiers of different error rates as input. The target number of poisoning points for the KKT attack is given by the size of the poisoning set returned by our attack once our algorithm's induced classifier is 0.1-close to the target model of the corresponding error rate. The lower bound computation is then identical to the above: we simply provide the KKT-induced classifier as the target input to Algorithm 1 and terminate when the induced classifier from our algorithm is 0.1-close to the given target model (i.e., the KKT-induced classifier). The results are summarized in Table 3, and we observe that there is also a relatively large gap between the lower bound and the number of poisoning points used by the attack. Similarly, we plot the lower bound computed at different iterations of Algorithm 1 in Figure 6, and find that the peak value also appears before the termination of the algorithm, again indicating that the lower bound might be tight and that we need a stronger attack strategy to close the gap.
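The per-model lower bound z(θ) from Theorem 4.2, evaluated at each intermediate model of Algorithm 1 as described above, can be sketched as follows. This is a hedged illustration: hinge loss, an ℓ2 regularizer, and a brute-force search over an assumed [−1, 1]^d feasible box for the sup over (x, y) are illustrative choices, not the paper's implementation.

```python
import numpy as np

def hinge(theta, X, y):
    return np.maximum(1 - y * (X @ theta), 0.0)

def max_loss_diff(theta, theta_p, d, n_cand=5000, seed=0):
    # Approximate sup over [-1,1]^d x {-1,+1} of l(theta) - l(theta_p)
    # by brute force over random candidates (illustration only).
    rng = np.random.default_rng(seed)
    Xc = rng.uniform(-1, 1, size=(n_cand, d))
    best = -np.inf
    for yc in (-1.0, 1.0):
        diff = hinge(theta, Xc, yc) - hinge(theta_p, Xc, yc)
        best = max(best, float(diff.max()))
    return best

def lower_bound(theta, theta_p, X, y, C_R):
    """z(theta) from Theorem 4.2, with an l2 regularizer R(t) = ||t||^2 / 2."""
    N = len(X)
    R = lambda t: 0.5 * float(t @ t)
    num = (hinge(theta_p, X, y).sum() - hinge(theta, X, y).sum()
           + N * C_R * (R(theta_p) - R(theta)))
    den = max_loss_diff(theta, theta_p, X.shape[1]) + C_R * (R(theta) - R(theta_p))
    return num / den if den > 0 else 0.0
```

One then evaluates `lower_bound` at every intermediate model θ encountered while running Algorithm 1 against the induced (or KKT-induced) model as target, and reports the maximum over iterations, which is exactly the curve plotted in Figures 5 and 6.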

D.1 IMPROVED TARGET GENERATION PROCESS

The original heuristic approach works by finding different quantiles of training points that have high loss on the clean model, flipping their labels, repeating those points for multiple copies, and adding them to the clean training set. We find that, in the process of trying different quantiles and copies of high-loss points, if we also adaptively update the model on which the high-loss points are found (instead of always fixing it to be the clean model), we can generate a valid target classifier with even lower loss on the clean training set. This improved generation process can significantly reduce the number of poisoning points needed to reach the same ε-closeness (with respect to the loss-based distance) to the target classifier, consistent with the claims in Theorem 4.1 in the main paper. In addition, we find that if we compare our attack with the improved generation process to the KKT attack with the original generation process (Koh et al., 2018), we can also reach the desired target error rate much faster using our attack.

Implication of Theorem 4.1. We first empirically validate the implication of Theorem 4.1 in the main paper: to obtain the same ε-closeness in loss-based distance, a target classifier with lower loss on the clean training set D_c requires fewer poisoning points. Therefore, when adversaries have multiple target classifiers that satisfy the attack goal, the one with lower loss on the clean training set is preferred. For both the original and improved target generation methods, we generate three target classifiers with error rates of 5%, 10%, and 15%. The original generation method returns classifiers with test accuracies of 94.0%, 88.8%, and 83.3%, respectively (also used in the previous indiscriminate attack experiments). The improved target generation process returns target classifiers with approximately the same test accuracies (94.9%, 88.9%, and 84.5%).
However, comparing classifiers of the same error rate from the two generation methods, the improved method produces classifiers with significantly lower loss than the original approach. Table 4 compares the two target generation approaches by showing the number of poisoning points needed to get 0.1-close to the corresponding target model of the same error rate. For example, for target models with a 15% error rate, the model from the original approach has a total clean loss of 5428.4, while our improved method reduces it to 4641.6. With the reduced clean loss, getting 0.1-close to the target model generated by our improved process requires only 3206 poisoning points, while reaching the same distance from the target model produced by the original method requires 6762 poisoning points, a reduction of more than 50%.

End-to-End Comparison. Figure 7 compares the two attacks end-to-end in terms of test accuracy. With the improved target generation process, our attack can achieve the desired error rate much faster than the KKT attack with the original process. For the KKT attack with a target model generated by the original process, we determine the target number of poisoning points using the size of the poisoning set returned by running Algorithm 1 with 0.1-closeness and the target model from the original process as inputs. To run our attack with the improved generation process, we terminate the algorithm when the number of poisoning points equals the number used by the KKT attack with the original process. This termination criterion ensures that both attacks use the same number of poisoning points and can be compared easily. We also evaluate the KKT attack at fractions of the maximum target number of poisoning points (0.2, 0.4, 0.6, and 0.8), as in the previous experiments.
The accuracy plot shows that our attack (with the improved target model) achieves the desired error rate (e.g., 10%) much faster than the KKT attack (with the original target model), especially for the target classifiers with error rates of 10% and 15%.
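The improved generation process described above can be sketched as follows. This is a minimal illustration, not the exact implementation used in the paper: `train` and `point_loss` are placeholder callables standing in for the victim's training routine and per-point loss, and the quantile, copy count, and round count are illustrative defaults.

```python
import numpy as np

def improved_target_generation(X, y, train, point_loss,
                               quantile=0.9, copies=3, rounds=2):
    """Sketch of the improved target-generation heuristic.

    In each round, find the training points whose loss under the
    *current* model (adaptively updated, rather than fixed to the clean
    model) exceeds the given quantile, flip their labels (y in {-1, +1}),
    replicate them `copies` times, and retrain on the augmented set.
    """
    X_aug, y_aug = X.copy(), y.copy()
    model = train(X_aug, y_aug)                       # start from the clean model
    for _ in range(rounds):
        losses = point_loss(model, X, y)              # per-point loss on clean data
        hi = losses >= np.quantile(losses, quantile)  # high-loss points
        X_flip = np.repeat(X[hi], copies, axis=0)     # multiple copies ...
        y_flip = np.repeat(-y[hi], copies)            # ... with flipped labels
        X_aug = np.vstack([X_aug, X_flip])
        y_aug = np.concatenate([y_aug, y_flip])
        model = train(X_aug, y_aug)                   # adaptively update the model
    return model
```

The adaptive update on the last line is the only change relative to the original heuristic, which would keep `model` fixed to the clean model across rounds.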

E COMPARISON OF MODEL-TARGETED AND OBJECTIVE-DRIVEN ATTACKS

Although model-targeted attacks work by inducing given target classifiers through poisoning points, the end goal is still to achieve the attacker objectives encoded in the target models. Comparing to objective-driven attacks, we first demonstrate that an objective-driven attack can be used to generate a target model, which can then be used as the target for a model-targeted attack, resulting in an attack that achieves the desired attacker objective with fewer poisoning points. Then, we show that in order to be competitive with state-of-the-art objective-driven attacks (e.g., the min-max attack (Steinhardt et al., 2017)), the target classifiers should be generated carefully, so that the attacker objectives of the target classifiers can be achieved efficiently by model-targeted attacks using fewer poisoning points. Although a systematic approach to generating such "desired" classifiers is out of the scope of this paper, in the indiscriminate setting we have some empirical evidence. Specifically, we find that target classifiers with lower loss on the clean training set and higher error rates (higher than what is required by the attacker objective) often require fewer poisoning points to achieve the attacker objectives. The following experiments are conducted on the MNIST 1-7 dataset. Target Models Generated from Objective-Driven Label-Flipping Attacks. In our experiments, the target classifiers are generated from label-flipping-based objective-driven attacks that are effective but need too many poisoning points to achieve their objective. Our attack is then deployed to achieve the same objective with fewer poisoning points. Table 5 shows the number of poisoning points used by the label-flipping attack and by our model-targeted attack to achieve the attack objectives of increasing the test error by certain amounts.
We can see that our attack reduces the number of poisoning points used by the label-flipping attacks by up to 73%. Comparison to the Min-Max Attack. Still using target classifiers generated from label-flipping attacks, we show that our attack can outperform the state-of-the-art min-max attack (Steinhardt et al., 2017) at reducing overall test accuracy given the same number of poisoning points. Since we aim to produce target classifiers with lower loss on the clean training set and higher error rates, we adopt the improved target model generation process described in Section D.1 (which reduces the loss on the clean training set) and generate a classifier with a 15% error rate. Given this target model, we terminate our attack when the induced model is 0.1-close to the target model in the loss-based distance. To compare our attack to the min-max attack conveniently, we compare their accuracy reduction at different poisoning ratios (5%, 15% and 30%), which is the common evaluation strategy for objective-driven attacks in the indiscriminate setting (Biggio et al., 2011; Steinhardt et al., 2017; Koh et al., 2018). Table 6 summarizes the results. Compared to the min-max attack, our attack reduces test accuracy more under the same poisoning budget, and the gap widens as the poisoning ratio increases. Comparison to the Label-Flipping Subpopulation Attack. We also compare our attack to the label-flipping subpopulation attack from Jagielski et al. (2019). This attack works by randomly sampling a fixed number (constrained by the poisoning budget) of instances from the training data of the subpopulation, flipping their labels, and injecting them into the original training set. Although this attack is very simple, it achieves relatively high attack success when the goal is to cause misclassification of the selected subpopulation (Jagielski et al., 2019).
To be consistent with our experiments in Section 5, we assume the attacker objective is still to induce a model that has 0% accuracy on a selected subpopulation. For each of the SVM and logistic regression models, we select the three subpopulations with the highest test accuracy (all have 100% accuracy). In the indiscriminate setting, we already observed that models with lower loss on the clean training set and larger overall error rates achieve attacker objectives of smaller error rates faster. However, leveraging this observation in our subpopulation experiments poses a challenge: the attacker objective is to have 100% test error on the subpopulation, and no classifier can have a test error larger than 100%. To tackle this, we select models with larger loss on the training samples from the subpopulation, in the hope that this is "equivalent" to selecting target models with error rates (on the subpopulation) larger than 100%. To this end, we heuristically select target models that satisfy the attacker objective, have larger loss on the training data from the subpopulation, and have relatively low loss on the entire clean training set. Empirically, this selection strategy achieves the attacker objectives better than the original target generation process (as used in Section 5). A more detailed and systematic investigation of the target model search process is left as future work. To check the effectiveness of achieving the attacker objectives, we first run our attack, terminate when it achieves the objective of 0% accuracy on the selected subpopulation, and record the number of poisoning points used. Then, we run the random label-flipping attack with the same number of poisoning points. For both attacks, we report the final test accuracies of the resulting models on the subpopulations. The attack comparisons on different subpopulation clusters and models are given in Table 7.
Results in the table compare our attack and the label-flipping attack over the three distinct subpopulation clusters for the SVM and logistic regression models. Across all settings, our attack is more successful, and the number of poisoning points needed to reach the 0% accuracy goal is small compared to the entire training set size (the maximum poisoning ratio is only 10.5%). However, the gap between our attack and the label-flipping attack is fairly small. For example, for Cluster 1 in the SVM experiment, the label-flipping attack is also quite successful and reduces the test accuracy to 2.8% (our attack achieves 0% accuracy). We believe the success of the label-flipping attack has two causes. First, label flipping in the subpopulation setting can succeed because smaller subpopulations exhibit some degree of locality, so injecting points from the subpopulation with flipped labels can have a strong impact on that subpopulation. This is confirmed by empirical evidence that increasing the subpopulation size (i.e., reducing its locality) gradually reduces the effectiveness of label flipping, and the attack becomes much less effective in the indiscriminate setting (i.e., when the subpopulation is the entire population). Second, the Adult dataset only contains 57 features, 53 of which are binary features with additional constraints. Therefore, the benefit of optimizing feature values is less significant, as the optimization search space of our attack is fairly limited.
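The label-flipping subpopulation baseline described above can be sketched as follows. This is a minimal version under our own assumptions: `subpop_mask` is a boolean membership vector, and we sample with replacement in case the poisoning budget exceeds the subpopulation size.

```python
import numpy as np

def subpop_label_flip(X, y, subpop_mask, n_poison, seed=0):
    """Label-flipping subpopulation baseline (after Jagielski et al., 2019):
    randomly sample n_poison instances from the subpopulation, flip their
    labels (y in {-1, +1}), and append them to the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(np.flatnonzero(subpop_mask), size=n_poison, replace=True)
    return np.vstack([X, X[idx]]), np.concatenate([y, -y[idx]])
```

Because the poisoned points are copies of real subpopulation instances, the baseline exploits exactly the locality effect discussed above, with no feature-space optimization at all.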

F ADDITIONAL EXPERIMENTS

In this section, we provide the results of evaluating our attack and the KKT attack on SVM models for an additional dataset (Section F.1) and on logistic regression models for three datasets (Section F.2). The results here are consistent with our findings on the datasets and models in Section 5, and provide further evidence of the general effectiveness of our attack.

F.1 ATTACKS ON SVM TRAINED ON DOGFISH

In this section, we present the results for the SVM model evaluated on the Dogfish dataset. Dataset. The Dogfish dataset contains dog and fish images of dimensions 298 × 298 × 3. This dataset has been used as a binary classification task by previous works evaluating poisoning attacks in the indiscriminate setting (Koh & Liang, 2017; Steinhardt et al., 2017; Koh et al., 2018). To make classification easy for linear models, both Steinhardt et al. (2017) and Koh et al. (2018) use features extracted from the original images by the ImageNet Inception model (Szegedy et al., 2016) and then apply linear models to complete the classification task. We also adopt the extracted features for classification, so each instance has 2,048 features. The Dogfish dataset has 1,800 training samples and 600 test samples. As in the previous work, we conduct a conventional indiscriminate attack on it, where the adversary's goal is to reduce overall test accuracy.

Table 8: SVM on Dogfish dataset: poisoning points needed to achieve target classifiers induced by our attack. The top row gives the number of poisoning points used by our attack; the bottom row gives the lower bound computed from Theorem 4.2 for the induced classifiers.

Target Classifiers. The clean accuracy of the SVM model on the Dogfish dataset is 98.5%. We generate target classifiers with overall test error rates around 10%, 20% and 30% using a similar (heuristic) target generation process (Koh et al., 2018) as in the case of the MNIST 1-7 dataset. The final test accuracies of the obtained target classifiers are 89.3%, 78.3% and 67.2% respectively. Attack Results. The convergence of our attack is shown in Figure 8, both in the loss-based distance and in the actual model distance in ℓ2-norm. The maximum number of poisoning points for these experiments is obtained when the classifier from Algorithm 1 is 2.0-close to the target classifier.
Our attack steadily converges to the target model under both metrics, and converges faster than the KKT attack. Similar observations hold in the other attack settings, as summarized in Figure 9: our attack is slightly more successful than the KKT attack for all three target classifiers we tested. For the models induced by our attack, Table 8 also shows the gap between the number of poisoning points used by our attack and the theoretical lower bound.

For logistic regression on MNIST 1-7, our attack converges to the target model while the KKT attack diverges. Similar observations hold in the other attack settings, as shown in Figure 13. In these settings, our attack is much more successful than the KKT attack. In fact, the KKT attack seems unable to find a useful set of poisoning points, as its induced models do not show a significant accuracy drop from the clean accuracy of 98.1%. We suspect this is due to the highly non-convex nature of the attacker objective when attacking logistic regression models. In contrast, our attack only needs to maximize the difference of two logistic losses, which is simpler than the KKT attacker objective and results in a successful attack.

Results on Dogfish. Figure 14 shows the attack results for logistic regression models on the Dogfish dataset. The maximum number of poisoning points for these experiments is obtained when the classifier from Algorithm 1 is 1.0-close to the target classifier. Our attack converges to the target model while the KKT attack fails to converge. Attack success comparisons are given in Figure 15. As with MNIST 1-7, our attack succeeds in settings where the KKT attack does not produce significant accuracy drops from the clean accuracy of 98.5%. We believe this is also due to the highly non-convex nature of the KKT attack objective.



Similar to previous works, in this paper we focus on designing a model-targeted attack that works for any achievable target model, and we leave the exploration of finding better target classifiers as future work. In practice, the data space X is a closed convex set and hence we can find (x*, y*) using convex optimization; in other words, as we saw in the experiments, calculating the lower bound is possible in practical scenarios. In the lower bound computation, when the loss difference is not concave, we can use a concave upper bound for it to obtain a valid lower bound; as long as the upper bound is relatively tight, the lower bound is still meaningful. The attacker objective of the KKT attack is related to minimizing the norm of a gradient, and becomes complicated for logistic regression models; details of the attack formulation are in Koh et al. (2018).



1. N data points are drawn uniformly at random from the true data distribution over X × Y and form the clean training set, D_c.
2. The adversary, with knowledge of D_c, the model training process and the model space Θ, generates a target classifier θ_p ∈ Θ that satisfies the attack goal.
3. The adversary produces a set of poisoning points, D_p, with knowledge of D_c, the model training process, Θ and θ_p.
4. The model builder trains the model on D_c ∪ D_p and produces a classifier, θ_atk.

Algorithm 1: ModelTargetedPoisoning
Input: D_c, the loss functions (L and l), θ_p
Output: D_p
1: D_p = ∅
2: while stopping criterion not met do
3:   θ_t = argmin_θ L(θ; D_c ∪ D_p)
4:   (x*, y*) = argmax_{(x,y) ∈ X × Y} [l(θ_t; x, y) − l(θ_p; x, y)]
5:   D_p = D_p ∪ {(x*, y*)}
6: end while
7: return D_p

Algorithm 1 takes as input the clean training set D_c, the loss functions (L for a set of points and l for an individual point) and the target model θ_p, and outputs the set of poisoning points D_p. The algorithm is simple: in each iteration, the adversary first trains the intermediate model θ_t on the mixture of clean and poisoning points D_c ∪ D_p, with D_p the empty set in the first iteration (Line 3).
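Algorithm 1 can be sketched in a few lines of Python. This is a schematic rendering, not the paper's implementation: `train`, `max_loss_diff_point` and `stop` are placeholder callables, since the concrete form of the training step (Line 3) and the inner maximization (Line 4) depends on the model class.

```python
def model_targeted_poisoning(D_c, train, max_loss_diff_point, theta_p, stop):
    """Sketch of Algorithm 1 (ModelTargetedPoisoning).

    `train` refits the victim model on a list of (x, y) pairs (Line 3);
    `max_loss_diff_point` solves the inner maximization of
    l(theta_t; x, y) - l(theta_p; x, y) over the data space (Line 4)."""
    D_p = []
    theta_t = train(D_c + D_p)
    while not stop(theta_t, D_p):
        x_star, y_star = max_loss_diff_point(theta_t, theta_p)  # Line 4
        D_p.append((x_star, y_star))                            # Line 5
        theta_t = train(D_c + D_p)                              # Line 3, next round
    return D_p
```

Because each poisoning point is chosen against the current intermediate model, the attack is online: stopping the loop early still yields a valid (shorter) poisoning set.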

Figure 1: Attack convergence (results shown are for the first subpopulation, Cluster 0). The maximum number of poisoning points is set using the 0.01-close threshold to the target classifier.

After at most T steps, Algorithm 1 produces the poisoning set D_p such that the classifier trained on D_c ∪ D_p is ε-close to θ_p with respect to the loss-based distance D_{l,X,Y}, for

ε = (α(T) + L(θ_p; D_c) − L(θ_c; D_c)) / (T · γ)

where γ is a constant for a given θ_p and classification task, and α(T) is the regret of the online algorithm when the loss function used for training is convex. The goal of the adversary is to get ε-close to θ_p (in terms of the loss-based distance) by injecting a (potentially small) number of poisoning points. The algorithm is in essence an online learning procedure, and we transform Algorithm 1 into the form of a standard online learning problem. Specifically, we adopt the follow-the-leader (FTL) framework to describe Algorithm 1 in the language of standard online learning. We first describe the online learning setting considered in this paper and the notion of regret. Definition 3. Let L be a class of loss functions, Θ a set of possible models, A : (Θ × L)* → Θ an online learner, and S : (Θ × L)* × Θ → L a strategy for picking loss functions in different rounds of online learning (the adversarial environment in the context of online convex optimization). We use Regret(A, S, T) to denote the regret of A against S in T rounds. Namely,

Regret(A, S, T) = Σ_{t=1}^{T} f_t(θ_t) − min_{θ ∈ Θ} Σ_{t=1}^{T} f_t(θ)

where f_t ∈ L is the loss function chosen by S and θ_t the model chosen by A in round t.
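A concrete reading of the bound above: ε shrinks as T grows whenever the regret α(T) is sublinear (e.g., logarithmic for FTL with strongly convex losses). The following numeric sketch evaluates the bound directly; the loss values and the constant γ are placeholders, not values from our experiments.

```python
def epsilon_bound(alpha_T, loss_target_clean, loss_clean_clean, T, gamma):
    """Loss-based distance guaranteed after T steps of Algorithm 1:
    eps = (alpha(T) + L(theta_p; D_c) - L(theta_c; D_c)) / (T * gamma).
    alpha_T is the regret alpha(T) of the online learner after T rounds."""
    return (alpha_T + loss_target_clean - loss_clean_clean) / (T * gamma)
```

With α(T) sublinear, the bound vanishes as T → ∞, so any achievable target can be approached arbitrarily closely; it also shows why targets with smaller L(θ_p; D_c) need fewer poisoning points for the same ε.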

Figure 3: Attack convergence (results shown are for the target classifier with 10% error rate). The maximum number of poisoning points is set using the 0.1-close threshold to the target classifier.

Figure 5: Lower bound computed in each iteration of running Algorithm 1 when the target classifier of the algorithm is the classifier induced by our attack (classifier in Table 2). The maximum number of poisoning points is set using the 0.1-close threshold to the classifier induced by our attack.

Figure 6: Lower bound computed in each iteration of running Algorithm 1 when the target classifier of the algorithm is the classifier induced by the KKT attack (classifier in Table 3). The maximum number of poisoning points is set using the 0.1-close threshold to the KKT-induced classifier.

Figure 7: Test accuracy of classifiers obtained from our attack and the KKT attack. The target model for the KKT attack is generated by the original generation process, and the target model for our attack by the improved generation process. The maximum number of poisoning points is obtained by running our attack with the target model generated by the original process until the resulting classifier is 0.1-close to the target.

Figure 8: SVM on Dogfish dataset: attack convergence (results shown are for the target classifier with 10% error rate). The maximum number of poisoning points is set using the 2.0-close threshold to the target classifier.

Figure 9: SVM on Dogfish dataset: test accuracy of each target model of given error rate with classifiers induced by poisoning points obtained from our attack and the KKT attack.

Figure 10: Logistic regression model on Adult dataset: attack convergence (results shown are for the first subpopulation, Cluster 0). The maximum number of poisoning points is set using the 0.05-close threshold to the target classifier.

Figure 12: Logistic regression model on MNIST 1-7 dataset: attack convergence (results shown are for the target classifier with 10% error rate). The maximum number of poisoning points is set using the 0.1-close threshold to the target classifier.

Figure 14: Logistic regression model on Dogfish dataset: attack convergence (results shown are for the target classifier with 10% error rate). The maximum number of poisoning points is set using the 1.0-close threshold to the target classifier.






Table 4: Comparison of the two target generation methods on the number of poisoning points needed to reach 0.1-closeness to the target. "Original" indicates the original target generation process from Koh et al. (2018); "Improved" denotes our improved target generation process with adaptive model updating.

Table 5: Target classifiers generated using objective-driven label-flipping attacks, with similar attacker objectives achieved by our attack using fewer poisoning points. The attacker objectives are to increase the test error to certain amounts (i.e., 5%, 10% and 15%), and the target classifiers for our attack are generated by running the label-flipping attacks with the given attacker objectives.

Table 6: Comparison of our attack to the min-max attack at different poisoning ratios. The target model of our attack has a 15% error rate. The poisoning ratio is with respect to the full training set size of 13,007. Each cell in the table gives the test accuracy of the classifier after poisoning; the clean test accuracy is 98.9%. Our attack at the 30% poisoning ratio is marked with "*" because the attack terminates when the induced model is 0.1-close to the target model, which uses only 2,894 poisoning points, less than the 30% ratio.

Table 7: Comparison of our attack to the label-flipping-based subpopulation attack. The table compares the test accuracy on subpopulations of the Adult dataset under the same number of poisoning points; the number of poisoning points is determined by when our attack achieves 0% test accuracy on the subpopulation. Clusters 0-2 in the logistic regression and SVM models denote different clusters. For logistic regression, the numbers of poisoning points for Clusters 0-2 are 1,575, 1,336 and 1,649 respectively; for SVM, they are 1,252, 1,268 and 1,179 respectively.


Although our attack can induce the target models using very few poisoning points (recall that the entire training set contains 1,800 samples), a gap remains between the number of poisoning points used by our attack and the theoretical lower bound. We repeated the same process on the model induced by the KKT attack and again observed gaps between the number of poisoning points and the corresponding lower bound. These results suggest there may still be room for improvement in designing more efficient model-targeted poisoning attacks.

F.2 ATTACKS ON LOGISTIC REGRESSION

In this section, we evaluate our attack on logistic regression models for the Adult, MNIST 1-7 and Dogfish datasets. The convergence guarantee in the paper also holds for logistic regression (more generally, for any Lipschitz and convex loss with a strongly convex regularizer). However, for logistic regression we may not be able to efficiently search for the globally optimal point with maximum loss difference (Line 4 in Algorithm 1), because the difference of two logistic losses is not concave. Therefore, we adopt a gradient-based strategy, using the Adam optimizer (Kingma & Ba, 2014), to search for a point that (approximately) maximizes the loss difference. This is in contrast to the SVM model, where the difference of hinge losses is piecewise linear and we can deploy general (convex) solvers to search for the globally optimal point in each linear segment (Diamond & Boyd, 2016; Inc, 2020). However, as demonstrated below, poisoning points with approximately maximum loss difference can still be very effective. More formally, if the approximate maximum loss difference l found by local optimization techniques is within a constant factor of the globally optimal value l* (i.e., l ≥ α l*, 0 < α < 1), then we still enjoy similar convergence guarantees. A similar issue of global optimality also applies to the KKT attack (Koh et al., 2018), whose attack objective function is no longer convex for logistic regression models; therefore, we also use a gradient-based technique to (approximately) solve that optimization problem. Since the maximum loss difference found for logistic regression models may not be globally optimal, in these experiments we did not compute the lower bound (Theorem 4.2) on the number of poisoning points needed to induce the poisoned model from our attack, which requires obtaining the actual maximum loss difference. Target Classifiers.
The clean accuracies of the logistic regression models on the three datasets are 79.9% on Adult, 98.1% on MNIST 1-7 and 98.5% on Dogfish. Target classifiers for logistic regression are generated similarly to their SVM counterparts on each dataset. For the Adult dataset, the subpopulations are generated exactly as in the SVM case and form a total of 20 subpopulations, where the instances in each subpopulation all belong to the "low-income" group. Among all the subpopulations, we select the three with the highest test accuracy (all have 100% accuracy). The target classifiers are then generated to have 0% accuracy on the subpopulations using the target generation method described in Section 5. On the MNIST 1-7 dataset, target models are generated to have around 5%, 10% and 15% overall test error, and the final test accuracies of the obtained target models are 94.7%, 89.0% and 84.5%. For Dogfish, target models are generated to have around 10%, 20% and 30% overall test error, and the final test accuracies of the resulting models are 89.0%, 79.5% and 67.3%. Results on Adult. Figure 10 shows the effectiveness of our attack on logistic regression models trained on the Adult dataset, using the loss-based distance and the actual model distance in ℓ2-norm. The maximum number of poisoning points for these experiments is obtained when the classifier from Algorithm 1 is 0.05-close to the target classifier. Our attack steadily converges to the target model while the KKT attack fails to converge reliably. Similar observations hold in the other attack settings, as shown in Figure 11; our attack is much more successful than the KKT attack, especially for the attack on Cluster 1. Results on MNIST 1-7. Figure 12 shows the convergence of our attack on logistic regression models trained on MNIST 1-7. The maximum number of poisoning points for these experiments is obtained when the classifier from Algorithm 1 is 0.1-close to the target classifier.
Our attack converges to the target model while the KKT attack diverges.
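The gradient-based inner search for logistic regression described at the start of this section can be sketched as follows. This is a simplified version under our own assumptions: plain gradient ascent stands in for Adam, and the projection onto the feasible set X is elided.

```python
import numpy as np

def logistic_loss(theta, x, y):
    # l(theta; x, y) = log(1 + exp(-y <theta, x>)), with y in {-1, +1}
    return np.log1p(np.exp(-y * (theta @ x)))

def search_poison_point_lr(theta_t, theta_p, dim, y_choices=(-1, 1),
                           steps=100, lr=0.1, seed=0):
    """Approximately maximize l(theta_t; x, y) - l(theta_p; x, y) by
    gradient ascent over x for each candidate label, keeping the best.
    The objective is non-concave, so this only finds a local optimum."""
    rng = np.random.default_rng(seed)
    best, best_val = None, -np.inf
    for y in y_choices:
        x = rng.normal(size=dim)
        for _ in range(steps):
            s_t = 1.0 / (1.0 + np.exp(y * (theta_t @ x)))  # sigmoid(-y<theta_t,x>)
            s_p = 1.0 / (1.0 + np.exp(y * (theta_p @ x)))
            x = x + lr * (-y * s_t * theta_t + y * s_p * theta_p)  # ascent step
        val = logistic_loss(theta_t, x, y) - logistic_loss(theta_p, x, y)
        if val > best_val:
            best, best_val = (x, y), val
    return best, best_val
```

If the value returned is within a constant factor of the global optimum, the convergence guarantee discussed above still applies, which is why an approximate local search suffices in practice.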

