TOWARDS FAIR CLASSIFICATION AGAINST POISONING ATTACKS

Abstract

Fair classification aims to train classification models that achieve equality (in treatment or prediction quality) among different sensitive groups. However, fair classification is at risk from poisoning attacks, which deliberately insert malicious training samples to manipulate the trained classifier's performance. In this work, we study the poisoning scenario where the attacker can insert a small fraction of samples into the training data, with arbitrary sensitive attributes as well as other predictive features. We demonstrate that fairly trained classifiers can be highly vulnerable to such poisoning attacks, with a much worse accuracy and fairness trade-off, even when we apply some of the most effective defenses (originally proposed for traditional classification tasks). As a countermeasure, we propose a general and theoretically guaranteed framework that adapts traditional defense methods to fair classification against poisoning attacks. Extensive experiments validate that the proposed defense framework obtains better robustness in terms of accuracy and fairness than representative baseline methods.

1. INTRODUCTION

Data poisoning attacks (Biggio et al., 2012; Chen et al., 2017; Steinhardt et al., 2017) have raised serious safety concerns for machine learning systems that are trained on data collected from public resources (Konečný et al., 2016; Weller & Romney, 1988). For example, prior studies (Biggio et al., 2012; Mei & Zhu, 2015; Burkard & Lagesse, 2017; Steinhardt et al., 2017) have shown that an attacker can inject only a small fraction of fake data into the training pool of a classification model and severely degrade its accuracy. As countermeasures against data poisoning attacks, there are defense methods (Steinhardt et al., 2017; Diakonikolas et al., 2019) that can successfully identify the poisoning samples and sanitize the training dataset. Recently, in addition to model safety, fairness has also received significant attention: machine learning models should provide "equalized" treatment or "equalized" prediction quality among population groups (Hardt et al., 2016; Agarwal et al., 2018; Donini et al., 2018; Zafar et al., 2017). Since fair classification problems concern human society, the training data is very likely to be provided by humans, which makes it easy for adversarial attackers to inject malicious data. Therefore, fair classification algorithms are also prone to poisoning attacks. Since fair classification problems have optimization objectives and optimization processes distinct from those of traditional classification, a natural question is: can we protect fair classification from data poisoning attacks? In other words, are existing defenses sufficient to defend fair classification models? To answer these questions, we first conduct a preliminary study on the Adult Census Dataset to explore whether existing defenses can protect fair classification algorithms (see Section 3.2).
In this work, we focus on representative defense methods including k-NN Defense (Koh et al., 2021) and SEVER (Diakonikolas et al., 2019). To fully expose their vulnerability to poisoning attacks, we introduce a new attack algorithm, F-Attack, where the attacker aims to cause fair classification to fail. In detail, by injecting poisoning samples, the attacker aims to mislead the trained classifier such that it either cannot achieve good accuracy or does not satisfy the fairness constraints. From the preliminary results, we find that both k-NN Defense and SEVER suffer an obvious accuracy or fairness degradation under attack. Moreover, we also compare F-Attack with one of the strongest poisoning attacks, Min-Max Attack (Steinhardt et al., 2017), which is devised for traditional classification. The result demonstrates that our proposed F-Attack has a stronger attacking effect than Min-Max Attack. In conclusion, our preliminary study highlights the vulnerability of fair classification to poisoning attacks, especially F-Attack. In this paper, we further propose a defense framework, Robust Fair Classification (RFC), to improve the robustness of fair classification against poisoning attacks. Different from existing defenses, our method aims to scout abnormal samples from each individual sensitive subgroup in each class. To achieve this goal, RFC first applies a strategy similar to that of (Diakonikolas et al., 2017; 2019) to find abnormal data samples that significantly deviate from the distribution of other (clean) samples. Based on a theoretical analysis, we verify that RFC can exclude more poisoning samples than clean samples in each step. Moreover, to avoid removing too many clean samples, we introduce an Online-Data-Sanitization process: in each iteration, we remove a possible poisoning set from a single subgroup of a single class and test the retrained models' performance on a clean validation set.
This helps us locate the subgroup that contains the most poisoning samples. Through extensive experiments on two benchmark datasets, Adult Census Dataset and COMPAS, we validate the effectiveness of our defense. Our key contributions are summarized as follows:
• We devise a strong attack method to poison fair classification and demonstrate that fair classification remains vulnerable to poisoning attacks even under the protection of traditional defenses.
• We propose an efficient and principled framework, Robust Fair Classification (RFC). Extensive experiments and theoretical analysis demonstrate the effectiveness and reliability of the proposed framework.

2. PROBLEM STATEMENT AND NOTATIONS

In this section, we formally define the setting of our studied problem and the necessary notation.

Fair Classification. In this paper, we focus on classification problems that incorporate group-level fairness criteria. First, let $x \in \mathbb{R}^d$ be a random vector denoting the (non-sensitive) features, with a label $y \in \mathcal{Y} = \{Y_1, ..., Y_m\}$ over $m$ classes and a sensitive attribute $z \in \mathcal{Z} = \{Z_1, Z_2, ..., Z_k\}$ over $k$ groups. Let $f(x, w)$ represent a classifier with parameters $w \in \mathcal{W}$. Then, a fair classification problem can be defined as:

$$\min_{w} \; \mathbb{E}\big[l(f(x, w), y)\big] \quad \text{s.t.} \quad g_j(w) \le \tau, \;\; \forall j \in \mathcal{Z} \tag{1}$$

where $\mathbb{E}[l(\cdot)]$ is the expected loss on the test distribution, and $\tau$ is the unfairness tolerance. The constraint function $g_j(w) = \mathbb{E}\big[h(w, x, y) \mid z = j\big]$ represents the desired fairness metric for each group $j \in \mathcal{Z}$. For example, in binary classification problems, we use $f(x, w) > 0$ to indicate a positive classification outcome. Then, $h(w, x, y) = \mathbb{1}(f(x, w) > 0) - \mathbb{E}\big[\mathbb{1}(f(x, w) > 0)\big]$ refers to equalized positive rates in the equalized treatment criterion (Mehrabi et al., 2021). Similarly, $h(w, x, y) = \mathbb{1}(f(x, w) > 0 \mid y = \pm 1) - \mathbb{E}\big[\mathbb{1}(f(x, w) > 0 \mid y = \pm 1)\big]$ equalizes true / false positive rates in equalized odds (Hardt et al., 2016). Given any training dataset $D$, we define the empirical loss $L(D, w)$ as the average loss value of the model on $D$, and $g_j(D, w)$ as the empirical fairness constraint function. In our paper, we assume that the clean training samples are drawn from the true distribution $\mathcal{D}$ with density $\mathcal{P}(x, y, z)$. We also use $\mathcal{D}_{u,v}$ to denote the distribution of (clean) samples with $y = Y_u$ and $z = Z_v$, which has density $\mathcal{P}(x, y, z \mid y = Y_u, z = Z_v)$.

Poisoning Attack (ϵ-poisoning model). In our paper, we consider poisoning attacks under the following scenario. Given a fair classification task, the poisoned dataset is generated as follows: first, $n$ clean samples $D_C = \{(x_i, y_i, z_i)\}_{i=1}^{n}$ are drawn from $\mathcal{D}$ to form the clean training set. Then, an adversary is allowed to insert a set of arbitrarily chosen samples $D_P = \{(x_i, y_i, z_i)\}_{i=1}^{\epsilon n}$, whose size is an $\epsilon$ fraction of $D_C$. We refer to the resulting poisoned training set $D_C \cup D_P$ as the ϵ-poisoning model.
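
As an illustration of the empirical constraint g_j(D, w) under the equalized true positive rate criterion, the following Python sketch computes TPR(z = j) minus the overall TPR for every group from hard predictions. The function name `tpr_constraints`, the toy arrays, and the two-sided check |g_j| ≤ τ are illustrative assumptions and are not taken from the original formulation.

```python
import numpy as np

def tpr_constraints(y_true, y_pred, z):
    """Empirical g_j = TPR(z = j) - TPR(all samples) for each group j."""
    y_true, y_pred, z = map(np.asarray, (y_true, y_pred, z))
    positives = y_true == 1
    overall_tpr = np.mean(y_pred[positives] == 1)
    gaps = {}
    for j in np.unique(z):
        mask = positives & (z == j)
        # TPR inside group j: share of its true positives predicted positive.
        gaps[int(j)] = float(np.mean(y_pred[mask] == 1) - overall_tpr)
    return gaps

# Toy data (hypothetical): 8 samples, two sensitive groups.
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 0])
z      = np.array([0, 0, 0, 0, 0, 1, 1, 1])

gaps = tpr_constraints(y_true, y_pred, z)
tau = 0.05  # unfairness tolerance
# Assumption: a two-sided tolerance check on every group.
is_fair = all(abs(g) <= tau for g in gaps.values())
```

On this toy input the overall TPR is 0.5, group 0 has TPR 0.75 and group 1 has TPR 0, so both groups violate the tolerance.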

3. FAIR CLASSIFICATION IS VULNERABLE TO POISONING ATTACKS

In this section, we first introduce the algorithm of our proposed attack F-Attack under the ϵ-poisoning model. Then, we conduct empirical studies to evaluate the robustness of fair classification algorithms (and popular defenses) against F-Attack and baseline attacks.

3.1. F-ATTACK: POISONING ATTACKS FOR FAIR CLASSIFICATION

Given a specific fair classification task, we consider an attacker who aims to contaminate the training set such that existing algorithms cannot fulfill the fair classification goal. Note that for fair classification tasks, both accuracy and fairness are desired properties, and they are in strong tension in practice (Menon & Williamson, 2018). Therefore, in our attack, we consider misleading the training algorithms such that at least one of the two criteria is unsatisfied. Formally, we define the attacker's objective as Eq. (2), where the attacker inserts a poisoning set $D_P$ of size $\epsilon n$ in the feasible injection space $\mathcal{F}$ to achieve:

$$\max_{D_P \subseteq \mathcal{F}} \; \mathbb{E}\big[l(f(x, w^*), y)\big] \quad \text{s.t.} \quad w^* = \arg\min_{w \in \mathcal{H}_{fair}} L(D_C \cup D_P, w). \tag{2}$$

This means that if the classifier $w^*$, which is trained on $D_C \cup D_P$ and has a low empirical loss $L(D_C \cup D_P, w^*)$, falls in the space $\mathcal{H}_{fair}$, it will have a large expected loss on the test distribution $\mathcal{D}$. Here, $\mathcal{H}_{fair}$ is the space of models (with a limited norm) that satisfy the fairness criteria on the clean distribution $\mathcal{D}$. To take a closer look at Eq. (2), we discuss it case by case. Suppose we obtain $w^*$ by fair classification on the set $D_C \cup D_P$. There are two cases:
1. $w^* \notin \mathcal{H}_{fair}$: The fairness criteria (on the test set) are not satisfied.
2. $w^* \in \mathcal{H}_{fair}$: Since $w^*$ is trained on $D_C \cup D_P$ to have low $L(D_C \cup D_P, w^*)$, $w^*$ will have high test error.
In either case, the model $w^*$ will have either unsatisfactory accuracy or unsatisfactory fairness. Next, we simplify the objective and constraints in Eq. (2) to transform it into a solvable problem.
We first relax the objective in Eq. (2) (similar to (Steinhardt et al., 2017; Koh et al., 2021)):

$$\mathbb{E}\big[l(f(x, w), y)\big] \overset{(i)}{\approx} L(D_C, w) \overset{(ii)}{\le} L(D_C, w) + \epsilon L(D_P, w) = (1 + \epsilon) L(D_C \cup D_P, w)$$

Specifically, approximation (i) holds if the clean training data $D_C$ has sufficient samples and is close to the test distribution $\mathcal{D}$, and the model $w$ is appropriately regularized. The upper bound (ii) holds because loss values are non-negative, and it can be tight if the fraction of poisoning samples $\epsilon$ is small. Thus, we transform Eq. (2) into a bi-level optimization problem between $w$ and $D_P$. If the model $f(\cdot)$ and loss $l(\cdot)$ are convex, we can further swap them to get a min-max form:

$$\min_{w \in \mathcal{H}_{fair}} \; \max_{D_P \subseteq \mathcal{F}} \; L(D_C \cup D_P, w). \tag{3}$$

Our proposed F-Attack solves Eq. (3), as shown in Algorithm 1. It solves a saddle point problem by alternately finding the worst attack points $(x, y, z)$ w.r.t. the current model and then updating the model in the direction of the attack point. In detail, in Step (1), given the current model $w$, we solve the inner maximization problem to maximize $L(D_C \cup D_P, w)$. Up to the constant factor $1 + \epsilon$ from the relaxation above, this is equal to finding the sample $(x, y, z)$ with the maximal loss:

$$\max_{D_P \subseteq \mathcal{F}} L(D_C \cup D_P, w) = L(D_C, w) + \epsilon \cdot \max_{(x, y, z) \in \mathcal{F}} l(f(x, w), y). \tag{4}$$

Adult Census Dataset. In this subsection, we conduct an experiment on the Adult Census Dataset (Kohavi et al., 1996) to test whether F-Attack can poison fair classification methods and whether existing defense methods can resist F-Attack. Here, we focus on the fairness criterion Equalized True Positive Rate (TPR) (Hardt et al., 2016) between the genders, and we apply the constrained optimization method of (Donini et al., 2018) to train linear classifiers that fulfill the fair classification objective. It is worth mentioning that this dataset contains many categorical features, such as marital-status and occupation. For simplicity, we pre-process the dataset by transforming the categorical features into a continuous space that is spanned by the first 15 principal directions of the training (categorical) data. More details of the pre-processing procedure can be found in Appendix A.2.

Defense Methods. Besides naïve fair classification, we mainly consider two representative data-sanitization defenses, which are popular existing methods to defend against poisoning attacks:
• k-NN Defense (Koh et al., 2021). This method removes the samples that are far from their k nearest neighbors. In detail, the k-NN defense calculates the "abnormal" score as $q_i = \|x_i - x_i^{(k,y)}\|_2$, where $x_i^{(k,y)}$ is the k-th nearest neighbor to sample $x_i$ in class $y$. In this paper, we set $k = 5$.
• SEVER (Diakonikolas et al., 2019). This method aims to find abnormal samples by tracing abnormal gradients. In each iteration, we first train a fair classifier $f$ (with fixed $\tau$), calculate the gradient of the loss w.r.t. the weight $w$ for each training sample $(x_i, y_i)$, and get the normalized gradient matrix $Q = \big[\nabla_w l(f(x_i, w), y_i) - \frac{1}{n} \sum_{j=1}^{n} \nabla_w l(f(x_j, w), y_j)\big]_{i=1,..,n}$. SEVER flags the samples with a large "abnormal" score $q_i = (Q_i \cdot v)^2$ as abnormal samples, where $v$ is the top right singular vector of $Q$. Intuitively, the "abnormal" samples contribute heavily to the variation of the gradient matrix $Q$, which suggests that their gradients deviate significantly from the gradients of other samples.

Results. In our experiments, we insert 10% poisoning samples into the training set, and define the feasible injection set $\mathcal{F}$ to be $\{(x, y, z) : \|x - \mu_y\| \le d\}$, where $d$ is a fixed radius. This constrains the inserted samples to stay close to the center of their labeled class, so that they can evade potential defenses. Since $\mathcal{F}$ does not depend on the sensitive attribute $z$, during F-Attack we generate poisoning samples with $z$ fixed to 0 (female) or 1 (male). During fairness training, we train multiple models with various hyperparameters to control the unfairness tolerance on the training set (following (Lamy et al., 2019)).
Then, we report the test performance of the model that has the best validation performance (which considers both accuracy and fairness, see Section 4, Eq. (9) for more details). In Table 1, we report the performance of the defense methods. From the result, we can see that all training methods have a significant performance degradation under F-Attack. For example, under F-Attack (z = 0), the SEVER defense has a ≈ 4% accuracy drop and a 2% fairness drop. This suggests that defenses such as SEVER and k-NN can be highly vulnerable to poisoning attacks in fair classification. Moreover, we compare F-Attack with a baseline attack method, Min-Max (Steinhardt et al., 2017; Koh et al., 2021), which is one of the strongest attacks for traditional classification. It also solves Eq. (3) but does not constrain $w \in \mathcal{H}_{fair}$. From Table 1, we can see that Min-Max has worse attacking performance than F-Attack, causing a smaller performance degradation. This result highlights the threat of F-Attack to fair classification.

Discussion. To better understand the behavior of F-Attack, in Figure 1 we visualize the clean samples and poisoning samples (via F-Attack and Min-Max Attack) in a 2-dimensional projected space (via PCA). From the figure, we can see that, compared to Min-Max Attack (red), the samples obtained by F-Attack (yellow) have a smaller distance to the clean samples in their labeled class y = 1, although both are constrained to the same feasible injection set $\mathcal{F}$. This is because F-Attack aims to find samples with maximal loss (Eq. (4)) for fair classifiers, so the generated samples do not have the maximal loss for traditional classifiers. Thus, the poisoning samples from F-Attack are closer to their labeled class. This fact helps explain why F-Attack is more insidious than Min-Max Attack under the detection of traditional defenses such as SEVER.
In Appendix A.4, we further conduct experiments on the Adult Census Dataset with a neural-network-based classification model, which demonstrate that the idea of F-Attack has the potential to be adapted to DNN models.
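
The two "abnormal" scores used by the baseline defenses in this section can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the helper names (`knn_scores`, `logistic_grads`, `sever_scores`) are hypothetical, the gradients come from a plain logistic loss on a linear model rather than from the fair training procedure used in the experiments, and the zero weight vector is only a placeholder for a trained model.

```python
import numpy as np

def knn_scores(X, y, k=5):
    """q_i: Euclidean distance from x_i to its k-th nearest neighbor in its class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    scores = np.zeros(len(X))
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        Xc = X[idx]
        dists = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=2)
        dists.sort(axis=1)  # column 0 is the zero distance of a point to itself
        scores[idx] = dists[:, min(k, len(idx) - 1)]
    return scores

def logistic_grads(X, y, w):
    """Per-sample gradients of the logistic loss, linear model, labels in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return (p - y)[:, None] * X

def sever_scores(grads):
    """q_i = (Q_i . v)^2 with Q the centered gradients, v the top right singular vector."""
    Q = grads - grads.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(Q, full_matrices=False)
    return (Q @ vt[0]) ** 2

# Toy data (hypothetical): 50 clean points plus one far-away point at index 50.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [[10.0, 10.0]]])
y = np.array([0] * 25 + [1] * 26)
w = np.zeros(2)  # placeholder weights, not a trained fair classifier

q_knn = knn_scores(X, y, k=5)
q_sever = sever_scores(logistic_grads(X, y, w))
```

Both scores single out the far-away point in this toy case. The point of Section 3 is that samples generated by F-Attack stay close to their labeled class, so such class-level scores separate them less clearly.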

4. ROBUST FAIR CLASSIFICATION (RFC)

Motivated by studies in Section 3, new defenses are desired to protect fair classification against poisoning attacks, especially against F-Attack. In this section, we first introduce a novel defense framework called Robust Fair Classification (RFC), and we provide a theoretical study to further understand the mechanism of RFC. In Section 5, we conduct empirical studies to validate the robustness of RFC in practice.

4.1. ROBUST FAIR CLASSIFICATION (RFC)

Figure 2: PCA visualization in (1, 0).

Based on the discussion in Section 3, F-Attack can evade traditional defenses such as SEVER and k-NN Defense because the generated poisoning samples are close to the clean data of their labeled class. However, we hypothesize that they may deviate from the distribution of clean samples in their labeled subgroup (the data distribution $\mathcal{D}_{y,z}$ given $y$ and $z$). Figure 2 shows the location of poisoning samples generated via F-Attack (z = 0) and clean samples in subgroup (y = 1, z = 0) in the 2D projected space. It suggests that the poisoning samples greatly contaminate the information about the distribution $\mathcal{P}(x, y, z \mid y = 1, z = 0)$ in the training data. Thus, the injected poisoning samples will not only confuse the original prediction task from $x$ to $y$, but also greatly disturb the fairness constraints (Eq. (1)). This observation motivates us to propose a new defense method that can scout abnormal samples from each individual subgroup in each class. Next, we introduce the details of our proposed defense RFC.

Centered Data Matrix & Poisoning Score. Our method shares a similar high-level idea with (Diakonikolas et al., 2017; 2019): find data points that systematically deviate from the distribution of other (clean) samples. Specifically, given a (poisoned) dataset $D$, we repeatedly scout the poisoning samples from each subgroup $D(x, y, z \mid y = Y_u, z = Z_v)$, where we use $(u, v)$ to denote the index of each class and subgroup. In the later parts, we use $D_{u,v}$ to denote the samples in $D(x, y, z \mid y = Y_u, z = Z_v)$ for simplicity. Then, we define:

$$Q^{u,v} = \Big[\, x_i - \frac{1}{n_{u,v}} \sum_{j=1}^{n_{u,v}} x_j \,\Big]_{(x_i, y_i, z_i) \in D_{u,v}} \tag{5}$$

to be the centered data matrix of the training samples $D_{u,v}$, where $n_{u,v}$ is the size of the set $D_{u,v}$. For each $Q^{u,v}$, the top right singular vector $V^{u,v}$ of $Q^{u,v}$ is the direction that explains the most variation of the data distribution in $(Y_u, Z_v)$. Similar to the studies in Diakonikolas et al. (2017; 2019), we conjecture that the poisoning samples deviate from the clean samples, so they are mainly responsible for the variation of the data matrix $Q^{u,v}$. In this way, they will align strongly with the direction of $V^{u,v}$. Thus, we define the Poisoning Score $q^{u,v}$ for each training sample $(x_i, y_i, z_i)$ in $D_{u,v}$ as:

$$q^{u,v}(x_i) = Q^{u,v}_i \cdot (V^{u,v})^T. \tag{6}$$

Notably, a poisoning sample may point in either the same or the opposite direction as the top right singular vector $V^{u,v}$, but all poisoning samples should share the same direction. Thus, in our method, we define two Proposed Poisoning Sets for each $(u, v)$, so that one of the two sets is likely to contain poisoning samples:

$$P^{u,v}_{+} = \{x_i \mid q^{u,v}(x_i) > \gamma_{+},\; q^{u,v}(x_i) > 0\}; \qquad P^{u,v}_{-} = \{x_i \mid -q^{u,v}(x_i) > \gamma_{-},\; q^{u,v}(x_i) < 0\}. \tag{7}$$

In Eq. (7), we set $\gamma_{+}$ (or $\gamma_{-}$) to be the q-th (q = 90) percentile of all Poisoning Scores (or negative Poisoning Scores) in $D_{u,v}$, so that each proposed poisoning set only contains a small portion of $D_{u,v}$. In practice, we repeatedly test whether removing the proposed poisoning sets helps improve the retrained model's performance (both accuracy and fairness) on a clean validation set. This helps decide whether a proposed poisoning set contains poisoning samples. In Section 4.2, we further conduct a theoretical analysis to show that the poisoning samples are more likely to have higher poisoning scores.

Fair Classification by Excluding Poisoning Set. Finally, given a (poisoned) training set $D$ as well as the proposed poisoning sets, we can keep retraining fair classifiers by excluding the proposed poisoning sets:

$$\begin{aligned} w_{+} &= \arg\min_{w \in \mathcal{W}} L(f(X, w), Y; D \setminus P^{u,v}_{+}) \quad \text{s.t.} \quad g_j(D \setminus P^{u,v}_{+}; w) \le \tau_j, \;\; \forall j \in \mathcal{Z} \\ w_{-} &= \arg\min_{w \in \mathcal{W}} L(f(X, w), Y; D \setminus P^{u,v}_{-}) \quad \text{s.t.} \quad g_j(D \setminus P^{u,v}_{-}; w) \le \tau_j, \;\; \forall j \in \mathcal{Z} \end{aligned} \tag{8}$$

In practice, our proposed RFC method repeatedly proposes potential poisoning sets for each individual subgroup in each class $D_{u,v}$ until it finds the best poisoning set among all choices. Algorithm 2 details the procedure of RFC, which is an Online Data Sanitization process. Specifically, during each iteration of RFC, for each $D_{u,v}$, we first calculate the poisoning scores and the proposed poisoning sets (Steps (1) and (2)). Then, we remove the proposed poisoning set from the dataset $D$ and conduct fair classification on the remaining data (Step (3)). In Step (4), we evaluate the retrained classifier on a separate clean validation set, and in Step (5) we find the best proposed poisoning set, which results in the highest validation performance. Notably, we measure the validation performance by considering both the accuracy and fairness criteria, by defining:

$$\text{ValScore} = \Pr\big(f(x, w) = y\big) - \sum_{j \in \mathcal{Z}} \lambda \cdot \big(g_j(D_{Val}, w) - \tau\big)_{+}, \tag{9}$$

where $\tau$ is an unfairness tolerance threshold and $\lambda$ is a positive number (we set $\lambda = 3$ in this paper). The second term penalizes the models if some subgroups' unfairness violation is over $\tau$. Finally, we remove the proposed poisoning set that results in the highest ValScore and conduct the next round of searching (Step (6)).

Algorithm 2: Robust Fair Classification (RFC) - An Online Data Sanitization Algorithm
Input: An ϵ-poisoning model with dataset $D = \{(x_i, y_i, z_i)\}_{i=1,2,...,n}$, iterations $T$ of RFC.
Output: A fair classifier.
while $t \le T$ do
    for $Y_u \in \mathcal{Y}$, $Z_v \in \mathcal{Z}$ do
        1. Get the Centered Data Matrix $Q^{u,v}$ and Poisoning Score $q^{u,v}_i$ following Eq. (5) and Eq. (6).
        2. Get the Proposed Poisoning Sets $P^{u,v}_{+}$ and $P^{u,v}_{-}$ following Eq. (7).
        3. Conduct fair classification by removing the Proposed Poisoning Set, via Eq. (8), and get $w^{u,v}_{\pm}$.
        4. Record the performance to get ValScore on a separate (clean) validation set for each $w^{u,v}_{\pm}$.
    end
    5. Get the best proposed poisoning set $P^{*}$, which achieves the highest ValScore across all $u, v$ and $P^{u,v}_{\pm}$.
    6. Remove the best proposed poisoning set from $D$ and set $D = D \setminus P^{*}$.
end

It is also worth mentioning that the RFC framework can potentially be extended to various model architectures, such as Deep Neural Networks (DNNs), for robust fair classification. For example, we can apply Energy-based Out-of-Distribution Detection (Liu et al., 2020) to find the abnormal samples in each $D_{u,v}$. Then, we follow a similar procedure as RFC to propose poisoning sets and conduct fair classification. We leave the study of DNNs for future exploration.
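
The scoring steps of RFC (Eq. (5) to Eq. (7)) and the validation score of Eq. (9) can be sketched in NumPy as follows. This is a minimal sketch, not the released implementation: the function names are hypothetical, the thresholds are taken as the 90th percentile of all scores (respectively all negated scores) in the subgroup, which is one reading of the text, and the fair retraining of Eq. (8) is omitted, so `val_score` takes the accuracy and the per-group unfairness values as given inputs.

```python
import numpy as np

def poisoning_scores(X_group):
    """Eq. (5)-(6): project the centered subgroup data on its top right singular vector."""
    Q = X_group - X_group.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(Q, full_matrices=False)
    return Q @ vt[0]

def proposed_sets(scores, pct=90):
    """Eq. (7): index sets P_plus and P_minus for one subgroup."""
    gamma_plus = np.percentile(scores, pct)    # assumption: percentile over all scores
    gamma_minus = np.percentile(-scores, pct)  # assumption: percentile over negated scores
    p_plus = np.where((scores > gamma_plus) & (scores > 0))[0]
    p_minus = np.where((-scores > gamma_minus) & (scores < 0))[0]
    return p_plus, p_minus

def val_score(accuracy, unfairness, tau, lam=3.0):
    """Eq. (9): accuracy minus lam times the total violation above tau."""
    violation = np.maximum(np.asarray(unfairness, dtype=float) - tau, 0.0)
    return accuracy - lam * violation.sum()

# Toy subgroup (hypothetical): 90 clean points and 10 shifted points at indices 90..99.
rng = np.random.default_rng(0)
clean = rng.normal(size=(90, 3))
poison = 6.0 + 0.1 * rng.normal(size=(10, 3))
X_group = np.vstack([clean, poison])

scores = poisoning_scores(X_group)
p_plus, p_minus = proposed_sets(scores, pct=90)
example_val = val_score(0.8, [0.10, 0.02], tau=0.05, lam=3.0)
```

Because the sign of a singular vector is arbitrary, the shifted points may land in either `p_plus` or `p_minus`. This is why both sets are proposed and the choice between them is left to the validation score.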

4.2. THEORETICAL ANALYSIS

In this subsection, we conduct a theoretical analysis to help understand the behavior of RFC, especially the role that poisoning scores play in finding poisoning samples. In particular, we consider a simple theoretical setting where the clean samples from each group $D_{u,v}$ follow a distinct Gaussian distribution $\mathcal{D}_{u,v} \sim \mathcal{N}(\mu_{u,v}, \Sigma_{u,v})$, with center and covariance matrix $(\mu_{u,v}, \Sigma_{u,v})$. In the following theorem, we show that when there are (poisoning) samples that deviate from the clean samples of $D_{u,v}$, by having a center that is far from the center of the clean samples, they will have larger squared poisoning scores (Eq. (6)) than clean samples. Thus, the proposed poisoning sets (Eq. (7)) are likely to contain more poisoning samples than clean samples. For simplicity, we use $\mathcal{D}$ and $\mathcal{N}(\mu, \Sigma)$ to denote the clean distribution of a given group $D_{u,v}$.

Theorem 4.1. Suppose that a set of "clean" samples $S_{good}$ of size $n$ is sampled i.i.d. from the distribution $\mathcal{N}(\mu, \Sigma)$, where $\Sigma \preceq \sigma^2 I$. There is a set of "bad" samples $S_{bad}$ with size $n_p = n/K$, $K > 1$, and center $\mu_p$ satisfying $\|\mu_p - \mu\|_2 = d$, $d = \gamma \cdot \sigma$. Then, the average squared poisoning scores of clean samples and bad samples satisfy:

$$\mathbb{E}_{i \in S_{bad}}\, q^2(x_i) - \mathbb{E}_{i \in S_{good}}\, q^2(x_i) \ge \Big( \frac{K-1}{K+1} \cdot \gamma^2 - (K+1) \Big) \sigma^2$$

Theorem 4.1 suggests that the difference between the average (squared) poisoning scores of $S_{bad}$ and $S_{good}$ is controlled by $\gamma$ and the sample ratio $K$. Since $K > 1$, if $\gamma^2 > \frac{(K+1)^2}{K-1}$ (which means the poisoning samples are sufficiently far from the clean distribution), the difference is positive. Thus, removing samples with the highest positive (or lowest negative) poisoning scores (as in Eq. (7)) helps eliminate more poisoning samples than clean samples. If $\gamma^2$ is small, the poisoning samples are close to the true distribution, so they have limited influence on the model performance.
The detailed proof of Theorem 4.1 is deferred to Appendix A.1. In the RFC algorithm, we check each proposed poisoning set in turn and see whether removing it helps improve the retrained models' performance. This also avoids removing too many clean samples. In Appendix A.5, we provide further discussion of the detection precision of RFC, which demonstrates that RFC does not remove too many clean samples in practice.
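
As a numerical sanity check of Theorem 4.1, the following sketch simulates one instance of the setting and compares the empirical score gap with the stated lower bound. The specific values (n = 2000, K = 4, σ = 1, γ = 6, dimension 5), the isotropic covariance, and the choice to draw the "bad" samples from a Gaussian with the same covariance around the shifted center are all assumptions made for illustration; the theorem itself only constrains the center of the bad samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, sigma, gamma, dim = 2000, 4, 1.0, 6.0, 5
d = gamma * sigma

# Clean samples from N(0, sigma^2 I); bad samples centered at distance d from 0.
good = rng.normal(0.0, sigma, size=(n, dim))
shift = np.zeros(dim)
shift[0] = d
bad = shift + rng.normal(0.0, sigma, size=(n // K, dim))

# Squared poisoning scores on the mixed set, as in Eq. (5)-(6).
X = np.vstack([good, bad])
Q = X - X.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(Q, full_matrices=False)
q2 = (Q @ vt[0]) ** 2

gap = q2[n:].mean() - q2[:n].mean()
bound = ((K - 1) / (K + 1) * gamma**2 - (K + 1)) * sigma**2
threshold = (K + 1) ** 2 / (K - 1)  # gamma^2 must exceed this for a positive bound
```

With these values the bound evaluates to 16.6 and the positivity threshold on γ² is 25/3 ≈ 8.33, well below γ² = 36, so the empirical gap is expected to sit above the bound.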

5.1. EXPERIMENTAL SETUP

In this section, we conduct comprehensive experiments to validate the effectiveness of our proposed attack and defense on two benchmark datasets, Adult Census Dataset and COMPAS Dataset. In this part, we only consider Equalized True Positive Rate (TPR) between different sensitive subgroups, which is optimized via the fair classification method of (Donini et al., 2018). When applying (Donini et al., 2018), we train multiple models with various hyperparameters to control the unfairness tolerance on the training set (following (Lamy et al., 2019)). Then, we report the test performance of the model that has the best validation performance (which considers both accuracy and fairness, see Section 4, Eq. (9)). In Appendix A.3, we provide additional results for a different type of fairness, "Equalized Treatment", and a different fair classification method (Zafar et al., 2017). The implementation can be found at https://anonymous.4open.science/r/f_attack-4017/.

Attacks. We consider that the training set can be contaminated by Label Flipping (Paudice et al., 2018) and Sensitive Attribute Flipping (Wang et al., 2020). We also consider the attack methods Min-Max and F-Attack, which are introduced in Section 3. Notably, for each method, we assume the poisoning samples are constrained to the same feasible injection set $\mathcal{F} = \{(x, y, z) : \|x - \mu_y\| \le d\}$, which limits the poisoning samples' distance to the class center. Thus, for Min-Max and F-Attack, we assign the generated samples a pre-defined sensitive attribute z = 0 or z = 1. Furthermore, we introduce an additional attack method, "F-Attack*", which uses the same algorithm as F-Attack but a different feasible injection set: $\mathcal{F} = \{(x, y, z) : \|x - \mu_{y,z}\| \le d\}$, where $\mu_{y,z}$ is the center of the group $D_{y,z}$. Because this feasible injection set depends on the sensitive attribute $z$, we do not need to pre-define $z$ during F-Attack*.
This attack is designed to test the robustness of RFC, because the major goal of RFC is to find samples in each group $D_{y,z}$ that are far from $\mu_{y,z}$. Thus, F-Attack* may be able to evade RFC by constraining the poisoning samples' distance to $\mu_{y,z}$. In Appendix A.3, we also report the performance of all attacks and defenses under different choices of the radius $d$.

Baseline Defenses. To validate the effectiveness of RFC, we include the following baseline defense methods: (1) the naive method, which does not apply any defense strategy; (2) SEVER (Diakonikolas et al., 2019), which is a representative defense for traditional classification tasks. We apply (Donini et al., 2018) on the dataset sanitized by SEVER. In addition, we also include (3) Roh et al. (2020), which is a method to defend fair classification against label flipping attacks and leverages the adversarial training strategy of Zhang et al. (2018); and (4) the method of (Wang et al., 2020), which applies Distributionally Robust Optimization (DRO) to improve robustness when labels and sensitive attributes are contaminated. For baseline methods, we report their performance with the choice of hyperparameter that achieves the optimal ValScore (Eq. (9)) on a clean validation set.
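
The feasible injection sets used by the attacks in this section are Euclidean balls around a class center or a group center. One simple way to keep a candidate poisoning point inside such a set is to project it onto the ball, as sketched below. The use of projection is an assumption for illustration (the text does not state how the constraint is enforced), and `project_to_ball`, `class_centers`, and the toy data are hypothetical.

```python
import numpy as np

def project_to_ball(x, center, d):
    """Project x onto the set {x : ||x - center||_2 <= d}."""
    x, center = np.asarray(x, dtype=float), np.asarray(center, dtype=float)
    offset = x - center
    norm = np.linalg.norm(offset)
    if norm <= d:
        return x  # already feasible
    return center + offset * (d / norm)

def class_centers(X, y):
    """mu_y: mean feature vector of each class."""
    return {int(c): X[y == c].mean(axis=0) for c in np.unique(y)}

# Toy data (hypothetical): two classes with two points each.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
y = np.array([0, 0, 1, 1])
mu = class_centers(X, y)

candidate = np.array([21.0, 0.0])  # raw attack point labeled y = 0
x_poison = project_to_ball(candidate, mu[0], d=9.0)
```

For the F-Attack* variant, the same projection would be applied with the group center µ_{y,z} in place of the class center µ_y.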

5.2. EXPERIMENTAL RESULTS

Adult Census Dataset. We first show the results on the Adult Census Dataset in Table 2. To ensure that the comparison between different defenses is fair, we use a balanced clean training dataset where each class has an equal number of samples, since baseline methods such as SEVER can be affected by class imbalance. On this dataset, we set the desired fairness criterion to be $|\mathrm{TPR}(z = 0) - \mathrm{TPR}(z = 1)| < 0.05$. In Table 2, we also mark (in brown) the cases where the algorithms output models with much poorer fairness than desired. From Table 2, we can see that RFC achieves good accuracy and fairness across different types of dataset contamination. In particular, under strong attacks such as F-Attack, the accuracy and fairness are only slightly degraded after injecting poisoning samples. In contrast, baseline methods such as Wang et al. (2020) and Roh et al. (2020) show a clear performance (especially accuracy) degradation under F-Attack. Notably, the attack method F-Attack* has similar attacking performance to F-Attack (z = 1). This is because on this dataset, F-Attack* also generates samples that have z = 1.

COMPAS Dataset. On the COMPAS dataset (Brennan et al., 2009), we consider the same type of fairness criterion, Equalized TPR. In this dataset, equity is desired among races, which are "Caucasian (z = 0), African-American (z = 1) and Hispanic (z = 2)", and we follow a preprocessing procedure similar to that for the Adult Census Dataset. In this dataset, the number of samples in the group "Hispanic" is much smaller than in the other two groups. Thus, we only consider injecting poisoning samples into z = 0 or z = 1. In Table 3, we report the performance of our studied attacks and defenses, and we use $1 - \max_{j \in \mathcal{Z}} |\mathrm{TPR}(z = j) - \overline{\mathrm{TPR}}|$ to measure the "goodness" of fairness, where $\overline{\mathrm{TPR}}$ is the averaged TPR over the whole dataset.
During training, we set the desired fairness criterion to be $\max_{j \in \mathcal{Z}} |\mathrm{TPR}(z = j) - \overline{\mathrm{TPR}}| \le 0.15$. From the results in Table 3, we can see that RFC is the only method that consistently preserves the model accuracy and fairness after poisoning samples are injected into the dataset.

6. RELATED WORKS

In this section, we introduce related work and discuss how this work differs from prior studies.

Poisoning Attacks. Data poisoning attacks (Biggio et al., 2012) refer to the scenario where models are threatened by adversaries who insert malicious training samples in order to take control of the trained model's behavior (Li et al., 2020; Shafahi et al., 2018). In this work, we concentrate on untargeted poisoning attacks (Biggio et al., 2012; Koh et al., 2021), where the attacker aims to degrade the overall performance of the trained model. To defend against poisoning attacks, well-established methods (Wilcox, 2011; Rubinstein et al., 2009; Steinhardt et al., 2017; Diakonikolas et al., 2019; Tao et al., 2021; Wang et al., 2021b) have been proposed that work efficiently and effectively in various scenarios. This paper is within the scope of linear classification problems, and we leave the study of DNN models for future work.

Fair Classification. Fairness issues have recently drawn much attention from the machine learning community. Fairness issues for common classification problems can be generally divided into two categories: (1) Equalized treatment (Zafar et al., 2017) (or "Statistical Rate"); and (2) Equalized prediction quality (Hardt et al., 2016). For classification models to satisfy these fairness criteria, popular methods including (Zafar et al., 2017; Donini et al., 2018; Agarwal et al., 2018) solve constrained optimization problems, and (Zhang et al., 2018) apply the adversarial training (Madry et al., 2017) method.

Comparison to Prior Works. There are recent works that test the robustness of fair classification methods by manipulating their training set. They also propose possible strategies to defend against the perturbations. For example, the works (Wang et al., 2020; Lamy et al., 2019; Celis et al., 2021a; b) (Paudice et al., 2018).
As countermeasures against such perturbations, representative works such as (Roh et al., 2020) propose an adversarial training framework (Zhang et al., 2018) that trains the model to distinguish clean samples from poisoning samples while preserving model fairness. The work (Wang et al., 2020) solves robust optimization problems by assigning soft sensitive attributes. In our work, in terms of attack, we consider a stronger attacker who can insert carefully computed features and sensitive attributes to fully exploit the vulnerability of fairness training methods.

7. CONCLUSION

In this work, we study the problem of poisoning attacks on fair classification. We propose a strong attack method that can evade most existing defenses. Then, we propose an effective strategy that greatly improves the robustness of fair classification methods. In the future, we aim to examine whether our findings generalize to other machine learning tasks and to other machine learning models, such as Deep Neural Networks (DNNs).

and the difference between two averaged scores:

$$\mathbb{E}_{i \in S_{bad}} - \mathbb{E}_{i \in S_{good}} \ge \frac{K-1}{K+1} \cdot d^2 - (K+1)\,\sigma^2 = \left(\frac{K-1}{K+1} \cdot \gamma^2 - (K+1)\right)\sigma^2$$

A.2 MORE EXPERIMENTAL DETAILS

In this part, we provide additional experimental details, such as the pre-processing procedure.

Adult Census Dataset. In this dataset, we have 5 numerical features, "age, education-num, hours-per-week, capital-loss and capital-gain", and we also use categorical features such as "workclass, education, marital-status, occupation, relationship, race". For simplicity, we first transform the categorical features into dummy variables and conduct Principal Component Analysis (PCA) to project them onto the space spanned by the first 15 principal components. After projection, we normalize all 20 features by centering and standardizing. On this dataset, when running the Min-Max and F-Attack attacks, we set the radius of the feasible injection set to d = 9.0. In Appendix A.3, we provide empirical results for another choice of d, namely d = 6.0.

COMPAS Dataset. In this dataset, we have the numerical features "age, age_cat, juv_fel_count, juv_misd_count, juv_other_count, priors_count, days_b_screening_arrest, decile_score, c_jail_in, c_jail_out". We use "c_jail_out - c_jail_in" to get the number of days in jail and then exclude c_jail_out and c_jail_in.
We also have the categorical features c_charge_degree and sex, and we use PCA to find the first two principal directions. Then, we standardize each feature. On this dataset, when running the Min-Max and F-Attack attacks, we set the radius of the feasible injection set to d = 9.0. In Appendix A.3, we provide empirical results for another choice of d, namely d = 6.0.
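Two of the pre-processing steps above, deriving the number of days in jail and standardizing a feature, can be sketched with the Python standard library as follows. The timestamp format, the use of fractional days, and the population standard deviation are assumptions made for illustration; the helper names `days_in_jail` and `standardize` are hypothetical. The PCA step is not shown.

```python
from datetime import datetime
from statistics import mean, pstdev


def days_in_jail(c_jail_in, c_jail_out, fmt="%Y-%m-%d %H:%M:%S"):
    """Number of days between jail entry and release (c_jail_out - c_jail_in).

    Assumption: both values are timestamp strings in the given format.
    """
    delta = datetime.strptime(c_jail_out, fmt) - datetime.strptime(c_jail_in, fmt)
    return delta.total_seconds() / 86400.0


def standardize(column):
    """Center a feature column and scale it to unit (population) variance."""
    mu = mean(column)
    sd = pstdev(column)
    if sd == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mu) / sd for v in column]


# Toy usage with made-up values.
jail_days = days_in_jail("2014-01-01 00:00:00", "2014-01-03 12:00:00")
z_scores = standardize([1.0, 2.0, 3.0])
```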

A.3 MORE EXPERIMENTAL RESULTS

In this part, we provide additional empirical results to validate the effectiveness of our attack and defense methods. In detail, we consider more settings: (1) a different type of fairness criterion, such as "Equalized Treatment"; (2) a different fair classification method, such as (Zafar et al., 2017); and (3) a different choice of the radius d of the feasible injection space, such as d = 6.0. Notably, in all experiments in the main paper, we fix d = 9.0. In this part, we provide results for a smaller value, d = 6.0, because a larger d, i.e., d > 9.0, makes the poisoning samples easier for most defense methods to detect.

For the fairness criterion under "Equalized Treatment" (Equalized Positive Rate (PR)) on the Adult Census Dataset, we set the desired fairness criterion to be $|\mathrm{PR}(z = 0) - \mathrm{PR}(z = 1)| < 0.2$. For the fairness criterion under "Equalized TPR" on the Adult Census Dataset, we set the desired fairness criterion to be $|\mathrm{TPR}(z = 0) - \mathrm{TPR}(z = 1)| < 0.05$. For the fairness criteria under "Equalized TPR" and "Equalized Treatment" on the COMPAS Dataset, we set the desired fairness criteria to be $\max_{j \in Z} |\mathrm{TPR}(z = j) - \mathrm{TPR}| \le 0.15$ and $\max_{j \in Z} |\mathrm{PR}(z = j) - \mathrm{PR}| \le 0.15$.

A.4 GENERALIZING F-ATTACK TO DNN-BASED FAIR LEARNING

In the paper, we have mainly discussed linear models and convex optimization problems. In this part, we argue that our method and conclusions are general and representative, and that they can be readily extended to Deep Neural Networks (DNNs). To show this point, we focus on a representative and popular fair learning method for DNNs, called Learning Fair Representations (LFR) (Zemel et al., 2013; Edwards & Storkey, 2015; Beutel et al., 2017; Wang et al., 2019). Then, we propose a modified version of F-Attack, called F-Attack-DNN (FAD), which inherits the design of Learning Fair Representations (LFR).
The goal of learning fair representations is to remove sensitive information from the learned representations, so that the sensitive information will not interfere with the model prediction. In this paper, we focus on the method of (Edwards & Storkey, 2015) to achieve the goal of LFR. It is an adversarial framework that learns the target classification task without the ability to predict sensitive information. Many fairness learning methods for DNN tasks follow a similar strategy to learn fair representations for images (Wang et al., 2019) and texts (Sarafianos et al., 2019). Figure 3 illustrates the overview of the framework. In detail, a feature extractor $h(\cdot)$ first extracts latent information from the input. Then, two prediction heads $f_y(\cdot)$ and $f_z(\cdot)$ are attached to the extracted feature space to predict the label y and the sensitive attribute z, respectively. To ensure there is no sensitive information in the learned representation $h(\cdot)$, the training objective of LFR can be expressed as:

$$\min_{f} \max_{f_z} \; \mathbb{E}\left[\, l(f(x), y) - \alpha \cdot l(f_z(h(x)), z) \,\right]$$

Here, the function $f(\cdot) = f_y(h(\cdot))$ is the overall classification model. This objective resembles the idea of GANs (Goodfellow et al., 2020): in the inner maximization, $f_z(\cdot)$ is updated to minimize $l(f_z(h(x)), z)$, which aims to extract helpful information from $h(x)$ to predict the sensitive attribute z; in the outer minimization, the model $f(\cdot)$ and the feature extractor $h(\cdot)$ are updated to break $f_z(\cdot)$'s ability to predict z, while preserving the ability to predict y. In this way, the model is trained to eliminate sensitive information from the extracted features.

F-Attack-DNN (FAD). Based on the design of LFR, we propose F-Attack-DNN to generate poisoning samples to degrade the performance of LFR.
It shares a similar overall optimization objective with Eq. (3) in Section 3, which solves a min-max problem:

$$\min_{f \in \mathcal{H}_{fair}} \max_{D_P \subseteq \mathcal{F}} L(D_C \cup D_P, w)$$

To solve this problem for a DNN-based model f, we propose the algorithm in Algo. 3, called F-Attack-DNN (FAD). Specifically, in the inner maximization, FAD first finds the sample that maximizes the loss value under the current model f, which is similar to Eq. (4) in Section 3. Notably, to solve this optimization problem for a DNN-based model, we can apply well-known methods such as Projected Gradient Descent (Madry et al., 2017). Then, Step (2) and Step (3) imitate the (re-)training process of LFR to update the model f. In detail, Step (2) updates the classifier $f_z$ to predict the sensitive attribute z from the learned feature $h(x)$; Step (3) solves the outer minimization for Eq. (11). This ensures that f falls into the space $\mathcal{H}_{fair}$ and has a minimized training error. Notably, the only difference between F-Attack (Algo. 1) and F-Attack-DNN (Algo. 3) is how the outer minimization problem is solved, that is, how the fair classification model f is updated. In other words, for various fair learning schemes, we can adaptively modify the objective of F-Attack to accommodate the targeted fair learning method. Next, we show experimental results to demonstrate the effectiveness of FAD.

Experimental Results. To demonstrate the effectiveness of our proposed attack, we again focus on the Adult Census Dataset and consider the fairness criterion of Equalized True Positive Rate (TPR) between genders. On this dataset, we consider a two-layer Multi-Layer Perceptron (MLP) for feature extraction and classification. The sensitive attribute discriminator $f_z(\cdot)$ is attached to the penultimate layer of the MLP. In our experiments, we insert various fractions (from 0% to 40%) of poisoning samples into the training set, and define the feasible injection set $\mathcal{F}$ to be $\{(x, y, z) : \|x - \mu_y\| \le d\}$, where d = 9.
In Table 10, we report the models' accuracy and goodness of fairness for the models obtained via LFR without any defenses. Notably, we report the average model performance (over 10 runs) at the checkpoint that achieves the best validation performance (see Eq. (9) in Section 4). From the results, we find that FAD achieves much stronger attack effectiveness than the baseline attacks, such as label flip, attribute flip, and the Min-Max Attack (MM) (Steinhardt et al., 2017). Here, MM denotes the attack algorithm similar to Algo. 3, but without Step 2 and the term involving $f_z$ in Step 3. This experimental result suggests the generality of F-Attack and its potential to extend to various fairness training schemes. We leave the study of fair learning for other DNN applications, such as the image and text domains, to future investigation.
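When the inner maximization is solved with projected gradient steps, each candidate point has to be mapped back into the feasible injection set. The following minimal sketch shows one such projection, assuming the norm in the definition of the feasible set is the Euclidean norm and that $\mu_y$ is the centroid of class y; the function name `project_to_feasible` is illustrative.

```python
import math


def project_to_feasible(x, mu_y, d):
    """Project x onto the ball {x : ||x - mu_y||_2 <= d}.

    Assumption: the feasible set uses the Euclidean norm around the
    class centroid mu_y. Points already inside the ball are unchanged.
    """
    diff = [xi - mi for xi, mi in zip(x, mu_y)]
    norm = math.sqrt(sum(v * v for v in diff))
    if norm <= d:
        return list(x)
    scale = d / norm  # shrink the offset so that it has length exactly d
    return [mi + scale * v for mi, v in zip(mu_y, diff)]


# Toy usage with d = 9, as in the experiments above.
outside = project_to_feasible([12.0, 0.0], [0.0, 0.0], 9.0)
inside = project_to_feasible([3.0, 4.0], [0.0, 0.0], 9.0)
```

A point at distance 12 from the centroid is pulled back to distance 9 along the same direction, while a point at distance 5 is returned as is.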

A.5 THE PRECISION OF THE DETECTION OF RFC

To gain a deeper understanding of the defense method RFC, we examine the poisoning scores calculated for all training samples in a poisoned Adult Census Dataset. In Figure 4, we visualize the distribution of poisoning scores for clean samples, as well as for poisoning samples from 4 versions of F-Attack. Specifically, we report the scores for each subgroup separately and use red dots to denote the poisoning samples. From this figure, it is easy to see that the poisoning samples always appear at the tail of their labeled subgroup, with (almost) the lowest poisoning scores. Note that at each iteration, our proposed defense method RFC removes only a small portion of samples from a particular subgroup, namely those with the highest or lowest poisoning scores. As a result, RFC will not remove too many clean samples before it finds the poisoning samples. Besides, one interesting fact is that, on this Adult Census Dataset, F-Attack prefers to add poisoning samples with y = 1.

ϵ      Acc.   Fair.   Acc.   Fair.   Acc.   Fair.   Acc.   Fair.   Acc.   Fair.   Acc.   Fair.
0%     0.797  0.960   0.797  0.960   0.797  0.960   0.797  0.960   0.797  0.960   0.797  0.960
10%    0.781  0.972   0.795  0.960   0.774  0.970   0.778  0.960   0.757  0.936   0.775  0.954
20%    0.783  0.972   0.798  0.958   0.776  0.960   0.740  0.963   0.731  0.945   0.713  0.952
30%    0.766  0.967   0.795  0.958   0.724  0.977   0.692  0.967   0.676  0.947   0.644  0.965
40%    0.741  0.971   0.797  0.936   0.701  0.971   0.668  0.987   0.655  0.924   0.647  0.961
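The per-iteration removal described in this section, where only a small portion of one subgroup's tail is discarded, can be sketched as follows. This is an illustration rather than the RFC procedure itself: the poisoning scores are taken as given, and the fraction, the choice of tail, and the function name `trim_subgroup_tail` are hypothetical.

```python
def trim_subgroup_tail(scores, fraction, side="low"):
    """Return the indices kept after trimming one tail of a subgroup.

    scores   : poisoning scores of the samples in a single subgroup
    fraction : share of the subgroup to remove in this iteration
    side     : "low" removes the lowest scores, "high" the highest
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    n_remove = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    if side == "low":
        removed = set(order[:n_remove])
    elif side == "high":
        removed = set(order[len(order) - n_remove:])
    else:
        raise ValueError("side must be 'low' or 'high'")
    return [i for i in range(len(scores)) if i not in removed]


# Toy scores (made up): index 1 sits far out in the lower tail.
scores = [0.5, -3.0, 0.1, 2.0, -0.2, 0.3, 0.0, 0.4, -0.1, 0.2]
kept_low = trim_subgroup_tail(scores, 0.2, side="low")
kept_high = trim_subgroup_tail(scores, 0.2, side="high")
```

Because only a fixed small share is removed per call, an outlying sample in the tail is discarded early while most of the subgroup is left untouched.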



We report the "goodness of fairness" as 1 - Unfair, i.e., $1 - |\mathrm{TPR}(z = 0) - \mathrm{TPR}(z = 1)|$, in Table 1.



C ∪ D_P, w) → $\min_{w \in \mathcal{H}_{fair}} \max_{D_P \subseteq \mathcal{F}} L(D_C \cup D_P, w)$

Algorithm of F-Attack-DNN (FAD)
Input: Clean data $D_C$, number of poisoned samples $\epsilon n$, feasible set $\mathcal{F}$, $\alpha > 0$, $\eta > 0$, warm-up steps $n_{burn}$.
Output: A poisoning set $D_P^*$
for $t = 1, \dots, n_{burn} + \epsilon n$ do
    1. Solve inner maximization: $(x, y, z) = \arg\max_{(x,y,z) \in \mathcal{F}} l(f(x, w), y)$ and let $D_P = \epsilon n \times \{(x, y, z)\}$
    2. Update the discriminator: $f_z = f_z - \eta \cdot \frac{1}{|D_C \cup D_P|} \sum_{(x,y,z) \in D_C \cup D_P} \nabla_{f_z}\, l(f_z(h(x)), z)$
    3. Solve outer minimization: $f = f - \eta \cdot \frac{1}{|D_C \cup D_P|} \sum_{(x,y,z) \in D_C \cup D_P} \nabla_{f} \left[\, l(f(x), y) - \alpha \cdot l(f_z(h(x)), z) \,\right]$
    if $t > n_{burn}$ then
        $D_P^* = D_P^* \cup \{(x, y, z)\}$
    end
end
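The control flow of the algorithm above, in particular the warm-up phase during which poisoning points are computed but not yet collected, can be sketched as a short loop. The three steps are passed in as callables because their concrete form depends on the model; the names `fad_loop`, `solve_inner`, `update_discriminator` and `update_model` are placeholders for illustration, and the stand-ins in the usage part do no actual optimization.

```python
def fad_loop(n_burn, n_poison, solve_inner, update_discriminator, update_model):
    """Skeleton of the FAD iteration with n_burn warm-up steps.

    solve_inner()           -> one poisoning point (x, y, z)   (Step 1)
    update_discriminator(p) -> one update of f_z on D_C and p  (Step 2)
    update_model(p)         -> one update of f on D_C and p    (Step 3)
    """
    poison_set = []
    for t in range(1, n_burn + n_poison + 1):
        point = solve_inner()
        candidate = [point] * n_poison   # D_P: eps * n copies of the point
        update_discriminator(candidate)
        update_model(candidate)
        if t > n_burn:                   # collect only after the warm-up
            poison_set.append(point)
    return poison_set


# Toy usage with counting stand-ins for the three steps.
calls = {"inner": 0, "disc": 0, "model": 0}

def solve_inner():
    calls["inner"] += 1
    return ("x%d" % calls["inner"], 1, 0)

def update_discriminator(candidate):
    calls["disc"] += 1

def update_model(candidate):
    calls["model"] += 1

result = fad_loop(3, 2, solve_inner, update_discriminator, update_model)
```

With 3 warm-up steps and 2 poisoning samples, the loop runs 5 times and keeps only the points produced in the last 2 iterations.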

Figure 4: The Poisoning Score calculated by RFC for samples from each subgroup.

In Step (2), we update w to minimize $L(D_C \cup D_P, w)$. Note that in Step (2), we should also constrain the model w to fall into the fair model space $\mathcal{H}_{fair}$. Thus, when we update w in Step (2), we also penalize the fairness violation of w. Here, we calculate the fairness violation as $(g_j(D_C, w) - \tau)_+$ (with weight parameter $\lambda > 0$) on the clean set $D_C$ to approximate the fairness violation on the real data D.
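As a small illustration of this penalty, the sketch below adds the hinge term $(g_j - \tau)_+$, weighted by $\lambda$, to a training loss. Summing the hinge terms over the fairness constraints j is an assumption made here for illustration, and the function names `fairness_penalty` and `penalized_loss` are hypothetical.

```python
def fairness_penalty(violations, tau, lam):
    """lam * sum_j max(g_j - tau, 0) for the measured violations g_j.

    Assumption: the hinge terms of all fairness constraints are summed.
    """
    return lam * sum(max(g - tau, 0.0) for g in violations)


def penalized_loss(training_loss, violations, tau, lam):
    """Training loss plus the weighted fairness penalty."""
    return training_loss + fairness_penalty(violations, tau, lam)


# Toy usage: only the first constraint exceeds the threshold tau = 0.15.
value = penalized_loss(0.4, [0.25, 0.05], tau=0.15, lam=2.0)
```

Constraints whose violation stays below $\tau$ contribute nothing, so the penalty only acts on a model that leaves the fair model space.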

F-Attack vs. Min-Max on Adult Census Dataset

RFC & Baseline Methods' Performance against Poisoning Attacks on Adult Census Dataset.

RFC & Baseline Methods' Performance against Poisoning Attacks on COMPAS Dataset.

consider injecting naturally / adversarially generated noise only on sensitive attributes. Another line of researches Roh et al. (2020); Wang et al. (2021a) considers the vulnerability of fairness training to (coordinated) label-flipping attacks

RFC & Baseline Methods' Performance on Adult Census Dataset, for Equalized Treatment

RFC & Baseline Methods' Performance on Adult Census Dataset using (Zafar et al., 2017).

RFC & Baseline Methods' Performance on Adult Census Dataset when d = 6

RFC & Baseline Methods' Performance on COMPAS Dataset using (Zafar et al., 2017).

RFC & Baseline Methods' Performance on COMPAS Dataset when d = 6

F-Attack-DNN (FAD) vs. Baseline Attacks on Adult Census Dataset

A APPENDIX

A.1 PROOF OF THEOREM

In this part, we provide the detailed proof of Theorem 4.1 in Section 4.

Theorem A.1 (Recall Theorem 4.1). Suppose a set of "clean" samples $S_{good}$ of size n is sampled i.i.d. from the distribution $\mathcal{N}(\mu, \Sigma)$, where $\Sigma \preceq \sigma^2 I$. There is a set of "bad" samples $S_{bad}$ of size $n_p = n/K$, $K > 1$, with center $\mu_p$ satisfying $\|\mu_p - \mu\|_2 = d$, $d = \gamma \cdot \sigma$. Then, the average squared poisoning scores of clean samples and bad samples have the relationship:

Proof. We denote by $S_{good}$ the set of clean samples and by $S_{bad}$ the set of bad samples; their union forms the whole set S. In the later part, to distinguish between the centers of each set, we use $\mu_g$, $\mu_p$ and $\mu_s$ to denote the centers of the clean samples, the bad samples and the whole set, respectively.

First, it is easy to see that $\mu_g$, $\mu_p$ and $\mu_s$ lie on the same line. Thus, given $n_g : n_p = K : 1$, we have the relationship of center distance:

In the following, we study the squared poisoning score in the whole group S. Let V be the top right singular vector of the centered data matrix of S. Then, for any unit vector $v'$:

Since V is the top right singular vector of the centered data matrix $X_s - \mu_s$, we choose $v' = (\mu_p - \mu_s)/\|\mu_p - \mu_s\|$, which is the unit vector that has the same direction as $(\mu_p - \mu_s)$. We get:

For the first term on the right-hand side of the inequality above:

because the poisoning scores are all positive. Then, we have:

The first term is non-negative because the matrix $(X_g - \mu_g)(X_g - \mu_g)^T$ is positive semi-definite; the second term follows because $v'$ has the same direction as $(\mu_p - \mu_s)$ (since $\mu_p$, $\mu_g$ and $\mu_s$ lie on the same line). Therefore, we get the average poisoning score of the whole set S:

In the following, we calculate the average squared poisoning score in the good set $S_{good}$.

Based on the previous calculations of the average scores of the whole set and the good set, we can get the average squared poisoning score in the bad set $S_{bad}$:
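The lower bound on the difference between the two averaged scores, $\left(\frac{K-1}{K+1}\gamma^2 - (K+1)\right)\sigma^2$, is only informative when it is positive, which requires $\gamma > (K+1)/\sqrt{K-1}$. The following sketch evaluates the bound and this threshold for given values; the function names and the numbers K = 9 and γ = 4 are chosen purely for illustration and do not come from the experiments.

```python
import math


def separation_lower_bound(K, gamma, sigma):
    """((K - 1) / (K + 1) * gamma^2 - (K + 1)) * sigma^2, for K > 1."""
    if K <= 1:
        raise ValueError("the theorem assumes K > 1")
    return ((K - 1) / (K + 1) * gamma ** 2 - (K + 1)) * sigma ** 2


def min_gamma_for_positive_bound(K):
    """Smallest gamma at which the bound reaches zero: (K + 1) / sqrt(K - 1)."""
    if K <= 1:
        raise ValueError("the theorem assumes K > 1")
    return (K + 1) / math.sqrt(K - 1)


# Illustrative values only.
bound = separation_lower_bound(K=9, gamma=4.0, sigma=1.0)
threshold = min_gamma_for_positive_bound(K=9)
```

For K = 9 the threshold is about 3.54, so γ = 4 yields a positive bound while γ = 3 does not.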

