TOWARDS FAIR CLASSIFICATION AGAINST POISONING ATTACKS

Abstract

Fair classification aims to train classification models that achieve equality (in treatment or prediction quality) across different sensitive groups. However, fair classification is at risk of poisoning attacks, which deliberately insert malicious training samples to manipulate the trained classifier's performance. In this work, we study the poisoning scenario where the attacker can insert a small fraction of samples into the training data, with arbitrary sensitive attributes as well as other predictive features. We demonstrate that fairly trained classifiers can be highly vulnerable to such poisoning attacks, suffering a much worse accuracy and fairness trade-off, even when we apply some of the most effective defenses (originally proposed to defend traditional classification tasks). As a countermeasure for fair classification tasks, we propose a general and theoretically guaranteed framework that adapts traditional defense methods to fair classification against poisoning attacks. Through extensive experiments, we validate that the proposed defense framework achieves better robustness, in terms of both accuracy and fairness, than representative baseline methods.

1. INTRODUCTION

Data poisoning attacks (Biggio et al., 2012; Chen et al., 2017; Steinhardt et al., 2017) have raised serious safety concerns for machine learning systems trained on data collected from public sources (Konečnỳ et al., 2016; Weller & Romney, 1988). For example, studies (Biggio et al., 2012; Mei & Zhu, 2015; Burkard & Lagesse, 2017; Steinhardt et al., 2017) have shown that an attacker can inject only a small fraction of fake data into the training pool of a classification model and severely degrade its accuracy. As countermeasures against data poisoning attacks, defense methods (Steinhardt et al., 2017; Diakonikolas et al., 2019) have been proposed that can successfully identify poisoning samples and sanitize the training dataset. Recently, in addition to model safety, significant attention has also been paid to fairness: machine learning models should provide "equalized" treatment or "equalized" prediction quality across groups of the population (Hardt et al., 2016; Agarwal et al., 2018; Donini et al., 2018; Zafar et al., 2017). Since fair classification problems concern human society, their training data is often provided by humans, which makes it easy for adversarial attackers to inject malicious data. Fair classification algorithms are therefore also prone to poisoning attacks. Since fair classification has distinct optimization objectives and optimization processes from traditional classification, a natural question arises: Can we protect fair classification from data poisoning attacks? In other words, are existing defenses sufficient to defend fair classification models? To answer these questions, we first conduct a preliminary study on the Adult Census dataset to explore whether existing defenses can protect fair classification algorithms (see Section 3.2).
In this work, we focus on representative defense methods, including the k-NN defense (Koh et al., 2021) and SEVER (Diakonikolas et al., 2019). To fully expose their vulnerability to poisoning attacks, we introduce a new attack algorithm, F-Attack, in which the attacker aims to cause the failure of fair classification. In detail, by injecting poisoning samples, the attacker aims to mislead the trained classifier so that it either cannot achieve good accuracy or cannot satisfy the fairness constraints. From the preliminary results, we find that both the k-NN defense and SEVER suffer an obvious accuracy or fairness degradation after the attack. Moreover, we compare F-Attack with one of the strongest poisoning attacks, the Min-Max Attack (Steinhardt et al., 2017), which was devised for traditional classification. The results demonstrate that our proposed F-Attack has a stronger attacking effect than the Min-Max Attack. In this paper, we further propose a defense framework, Robust Fair Classification (RFC), to improve the robustness of fair classification against poisoning attacks. Unlike existing defenses, our method identifies abnormal samples within each individual sensitive subgroup of each class. To achieve this goal, RFC first applies a strategy similar to prior work (Diakonikolas et al., 2017; 2019) to find abnormal data samples that significantly deviate from the distribution of the other (clean) samples. Based on a theoretical analysis, we verify that RFC excludes more poisoning samples than clean samples in each step. Moreover, to avoid removing too many clean samples, we introduce an online-data-sanitization process: in each iteration, we remove a candidate poisoning set from a single subgroup of a single class and test the retrained model's performance on a clean validation set. This helps us locate the subgroup that contains the most poisoning samples.
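The online-data-sanitization loop described above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: the callables `train_fn`, `score_fn`, and `validate_fn` are hypothetical placeholders standing in for fair-classifier retraining, abnormality scoring (e.g., deviation-based scores as in SEVER), and clean-validation evaluation, respectively.

```python
import numpy as np

def online_data_sanitization(groups, train_fn, score_fn, validate_fn, frac=0.05):
    """Hedged sketch of the online-data-sanitization idea.

    groups:      dict mapping a (class, sensitive-group) key to an index array.
    train_fn:    retrains the fair classifier on the kept indices (hypothetical).
    score_fn:    abnormality scores for indices, higher = more suspicious (hypothetical).
    validate_fn: accuracy/fairness score of a model on a clean validation set (hypothetical).
    """
    all_idx = np.concatenate(list(groups.values()))
    model = train_fn(all_idx)
    best_keep, best_val = all_idx, validate_fn(model)
    for key, idx in groups.items():
        # Flag the most abnormal `frac` of this single subgroup of a single class.
        scores = score_fn(model, idx)
        n_drop = max(1, int(frac * len(idx)))
        suspect = idx[np.argsort(scores)[-n_drop:]]
        keep = np.setdiff1d(all_idx, suspect)
        # Retrain without the candidate poisoning set and check a clean validation set.
        candidate = train_fn(keep)
        val = validate_fn(candidate)
        # The removal that most improves clean validation performance points to
        # the subgroup containing the most poisoning samples.
        if val > best_val:
            best_val, best_keep = val, keep
    return best_keep
```

Iterating this loop until validation performance stops improving would realize the "locate, remove, retrain" cycle the text describes.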
Through extensive experiments on two benchmark datasets, Adult Census and COMPAS, we validate the effectiveness of our defense. Our key contributions are summarized as follows:

• We devise a strong attack method to poison fair classification and demonstrate the vulnerability of fair classification, even under the protection of traditional defenses against poisoning attacks.

• We propose an efficient and principled framework, Robust Fair Classification (RFC). Extensive experiments and theoretical analysis demonstrate the effectiveness and reliability of the proposed framework.

2. PROBLEM STATEMENT AND NOTATIONS

In this section, we formally define the setting of our studied problem and the necessary notation.

Fair Classification. In this paper, we focus on classification problems that incorporate group-level fairness criteria. Let $x \in \mathbb{R}^d$ be a random vector denoting the (non-sensitive) features, with a label $y \in \mathcal{Y} = \{Y_1, \dots, Y_m\}$ over $m$ classes and a sensitive attribute $z \in \mathcal{Z} = \{Z_1, Z_2, \dots, Z_k\}$ over $k$ groups. Let $f(x, w)$ represent a classifier with parameters $w \in \mathcal{W}$. Then, a fair classification problem can be defined as:

$$\min_{w} \; \mathbb{E}\big[l(f(x, w), y)\big] \quad \text{s.t.} \quad g_j(w) \le \tau, \; \forall j \in \mathcal{Z} \tag{1}$$

where $\mathbb{E}[l(\cdot)]$ is the expected loss on the test distribution and $\tau$ is the unfairness tolerance. The constraint function $g_j(w) = \mathbb{E}\big[h(w, x, y) \mid z = j\big]$ represents the desired fairness metric for each group $j \in \mathcal{Z}$. For example, in binary classification problems, we use $f(x, w) > 0$ to indicate a positive classification outcome. Then, $h(w, x, y) = \mathbf{1}(f(x, w) > 0) - \mathbb{E}\big[\mathbf{1}(f(x, w) > 0)\big]$ corresponds to equalized positive rates under the equalized-treatment criterion (Mehrabi et al., 2021). Similarly, $h(w, x, y) = \mathbf{1}(f(x, w) > 0 \mid y = \pm 1) - \mathbb{E}\big[\mathbf{1}(f(x, w) > 0 \mid y = \pm 1)\big]$ equalizes the true / false positive rates in equalized odds (Hardt et al., 2016). Given any training dataset $D$, we define the empirical loss function $L(D, w)$ as the average loss of the model, and $g_j(D, w)$ as the empirical fairness constraint function. In our paper, we assume that the clean training samples are drawn from the true distribution $\mathcal{D}$ with density $P(x, y, z)$. We also use $\mathcal{D}_{u,v}$ to denote the distribution of (clean) samples conditioned on $y = Y_u$ and $z = Z_v$, which has density $P(x, y, z \mid y = Y_u, z = Z_v)$.

Poisoning Attack ($\epsilon$-poisoning model). In our paper, we consider poisoning attacks in the following scenario. Given a fair classification task, the poisoned dataset is generated as follows: first, $n$ clean samples $D_C = \{(x_i, y_i, z_i)\}_{i=1}^{n}$ are drawn from $\mathcal{D}$ to form the clean training set. Then, an adversary is allowed to insert an $\epsilon$ fraction of $D_C$ with arbitrary choices $D_P = \{(x_i, y_i, z_i)\}_{i=1}^{\epsilon n}$. We call such a poisoned training set $D_C \cup D_P$ the $\epsilon$-poisoning model.
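As a concrete illustration of the definitions above, the following sketch constructs an $\epsilon$-poisoned dataset and computes the empirical equalized-treatment constraint (a group's positive-rate deviation from the overall positive rate). The synthetic data and the adversary's choice of all-one labels in a single sensitive group are made up for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def positive_rate_gap(y_pred, z):
    """Empirical analogue of max_j |g_j|: the largest deviation of any
    group's positive rate from the overall positive rate."""
    overall = np.mean(y_pred > 0)
    return max(abs(np.mean(y_pred[z == g] > 0) - overall) for g in np.unique(z))

# Clean training set D_C: n samples with features x, label y, sensitive attribute z.
n, eps = 1000, 0.05
x = rng.normal(size=(n, 4))
y = rng.choice([-1, 1], size=n)
z = rng.choice([0, 1], size=n)

# epsilon-poisoning model: the adversary appends eps * n arbitrary samples,
# with arbitrary sensitive attributes (here, adversarially all z = 0, y = +1).
m = int(eps * n)
x_p = rng.normal(loc=3.0, size=(m, 4))
y_p = np.ones(m, dtype=int)
z_p = np.zeros(m, dtype=int)

x_poisoned = np.vstack([x, x_p])       # D_C ∪ D_P has (1 + eps) * n samples
y_poisoned = np.concatenate([y, y_p])
z_poisoned = np.concatenate([z, z_p])
```

A defense would then be judged by whether, after training on `D_C ∪ D_P`, the classifier keeps both its accuracy and `positive_rate_gap` below the tolerance $\tau$.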

3. FAIR CLASSIFICATION IS VULNERABLE TO POISONING ATTACKS

In this section, we first introduce the algorithm of our proposed attack F-Attack under the ϵ-poisoning model. Then, we conduct empirical studies to evaluate the robustness of fair classification algorithms (and popular defenses) against F-Attack and baseline attacks.

