TOWARDS FAIR CLASSIFICATION AGAINST POISONING ATTACKS

Abstract

Fair classification aims to train classification models that achieve equality (in treatment or prediction quality) across different sensitive groups. However, fair classification is at risk of poisoning attacks, which deliberately insert malicious training samples to manipulate the trained classifiers' performance. In this work, we study the poisoning scenario where the attacker can insert a small fraction of samples into the training data, with arbitrary sensitive attributes as well as other predictive features. We demonstrate that fairly trained classifiers can be highly vulnerable to such poisoning attacks, suffering a much worse trade-off between accuracy and fairness, even when we apply some of the most effective defenses (originally proposed for traditional classification tasks). As a countermeasure, we propose a general and theoretically guaranteed framework that adapts traditional defense methods to protect fair classification against poisoning attacks. Extensive experiments validate that the proposed defense framework achieves better robustness, in terms of both accuracy and fairness, than representative baseline methods.

1. INTRODUCTION

Data poisoning attacks (Biggio et al., 2012; Chen et al., 2017; Steinhardt et al., 2017) have raised serious safety concerns for machine learning systems that are trained on data collected from public resources (Konečnỳ et al., 2016; Weller & Romney, 1988). For example, prior studies (Biggio et al., 2012; Mei & Zhu, 2015; Burkard & Lagesse, 2017; Steinhardt et al., 2017) have shown that an attacker can inject only a small fraction of fake data into the training pool of a classification model and severely degrade its accuracy. As countermeasures against data poisoning attacks, defense methods (Steinhardt et al., 2017; Diakonikolas et al., 2019) have been proposed that can successfully identify the poisoning samples and sanitize the training dataset. Recently, in addition to model safety, the community has also paid significant attention to fairness, stressing that machine learning models should provide "equalized" treatment or "equalized" prediction quality among population groups (Hardt et al., 2016; Agarwal et al., 2018; Donini et al., 2018; Zafar et al., 2017). Since fair classification problems typically concern human society, training data is often provided by humans, which gives adversarial attackers easy access to inject malicious data. Therefore, fair classification algorithms are also vulnerable to poisoning attacks. Since fair classification problems have distinct optimization objectives and optimization processes from traditional classification, a natural question is: Can we protect fair classification from data poisoning attacks? In other words, are existing defenses sufficient to defend fair classification models? To answer these questions, we first conduct a preliminary study on the Adult Census dataset to explore whether existing defenses can protect fair classification algorithms (see Section 3.2).
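For concreteness, the "equalized" notions above are usually formalized as constraints on the classifier's predictions conditioned on a sensitive attribute. Two standard examples (demographic parity, and equalized odds in the sense of Hardt et al., 2016), written here in generic notation with sensitive attribute A, true label Y, and prediction Ŷ:

```latex
% Demographic parity: the prediction is independent of the sensitive attribute.
\Pr\left[\hat{Y} = 1 \mid A = a\right] = \Pr\left[\hat{Y} = 1 \mid A = a'\right]
\quad \forall\, a, a'
% Equalized odds: independence holds conditioned on the true label.
\Pr\left[\hat{Y} = 1 \mid A = a, Y = y\right]
  = \Pr\left[\hat{Y} = 1 \mid A = a', Y = y\right]
\quad \forall\, a, a',\ y \in \{0, 1\}
```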
In this work, we focus on representative defense methods including the k-NN Defense (Koh et al., 2021) and SEVER (Diakonikolas et al., 2019). To fully expose their vulnerability to poisoning attacks, we introduce a new attack algorithm, F-Attack, in which the attacker aims to cause the failure of fair classification. In detail, by injecting poisoning samples, the attacker misleads the trained classifier so that it either fails to achieve good accuracy or violates the fairness constraints. From the preliminary results, we find that both the k-NN Defense and SEVER suffer an obvious accuracy or fairness degradation under attack. Moreover, we also compare F-Attack with one of the strongest poisoning attacks, the Min-Max Attack (Steinhardt et al., 2017), which was devised for traditional classification. The results demonstrate that our proposed F-Attack is more effective than the Min-Max Attack. In
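To make the sanitization idea concrete, the following is a minimal sketch of one common reading of k-NN-based data sanitization (a simplified illustration, not the exact procedure of Koh et al., 2021): flag any training point whose label disagrees with the majority label of its k nearest neighbors, on the assumption that isolated, inconsistently labeled points are likely poisoned.

```python
import numpy as np

def knn_defense(X, y, k=5):
    """Flag training points whose label disagrees with the majority label
    of their k nearest neighbors (simplified k-NN sanitization sketch)."""
    n = len(X)
    keep = np.ones(n, dtype=bool)
    # Pairwise squared Euclidean distances between all training points.
    dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    for i in range(n):
        # k nearest neighbors of point i, excluding the point itself.
        order = np.argsort(dists[i])
        neighbors = [j for j in order if j != i][:k]
        # Discard the point if the neighbors' majority label differs.
        majority = np.bincount(y[neighbors], minlength=2).argmax()
        if majority != y[i]:
            keep[i] = False
    return keep

# Toy example: two well-separated clusters, one mislabeled ("poisoned") point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = np.array([0, 0, 1, 1, 1, 1])  # the third point's label conflicts with its cluster
keep = knn_defense(X, y, k=2)
# keep marks the third point for removal and retains the rest.
```

As the paper's preliminary study shows, such distance-based sanitization can fail against F-Attack, since poisoned points with adversarially chosen sensitive attributes need not be outliers in feature space.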

