TO BE ROBUST OR TO BE FAIR: TOWARDS FAIRNESS IN ADVERSARIAL TRAINING

Abstract

Adversarial training algorithms have been shown to be reliable for improving machine learning models' robustness against adversarial examples. However, we find that adversarial training algorithms tend to introduce a severe disparity in accuracy and robustness between different groups of data. For instance, a PGD adversarially trained ResNet18 model on CIFAR-10 has 93% clean accuracy and 67% PGD l ∞ -8 adversarial accuracy 1 on the class "automobile" but only 59% and 17% on the class "cat". This phenomenon occurs on balanced datasets and does not exist in naturally trained models which only use clean samples. In this work, we theoretically show that this phenomenon can generally arise under adversarial training algorithms that minimize DNN models' robust errors. Motivated by these findings, we propose a Fair-Robust-Learning (FRL) framework to mitigate this unfairness problem when performing adversarial defenses, and experimental results validate the effectiveness of FRL.

1. INTRODUCTION

The existence of adversarial examples (Goodfellow et al., 2014; Szegedy et al., 2013) causes huge concerns when applying deep neural networks to safety-critical tasks, such as autonomous driving and face identification (Morgulis et al., 2019; Sharif et al., 2016). These adversarial examples are artificially crafted samples which do not change the semantic meaning of the natural samples, but can mislead the model into giving wrong predictions. As countermeasures against adversarial examples, adversarial training algorithms aim to train classifiers that classify input samples correctly even when they are adversarially perturbed. Namely, they minimize the adversarial risk that a sample can be perturbed to be wrongly classified:

min_f E_(x,y) max_{||δ|| ≤ ε} L(f(x + δ), y).

These adversarial training methods (Kurakin et al., 2016; Madry et al., 2017; Zhang et al., 2019b) have been shown to be among the most effective and reliable ways to improve model robustness against adversarial attacks. Although promising for improving robustness, recent studies show a side effect of adversarial training: it usually degrades the model's clean accuracy (Tsipras et al., 2018). In our work, we find a new intriguing property of adversarial training algorithms: they usually result in a large disparity of accuracy and robustness between different classes. As a preliminary study in Section 2, we apply natural training and PGD adversarial training (Madry et al., 2017) on the CIFAR10 dataset (Krizhevsky et al., 2009) using a ResNet18 (He et al., 2016) architecture. For a naturally trained model, the performance in each class is similar. However, for the adversarially trained model, there is a severe performance discrepancy (in both accuracy and robustness) between data in different classes.
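The inner maximization in the objective above is typically approximated with iterated projected gradient steps (PGD). The following is a minimal sketch on a toy numpy logistic-regression model, our own illustrative construction: real adversarial training applies the same loop to a DNN with automatic differentiation.

```python
import numpy as np

def pgd_linf(x, y, w, b, eps=0.3, alpha=0.05, steps=10):
    """Untargeted PGD attack on a logistic-regression model sigmoid(w.x + b).

    Approximates max_{||delta||_inf <= eps} L(f(x + delta), y) by repeated
    signed-gradient ascent steps, each projected back into the eps-ball.
    """
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))   # model probability
        grad = np.outer(p - y, w)                     # d(BCE loss)/d(input)
        x_adv = x_adv + alpha * np.sign(grad)         # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)      # project into l_inf ball
    return x_adv
```

Adversarial training then simply fits the model on `pgd_linf(...)` outputs instead of the clean `x` at each optimization step.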
For example, the model has high clean and robust accuracy (93% and 67%, respectively) on samples from the class "automobile", but much poorer performance on "cat" images (59% and 17%, respectively). More preliminary results in Section 2 show a similar "unfair" phenomenon on other datasets and models. Meanwhile, we find that this fairness issue does not appear in natural models trained on clean data. This demonstrates that adversarial training algorithms can indeed unequally improve model robustness for different data groups and unequally degrade their clean accuracy. In this work, we first define this problem as the unfairness problem of adversarial training algorithms.

If this phenomenon happens in real-world applications, it can raise huge concerns about safety or even social ethics. Imagine that an adversarially trained traffic sign recognizer has high overall robustness, but is very inaccurate and vulnerable to perturbations for some specific signs, such as stop signs. The safety of this autonomous driving car is still not guaranteed: in such a case, the safety of the recognizer depends on its worst-class performance. Therefore, in addition to achieving good overall performance, it is also essential to achieve fair accuracy and robustness among different classes, which guarantees the worst-case performance. This problem may also lead to issues from a social-ethics perspective, similar to traditional ML fairness problems (Buolamwini & Gebru, 2018). For example, a robustly trained face identification system might provide different levels of service safety for different ethnic communities. In this paper, we first explore the potential reasons behind this unfair accuracy / unfair robustness problem.
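Since the argument above hinges on worst-class rather than average performance, a simple way to quantify the disparity is the gap between average and worst per-class accuracy. The sketch below is our own minimal formulation (the function names are hypothetical, not from the paper), applicable to both clean and adversarial predictions:

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Accuracy of predictions broken down by ground-truth class."""
    accs = np.zeros(num_classes)
    for c in range(num_classes):
        mask = (y_true == c)
        accs[c] = np.mean(y_pred[mask] == c) if mask.any() else np.nan
    return accs

def worst_class_gap(y_true, y_pred, num_classes):
    """Gap between the average and the worst per-class accuracy:
    one simple scalar measure of the disparity discussed above."""
    accs = per_class_accuracy(y_true, y_pred, num_classes)
    return np.nanmean(accs) - np.nanmin(accs)
```

Feeding it predictions on PGD-perturbed inputs gives the robustness disparity; feeding it predictions on clean inputs gives the accuracy disparity.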
In particular, we aim to answer the question: "Do adversarial training algorithms naturally cause unfairness problems, such as a disparity of clean accuracy and adversarial robustness between different classes?" To answer this question, we first propose a conceptual example under a mixture of two spherical Gaussian distributions, which resembles the setting of previous work (Tsipras et al., 2018) but with different variances for the two classes. In this setting, we hypothesize that adversarial training tends to use only robust features for model prediction, whose dimension is much lower than that of the non-robust feature space. In this lower-dimensional space, an optimal linear model is more sensitive to the inherent distributional difference between the classes and becomes biased when making predictions. Motivated by these empirical and theoretical findings, we then propose a Fair Robust Learning (FRL) framework to mitigate this unfairness issue. It is inspired by the traditional debiasing strategy of solving a series of cost-sensitive classification problems, but makes specific effort to achieve the fairness goal in the adversarial setting. Our main contributions can be summarized as follows: (a) we discover the "unfairness" phenomenon of adversarial training algorithms and conduct empirical studies showing that this problem is general; (b) we build a conceptual example to theoretically investigate the main reasons causing this unfairness problem; and (c) we propose a Fair Robust Learning (FRL) framework to mitigate the unfairness issue in the adversarial setting.

From Figure 1, we can observe that for naturally trained models, every class has similar clean accuracy (around 90 ± 5%) and adversarial accuracy (close to 0%) under the PGD attack. This suggests that naturally trained models do not have a strong disparity in either clean or robust performance among classes. However, for adversarially trained models (under PGD adversarial training or TRADES), the disparity becomes severe.
For example, a PGD-adversarially trained model has only 59.1% clean accuracy and 17.4% adversarial accuracy on samples in the class "cat", while achieving around 93% clean accuracy and 67% adversarial accuracy on the class "automobile".
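The conceptual example from the introduction, a mixture of two Gaussians with different variances, can be simulated in a few lines. The sketch below is our own one-dimensional toy construction (not the paper's exact setup): with a symmetric linear decision boundary, the higher-variance class is misclassified more often, reproducing the per-class accuracy disparity.

```python
import numpy as np

# Two classes y in {-1, +1} with symmetric means but unequal variances.
# A linear decision boundary at 0 (optimal for equal variances) then
# serves the high-variance class worse than the low-variance one.
rng = np.random.default_rng(0)
mu, sigma_plus, sigma_minus = 1.0, 0.5, 2.0
n = 100_000
x_plus = rng.normal(+mu, sigma_plus, n)    # class +1 samples
x_minus = rng.normal(-mu, sigma_minus, n)  # class -1 samples

acc_plus = np.mean(x_plus > 0)    # per-class accuracy for class +1
acc_minus = np.mean(x_minus < 0)  # per-class accuracy for class -1
```

Analytically, the two accuracies are Φ(μ/σ₊) ≈ 0.98 and Φ(μ/σ₋) ≈ 0.69, so the disparity here comes purely from the distributional difference, with no adversary involved; adversarial training amplifies this sensitivity by restricting the model to a lower-dimensional robust feature space.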



1 The model's accuracy on input samples that have been adversarially perturbed.



2. PRELIMINARY STUDIES

In this section, we present our preliminary studies showing that adversarial training algorithms usually exhibit unfairness issues, namely a strong disparity of clean accuracy and robustness among different classes. We implement algorithms including PGD adversarial training (Madry et al., 2017) and TRADES (Zhang et al., 2019b) on the CIFAR10 dataset (Krizhevsky et al., 2009). On CIFAR10, we both naturally and adversarially train ResNet18 (He et al., 2016) models. In Figure 1, we list the models' accuracy and robustness performance (under PGD attack with intensity 4/255 and 8/255) for each individual class.

Figure 1: Clean and adversarial accuracy in each class of the CIFAR10 dataset, from a naturally trained ResNet18 model (left), a PGD-adversarially trained model (middle) and a TRADES model (right). The trained models' robustness is evaluated by untargeted PGD attacks under the l ∞ -norm, constrained by 8/255 and 4/255.
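As a minimal illustration of evaluating robustness at several attack intensities such as 4/255 and 8/255, the sketch below computes exact l∞ robust accuracy for a toy binary linear classifier: for a linear model, the worst-case perturbation in the eps-ball shrinks each signed margin by exactly eps·||w||₁, so no iterative attack is needed. This is our own toy construction; deep models like ResNet18 require an iterative attack such as PGD instead.

```python
import numpy as np

def linear_robust_acc(x, y, w, b, eps):
    """Exact l_inf robust accuracy of a binary linear classifier
    sign(w.x + b) with labels y in {-1, +1}."""
    margin = y * (x @ w + b)
    # a point survives every perturbation in the eps-ball iff its
    # margin exceeds the worst-case margin reduction eps * ||w||_1
    return np.mean(margin > eps * np.sum(np.abs(w)))

# Evaluate at the two attack intensities used in Figure 1.
w, b = np.array([1.0, -1.0]), 0.0
x = np.array([[2.0, -2.0], [-2.0, 2.0], [0.02, 0.0]])  # last point sits near the boundary
y = np.array([1, -1, 1])
for eps in (4 / 255, 8 / 255):
    print(f"eps={eps:.4f}  robust acc={linear_robust_acc(x, y, w, b, eps):.3f}")
```

Grouping the same margin test by class label gives the per-class robust accuracies plotted in the figure.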

