TO BE ROBUST OR TO BE FAIR: TOWARDS FAIRNESS IN ADVERSARIAL TRAINING

Abstract

Adversarial training algorithms have proved to be a reliable way to improve machine learning models' robustness against adversarial examples. However, we find that adversarial training algorithms tend to introduce a severe disparity of accuracy and robustness between different groups of data. For instance, a PGD adversarially trained ResNet18 model on CIFAR-10 has 93% clean accuracy and 67% PGD l∞-8 adversarial accuracy¹ on the class "automobile" but only 59% and 17% on the class "cat". This phenomenon occurs even on balanced datasets and does not appear in naturally trained models that use only clean samples. In this work, we theoretically show that this phenomenon can generally arise under adversarial training algorithms which minimize DNN models' robust errors. Motivated by these findings, we propose a Fair-Robust-Learning (FRL) framework to mitigate this unfairness problem in adversarial defenses, and experimental results validate the effectiveness of FRL.

1. INTRODUCTION

The existence of adversarial examples (Goodfellow et al., 2014; Szegedy et al., 2013) causes huge concerns when applying deep neural networks to safety-critical tasks, such as autonomous driving and face identification (Morgulis et al., 2019; Sharif et al., 2016). These adversarial examples are artificially crafted samples which do not change the semantic meaning of the natural samples, but can misguide the model into giving wrong predictions. As countermeasures against adversarial examples, adversarial training algorithms aim to train classifiers that classify input samples correctly even when they are adversarially perturbed. Namely, they optimize the model to minimize the adversarial risk that a sample can be perturbed to be wrongly classified:

min_f E_{(x,y)} max_{||δ|| ≤ ε} L(f(x + δ), y).

These adversarial training methods (Kurakin et al., 2016; Madry et al., 2017; Zhang et al., 2019b) have been shown to be among the most effective and reliable ways to improve model robustness against adversarial attacks. Although promising for improving model robustness, recent studies show a side effect of adversarial training: it usually degrades the model's clean accuracy (Tsipras et al., 2018). In our work, we find a new intriguing property of adversarial training algorithms: they usually result in a large disparity of accuracy and robustness between different classes. As a preliminary study in Section 2, we apply natural training and PGD adversarial training (Madry et al., 2017) on the CIFAR10 dataset (Krizhevsky et al., 2009) using a ResNet18 (He et al., 2016) architecture. For a naturally trained model, the performance on each class is similar. However, for the adversarially trained model, there is a severe performance discrepancy (in both accuracy and robustness) between different classes.
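To make the inner maximization above concrete, the following sketch runs PGD-style signed-gradient ascent on a toy logistic model. The linear model, loss, step size, radius, and step count here are illustrative assumptions for exposition, not the paper's actual ResNet18/CIFAR-10 setup:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    # Logistic loss for label y in {-1, +1} with linear score s = w·x.
    s = sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-y * s))

def pgd_attack(w, x, y, eps=0.5, alpha=0.1, steps=10):
    """Inner maximization of the min-max objective: find a perturbation
    delta with ||delta||_inf <= eps that increases the loss, by repeated
    signed-gradient ascent steps projected back into the l_inf ball."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        s = sum(wi * (xi + di) for wi, xi, di in zip(w, x, delta))
        # d(loss)/d(delta_i) = -y * sigmoid(-y * s) * w_i, so the
        # scalar factor g is shared across coordinates.
        g = -y * sigmoid(-y * s)
        delta = [max(-eps, min(eps, di + alpha * math.copysign(1.0, g * wi)))
                 for wi, di in zip(w, delta)]
    return delta
```

An adversarial training loop would then take a gradient step on the model parameters using the loss at `x + delta` instead of `x`, realizing the outer minimization.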
For example, the model has high clean accuracy and robust accuracy (93% and 67%, respectively) on samples from the class "automobile", but much poorer performance on "cat" images (59% and 17%, respectively). More preliminary results in Section 2 further show a similar "unfair" phenomenon on other datasets and models. Meanwhile, we find that this fairness issue does not appear in natural models trained on clean data. This demonstrates that adversarial training algorithms can indeed unequally improve model robustness for different data groups and unequally degrade their clean accuracy.
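The class-wise disparity described above can be measured by evaluating accuracy separately per class, on both clean and adversarially perturbed inputs. A minimal sketch of that bookkeeping (the function name and label encoding are our own illustrative choices):

```python
from collections import defaultdict

def per_class_accuracy(labels, predictions):
    """Accuracy computed separately for each class label.

    Comparing the best and worst entries of the returned dict exposes
    the disparity between classes such as "automobile" and "cat";
    running it on clean and on attacked predictions gives the clean
    and robust class-wise accuracies, respectively."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(y == p)
    return {c: correct[c] / total[c] for c in total}
```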

¹ The model's accuracy on input samples that have been adversarially perturbed.

